So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117. It just happens that this wheel doesn't work on EC2 g5 instances, surprisingly. Either I'll hardcode the correct wheel or I'll upgrade torch to 1.13.0
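For reference, this is roughly how I'd hardcode it in the task's requirements (just a sketch; the exact version pin and index URL are my assumptions, and clearml-agent may still override the torch entry with its own resolution):

    # requirements.txt (hypothetical pin)
    --extra-index-url https://download.pytorch.org/whl/cu117
    torch==1.13.0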
I think we should switch back, and have a configuration to control which mechanism the agent uses, wdyt?
That sounds great!
@<1537605940121964544:profile|EnthusiasticShrimp49> I'll try setting the cuda version in clearml.conf, thanks for the tip!
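For anyone else reading, the override I mean would look roughly like this in clearml.conf (a sketch; I'm assuming the agent.cuda_version / agent.cudnn_version keys, and the values may need adjusting for your setup):

    # ~/clearml.conf (sketch)
    agent {
        # force the CUDA version the agent uses when resolving torch wheels
        cuda_version: 11.7
        cudnn_version: 8.0
    }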
@<1523701205467926528:profile|AgitatedDove14> Could you please push the code for that version on github?
AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc
I opened an issue ( https://github.com/pytorch/ignite/issues/2343 ) in ignite's repo and a PR ( https://github.com/pytorch/ignite/pull/2344 ), could you please have a look? There might be a bug in clearml Task.init in distributed envs
So probably only the main process (rank=0) should attach the ClearMLLogger?
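Something like this is what I have in mind, only a sketch (the ClearMLLogger import path and the RANK env var are assumptions, and the names are made up):

    import os

    # import path may differ between ignite versions (assumption)
    from ignite.contrib.handlers.clearml_logger import ClearMLLogger

    def maybe_create_clearml_logger():
        # only the main process (rank 0) attaches the ClearMLLogger;
        # RANK is the env var set by the torch.distributed launcher (assumption)
        if int(os.environ.get("RANK", "0")) != 0:
            return None
        return ClearMLLogger(project_name="distributed-test", task_name="rank0-only")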
AgitatedDove14 I was able to redirect the logger by doing so:

    clearml_logger = Task.current_task().get_logger().report_text
    early_stopping = EarlyStopping(...)
    early_stopping.logger.debug = clearml_logger
    early_stopping.logger.info = clearml_logger
    early_stopping.logger.setLevel(logging.DEBUG)
Just tested locally, in the terminal it's the same: with the hack it works, without the hack it doesn't show the logger messages
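A slightly cleaner variant of the same hack, using a standard logging handler instead of replacing the logger methods (still just a sketch, with the same assumption that Task.current_task() returns the running task):

    import logging

    from clearml import Task

    class ClearMLReportHandler(logging.Handler):
        # forwards standard logging records to the ClearML console log
        def emit(self, record):
            Task.current_task().get_logger().report_text(self.format(record))

    # early_stopping is the EarlyStopping handler from the snippet above
    early_stopping.logger.addHandler(ClearMLReportHandler())
    early_stopping.logger.setLevel(logging.DEBUG)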
(I didn't have this problem so far because I was using ssh keys globally, but now I want to switch to git auth using a Personal Access Token for security reasons)
Notice the last line should not have --docker
Did you mean --detached?
I also think we need to make sure we monitor all agents (this is important, as it is the trigger to spin down the instance)
That's what I thought, yeah. No problem, it was rather a question; if I encounter the need for that, I will adapt and open a PR
(by console you mean in the dashboard right? or the terminal?)
I finally found a workaround using the cache, will detail the solution in the issue
AgitatedDove14 So in the EarlyStopping class ( https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping ) I see that some info messages are logged (in the __call__ function), and I would like to have these messages logged by clearml
See my answer in the issue - I am not using docker
both are repos for python modules (one is the experiment itself, the other is a dependency of the experiment)
Yes, that's what it looks like. Somehow when you clone the experiment repo, you correctly set the git creds in the url, but when the dependencies are installed, the git creds are not taken into account
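If it helps, the workaround I'd try on the agent side is a git URL rewrite so the dependency installs also pick up the token (a sketch; GIT_TOKEN is a placeholder, and I'm assuming the dependencies are cloned over https from GitHub):

    # run inside the agent environment / container
    git config --global url."https://${GIT_TOKEN}@github.com/".insteadOf "https://github.com/"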
Would be very cool if you could include this use case!
AgitatedDove14 yes but I don't see in the docs how to attach it to the logger of the EarlyStopping handler
No idea, I also would have expected it to be automatically logged as console output
with my hack yes, without, no
The only thing that changed is the new auth.fixed_users.pass_hashed field, which I don't have in my config file
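For context, the fixed users block I'm referring to looks roughly like this in the server config (from memory, so treat it as a sketch; pass_hashed is the only new part, the rest may differ in your version):

    auth {
        fixed_users {
            enabled: true
            # new field: whether the passwords below are already hashed (assumption)
            pass_hashed: false
            users: [
                { username: "jane", password: "12345678", name: "Jane Doe" }
            ]
        }
    }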
This is what I get when I am connected, and when I am logged out (after clearing cache/cookies)
AgitatedDove14 I think it's on me to take the pytorch distributed example in the clearml repo and try to reproduce the bug, then pass it over to you
For the moment this is what I would be inclined to believe