Yes, I guess that's fine then - Thanks!
I’ll definitely check that out! 🤩
Oh wow! Is it possible to not specify a remote task? (If I am working with Task.set_offline(True))
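For context, this is roughly what I mean - a minimal sketch assuming a recent clearml version (project/task names are just placeholders):
```python
from clearml import Task

# Offline mode: nothing is sent to the server, everything is stored locally
Task.set_offline(offline_mode=True)
task = Task.init(project_name="examples", task_name="offline run")  # placeholder names
# ... training code ...
task.close()
# The local session can later be imported on a machine that can reach the server,
# e.g. with Task.import_offline_session(<path to the offline session zip>)
```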
The instances take so long to start, around 5 minutes
Btw, in the pytorch_distributed_example I see that you call average_gradients, but the PyTorch docs ( https://pytorch.org/tutorials/beginner/dist_overview.html ) say: "DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training."
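To illustrate what the docs describe - just a sketch, where MyModel, loader, loss_fn and optimizer are placeholders: once the model is wrapped in DistributedDataParallel, the gradient all-reduce happens inside backward(), so a manual average_gradients step would be redundant:
```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
model = DDP(MyModel().cuda())           # MyModel is a placeholder

for data, target in loader:             # loader is a placeholder
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()                     # DDP averages gradients across ranks here
    optimizer.step()
```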
Hi AgitatedDove14, how should we proceed to fix this bug? Should I open an issue on GitHub? Should I try to make a minimal reproducible example? It's blocking me atm
On clearml or clearml-server?
I wouldn't do it: this way there is less code to maintain on your side, and honestly too much auto-magic makes it difficult for the user to control the environment (i.e. to understand what happens behind the scenes). I am not sure what switching back would solve; here the wheel should have been correct, it's just that the architecture of the card is incompatible
So I suppose clearml-agent is not responsible, because it finds a wheel for torch 1.11.0 with cu117. It just happens that, surprisingly, this wheel doesn't work on EC2 g5 instances. Either I'll hardcode the correct wheel or I'll upgrade torch to 1.13.0
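For reference, hardcoding the wheel would roughly mean pinning something like this in the experiment's requirements (just a sketch; the exact version tag and index URL still need to be double-checked):
```
--extra-index-url https://download.pytorch.org/whl/cu117
torch==1.13.0+cu117
```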
I think we should switch back and have a configuration to control which mechanism the agent uses, wdyt?
That sounds great!
@<1537605940121964544:profile|EnthusiasticShrimp49> I'll try setting the cuda version in clearml.conf, thanks for the tip!
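If I understood correctly, that would be something like this in the agent's clearml.conf (a sketch only - I still need to double-check the exact key and value format against the reference config):
```
agent {
    # force the cuda version the agent resolves torch wheels for
    cuda_version: "11.7"
}
```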
@<1523701205467926528:profile|AgitatedDove14> Could you please push the code for that version to GitHub?
AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample 🤩
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc
I opened an issue ( https://github.com/pytorch/ignite/issues/2343 ) in ignite's repo and a PR ( https://github.com/pytorch/ignite/pull/2344 ), could you please have a look? There might be a bug in clearml Task.init
in distributed envs
So probably only the main process (rank=0) should attach the ClearMLLogger?
Just tested locally; in the terminal it's the same: with the hack it works, without the hack it doesn't show the logger messages
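Something like this is what I have in mind - a sketch assuming ignite's ClearMLLogger (the import path may differ between ignite versions; trainer and the project/task names are placeholders):
```python
import torch.distributed as dist
from ignite.engine import Events
from ignite.contrib.handlers.clearml_logger import ClearMLLogger  # path may vary per ignite version

rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0
if rank == 0:
    # only the main process creates the logger / task
    clearml_logger = ClearMLLogger(project_name="examples", task_name="ddp run")  # placeholder names
    clearml_logger.attach_output_handler(
        trainer,                                    # trainer is a placeholder ignite Engine
        event_name=Events.ITERATION_COMPLETED(every=100),
        tag="training",
        output_transform=lambda loss: {"loss": loss},
    )
```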
(I didn't have this problem so far because I was using ssh keys globally, but I now want to switch to git auth using a Personal Access Token for security reasons)
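Concretely I was planning something like this in the agent's clearml.conf (a sketch; the username is made up and the PAT goes in as the password):
```
agent {
    git_user: "my-github-username"
    git_pass: "<personal access token>"
}
```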
Notice the last line should not have --docker
Did you mean --detached?
I also think we need to make sure we monitor all agents (this is important, as it is the trigger to spin down the instance)
That's what I thought, yeah. No problem, it was rather a question; if I encounter the need for that, I will adapt and open a PR 🙂
(by console you mean in the dashboard right? or the terminal?)
I finally found a workaround using cache, will detail the solution in the issue 👍
AgitatedDove14 So in the EarlyStopping class ( https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping ) I see that some info messages are logged (in the __call__ function), and I would like to have these messages logged by clearml
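What I had in mind conceptually - a sketch assuming the handler exposes its logger as a standard logging.Logger (score_fn and trainer are placeholders) - is to make the EarlyStopping logger emit to the console so clearml picks it up with the rest of the console output:
```python
import logging
from ignite.handlers import EarlyStopping

es = EarlyStopping(patience=5, score_function=score_fn, trainer=trainer)  # score_fn/trainer are placeholders
es.logger.addHandler(logging.StreamHandler())  # assuming the handler exposes a .logger attribute
es.logger.setLevel(logging.INFO)
```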
See my answer in the issue - I am not using docker
both are repos for python modules (one for the experiment and one for a dependency of the experiment)
Yes, that's what it looks like. Somehow when you clone the experiment repo, you correctly set the git creds in the url, but when the dependencies are installed, the git creds are not taken into account
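To be concrete, the dependency is a VCS requirement along these lines (names are made up), which pip clones separately from the experiment repo, so it needs its own credentials:
```
git+https://github.com/my-org/my-private-dependency.git@v1.2.0#egg=my_private_dependency
```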
Would be very cool if you could include this use case!
AgitatedDove14 Yes, but I don't see in the docs how to attach it to the logger of the EarlyStopping handler