The instances take so much time to start, like 5 mins
btw, in the pytorch_distributed_example I see that you average_gradients manually, but the pytorch docs ( https://pytorch.org/tutorials/beginner/dist_overview.html ) say: DDP takes care of gradient communication to keep model replicas synchronized and overlaps it with the gradient computations to speed up training.
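For context, the manual step the example performs amounts to roughly the following. This is a minimal sketch with plain Python lists standing in for gradient tensors (no torch.distributed involved); DDP does the equivalent all-reduce for you automatically, overlapped with the backward pass:

```python
def average_gradients(worker_grads):
    """Average per-parameter gradients across workers.

    worker_grads: one list of floats per worker (floats stand in
    for gradient tensors). This mimics the all_reduce followed by
    divide-by-world-size that the example does by hand and that
    DDP performs for you during backward.
    """
    world_size = len(worker_grads)
    return [sum(grads) / world_size for grads in zip(*worker_grads)]

# two workers, two parameters each
print(average_gradients([[1.0, 2.0], [3.0, 4.0]]))  # -> [2.0, 3.0]
```

So if the model is wrapped in DistributedDataParallel, averaging the gradients again by hand should be redundant.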
Hi AgitatedDove14, how should we proceed to fix this bug? Should I open an issue on GitHub? Should I try to make a minimal reproducible example? It's blocking me atm
On clearml or clearml-server?
I think we should switch back, and have a configuration to control which mechanism the agent uses, wdyt? (edited)
That sounds great!
AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample 🤩
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc
Just tested locally; in the terminal it's the same: with the hack it works, without the hack it doesn't show the logger messages
(I didn't have this problem so far because I was using ssh keys globally, but I now want to switch to git auth using a Personal Access Token for security reasons)
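For reference, this is the shape of clearml.conf I'd expect for PAT-based HTTPS auth, a sketch based on the agent's `git_user` / `git_pass` settings (the token value is of course a placeholder):

```
agent {
    # HTTPS git auth with a Personal Access Token:
    # git_user is the account username, git_pass is the token itself
    git_user: "your-git-username"
    git_pass: "your-personal-access-token"
}
```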
Notice the last line should not have
--docker
Did you mean --detached
?
I also think we need to make sure we monitor all agents (this is important as this is the trigger to spin down the instance)
That's what I thought, yeah. No problem, it was rather a question; if I encounter the need for that, I will adapt and open a PR 🙂
(by console you mean in the dashboard right? or the terminal?)
I finally found a workaround using the cache, will detail the solution in the issue 🙂
AgitatedDove14 So in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping class I see that some info is logged (in the __call__ function), and I would like to have this info logged by clearml
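One way I could imagine surfacing those messages — a sketch assuming ClearML's console capture records stdout, and that ignite names the logger after the module/class path (verify with `early_stopping_handler.logger.name` on your ignite version):

```python
import logging
import sys

# Assumption: ClearML automatically captures stdout, so attaching a
# StreamHandler to the EarlyStopping logger should make its messages
# appear in the task's console output. The logger name below is an
# assumption; check handler.logger.name on your ignite version.
es_logger = logging.getLogger("ignite.handlers.early_stopping.EarlyStopping")
es_logger.setLevel(logging.INFO)
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(logging.Formatter("%(name)s: %(message)s"))
es_logger.addHandler(handler)

# messages logged through this logger now go to stdout
es_logger.info("EarlyStopping: 1 / 5")
```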
See my answer in the issue - I am not using docker
both are repos for python modules (experiment one and dependency of the experiment)
Yes, that's what it looks like. Somehow when you clone the experiment repo, you correctly set the git creds in the url, but when the dependencies are installed, the git creds are not taken into account
Would be very cool if you could include this use case!
AgitatedDove14 yes but I don't see in the docs how to attach it to the logger of the earlystopping handler
No idea, I also would have expected it to be automatically logged as console output 🤔
with my hack yes, without, no
The only thing that changed is the new auth.fixed_users.pass_hashed
field, which I don't have in my config file
This is what I get, when I am connected and when I am logged out (by clearing cache/cookies)
AgitatedDove14 I think it's on me to take the pytorch distributed example in the clearml repo and try to reproduce the bug, then pass it over to you 🙂
For the moment this is what I would be inclined to believe
AgitatedDove14 How can I filter out archived tasks? I don't see this option
is there a command / file for that?