I want to make sure that an agent finishes uploading its artifacts before marking itself as complete, so that the controller does not try to access these artifacts while they are not yet available
No, I want to launch the second step after the first one is finished and all its artifacts are uploaded
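To illustrate what I mean, here is a minimal sketch of the first step blocking until its artifacts are actually stored before it completes (assuming a recent clearml SDK where upload_artifact accepts wait_on_upload and Task.flush accepts wait_for_uploads; project/task names and the artifact content are placeholders):
```python
from clearml import Task

# Placeholder project/task names
task = Task.init(project_name="examples", task_name="step-1")

# Block until this artifact has actually been uploaded to the storage backend
task.upload_artifact(
    name="results",
    artifact_object={"score": 0.9},
    wait_on_upload=True,
)

# Make sure any remaining pending uploads are done before the task is marked complete
task.flush(wait_for_uploads=True)
task.close()
```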
Yes, thanks!
Is it because I did not specify --gpu 0 that the agent, by default, pulls one experiment per available GPU?
Ok, in that case it probably doesn't work, because if the default value is 10 secs, it doesn't match what I get in the logs of the experiment: every second tqdm adds a new line
trains-agent-1: runs an experiment for a long time (>12h), then picks a new experiment on top of the long one still running
trains-agent-2: runs only one experiment at a time, normal
trains-agent-3: runs only one experiment at a time, normal
In total: 4 experiments running for 3 agents
SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?
Alright, the experiment finished properly (all models uploaded). I will restart it to check again, but it seems like the bug was introduced after that
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file for limiting the size of the logs: since there is no limit by default, their size will grow forever, which doesn't sound ideal. https://docs.docker.com/compose/compose-file/#logging
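Something like this per service is what I had in mind (a minimal sketch assuming the default json-file logging driver; apiserver is just one service shown as an example):
```yaml
services:
  apiserver:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"   # rotate each log file at 10 MB
        max-file: "3"     # keep at most 3 rotated files per container
```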
and then call task.connect_configuration probably
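Roughly along these lines (a minimal sketch; project/task names and the config dict are placeholders, the idea being to pass a plain dict instead of the omegaconf object):
```python
from clearml import Task

# Placeholder project/task names
task = Task.init(project_name="examples", task_name="config-demo")

# Plain dict instead of the omegaconf object, so it shows up in the
# experiment's CONFIGURATION section in the web UI
config = {"lr": 1e-3, "batch_size": 32}
config = task.connect_configuration(config)
```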
You mean you "aborted the task" from the UI?
Yes exactly
I'm assuming from the leftover processes?
Most likely yes, but I don't see how clearml would have an impact here; I am more inclined to think it is a pytorch dataloader issue, although I don't see why
From the log I see the agent is running in venv mode
Hmm please try with the latest clearml-agent (the others should not have any effect)
Yes, in venv mode. I'll try with the latest version as well
@<1523701205467926528:profile|AgitatedDove14> I see other RCs on PyPI but no corresponding tags in the clearml-agent repo. Are these releases legit?
What is the latest RC of clearml-agent? 1.5.2rc0?
SuccessfulKoala55 I am using ES 7.6.2
Hi @<1523701205467926528:profile|AgitatedDove14> , I want to circle back on this issue. This is still relevant, and I could collect the following on an EC2 instance where a clearml-agent was running a stuck task:
- There seems to be a problem with multiprocessing: although I stopped the task, there are still many processes forked from the main training process. I guess these are zombies. Please check the htop tree.
- There is a memory leak somewhere, please see the screenshot of datadog mem...
Hi SuccessfulKoala55, thanks for the idea! The function registered with atexit.register() isn't called though; maybe the way the agent kills the task is not supported by atexit
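For context, this is roughly what I tried. As far as I understand, atexit hooks only run on a normal interpreter exit, so if the process is stopped by a signal that is not translated into sys.exit (or by SIGKILL, which cannot be caught at all), the handler is skipped (a minimal sketch; the cleanup body is a placeholder):
```python
import atexit
import signal
import sys

def cleanup():
    # Placeholder: e.g. flush/upload remaining artifacts before the process dies
    print("cleanup called")

atexit.register(cleanup)

# Translate SIGTERM into a normal interpreter exit so atexit handlers still run.
# Nothing registered with atexit runs if the process is killed with SIGKILL.
def _on_sigterm(signum, frame):
    sys.exit(0)

signal.signal(signal.SIGTERM, _on_sigterm)
```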
Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 mins: when a worker is killed by the autoscaler, I have to wait 30 mins before the autoscaler spins up a new machine, because it thinks there are already enough agents available, while in reality the agent is down
SuccessfulKoala55 I deleted all :monitor:machine and :monitor:gpu series, but that only removed ~20M documents out of 320M in the events-training_debug_image-xyz index. I would now like to understand which experiments contain most of the documents so I can delete them, i.e. aggregate the number of documents per experiment. Is there a way to do that using the ES REST API?
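In case it helps, this is the kind of aggregation I had in mind (a minimal sketch, assuming the events index stores the experiment id in a task field, that ES is reachable on localhost:9200 without auth, and that events-training_debug_image-xyz stands in for the actual index name):
```python
import json
import requests

# Count documents per experiment (terms aggregation on the assumed "task" field)
query = {
    "size": 0,
    "aggs": {
        "docs_per_task": {
            "terms": {"field": "task", "size": 50}  # top 50 experiments by doc count
        }
    },
}

resp = requests.post(
    "http://localhost:9200/events-training_debug_image-xyz/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
)
resp.raise_for_status()

for bucket in resp.json()["aggregations"]["docs_per_task"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```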
I ended up dropping omegaconf altogether
Yea, the config is not appearing in the webUI anymore with this method
Hi TimelyPenguin76, any chance this was fixed already?
Hi TimelyPenguin76, any chance this was fixed?
Thanks, the message is not logged in GCloud instance logs when using startup scripts, which is why I did not see it.