That's how I would do it, maybe the folks from allegro-ai can come up with a better approach 🙂
Very cool! Run two trains-agent daemons, one per GPU on the same machine, with the default Nvidia/CUDA Docker
This is close to my use case, I would just like to run these two daemons without docker. Would that be possible? I should just remove the --docker nvidia/cuda param, right?
trains-agent daemon --gpus 0 --queue default &
trains-agent daemon --gpus 1 --queue default &
Oh, the object is actually available in previous_task.artifacts
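For reference, a minimal sketch of pulling an object from a previous task's artifacts (the task id and artifact names here are placeholders, not from the actual run):

from clearml import Task

# placeholder task id; in practice it comes from the scheduler / previous run
previous_task = Task.get_task(task_id="<previous_task_id>")
# a file artifact can be downloaded locally
checkpoint_path = previous_task.artifacts["checkpoint"].get_local_copy()
# an object artifact registered with upload_artifact can be retrieved directly
obj = previous_task.artifacts["my_object"].get()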
Hi there, yes I was able to make it work with some glue code:
- Save your model, optimizer, and scheduler every epoch.
- Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME (rough sketch of that thread below).
- Have a service that periodically pulls failed experiments with the TO_RESUME tag from the queue, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra param.
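A rough sketch of what that watcher thread could look like, assuming AWS spot instances (the endpoint, tag name, and polling interval are illustrative, not the exact glue code):

import threading
import time

import requests
from clearml import Task

# the spot interruption endpoint returns 404 until the instance is scheduled for stop
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_termination(poll_seconds=30):
    task = Task.current_task()
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                # instance is marked for stop: tag the task so the rescheduling service picks it up
                task.add_tags(["TO_RESUME"])
                return
        except requests.RequestException:
            pass  # metadata endpoint temporarily unreachable, try again later
        time.sleep(poll_seconds)

threading.Thread(target=watch_for_termination, daemon=True).start()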
Alright, how can I then mount a volume from the disk?
I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
interestingly, it works on one machine, but not on another one
Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))
The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist
Oh wait, actually I am wrong
I think it comes from the web UI of clearml-server version 1.2.0, because I didn't change anything else
So two possible cases for trains-agent-1, either:
- it picks a new experiment -> the "workers" tab randomly shows one of the two experiments
- there is no new experiment to start in the default queue -> it randomly shows no experiment or the one it is running
Thanks a lot for the solution SuccessfulKoala55! I'll try that if the solution "delete old bucket, wait for its name to be available, recreate it with the other aws account, transfer the data back" fails
SuccessfulKoala55 I was able to make it work with use_credentials_chain: true in the clearml.conf and the following patch: https://github.com/allegroai/clearml/pull/478
same as the first one described
So it looks like the agent, from time to time, thinks it is not running an experiment
When an experiment on trains-agent-1 finishes, I randomly see either no experiment or the long experiment, and when two experiments are running, I randomly see one of the two experiments
By mistake I have two agents started on one machine
the latest version, but I think it's normal: I set TRAINS_WORKER_ID = "trains-agent":$DYNAMIC_INSTANCE_ID, where DYNAMIC_INSTANCE_ID is the ID of the machine
Hi @<1523701205467926528:profile|AgitatedDove14> @<1537605940121964544:profile|EnthusiasticShrimp49> , the issue above seemed to be the memory leak, and it looks like there is no problem on the clearml side.
I trained successfully without mem leak with num_workers=0 and I am now testing with num_workers=8.
Sorry for the false positive :man-bowing:
I think that somehow somewhere a reference to the figure is still living, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet
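To illustrate the kind of pattern I suspect, here is a toy example (not the actual training code) where a stray reference keeps every figure alive despite plt.close("all") and gc:

import gc

import matplotlib
matplotlib.use("Agg")  # headless backend for the toy example
import matplotlib.pyplot as plt
import numpy as np

leaked = []  # stand-in for whatever unintentionally holds on to figures

for step in range(100):
    fig, ax = plt.subplots()
    ax.plot(np.random.rand(1000))
    leaked.append(fig)   # a stray reference like this keeps the Figure object alive
    plt.close("all")     # detaches figures from pyplot's registry...
    gc.collect()         # ...but gc cannot reclaim objects that are still referenced

print(len(leaked), "figures still in memory")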
Early debugging signals show that auto_connect_frameworks={'matplotlib': False, 'joblib': False} seems to have a positive impact - it is running now, I will confirm in a bit
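For reference, this is where that flag goes; a minimal sketch with placeholder project/task names:

from clearml import Task

# placeholder names; the relevant part is auto_connect_frameworks,
# which disables ClearML's automatic matplotlib and joblib logging for this task
task = Task.init(
    project_name="examples",
    task_name="memory-leak-debugging",
    auto_connect_frameworks={'matplotlib': False, 'joblib': False},
)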