
the latest version, but I think it's normal: I set TRAINS_WORKER_ID = "trains-agent":$DYNAMIC_INSTANCE_ID, where DYNAMIC_INSTANCE_ID is the ID of the machine
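Roughly what that looks like on the worker side, as a minimal sketch (deriving the instance ID from the hostname and the exact launch command are just assumptions here):

```python
import os
import socket
import subprocess

# Assumption: use the hostname as the dynamic instance ID; on a cloud VM this
# could instead come from the instance metadata service.
dynamic_instance_id = socket.gethostname()
os.environ["TRAINS_WORKER_ID"] = f"trains-agent:{dynamic_instance_id}"

# Start the agent with the worker ID already present in its environment
subprocess.run(
    ["python3", "-m", "trains_agent", "daemon", "--queue", "default", "--detached"],
    check=True,
)
```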
Answering myself: Yes, Task.set_base_docker
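For reference, a minimal sketch of what I mean (project/task names and the image string are placeholders):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="docker-base")

# Ask the agent to run this task inside a specific docker image,
# with extra docker arguments appended after the image name
task.set_base_docker("nvidia/cuda:10.1-runtime-ubuntu18.04 --network host")
```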
RTFM!!!
(by console you mean in the dashboard right? or the terminal?)
Hoo I found:
user@trains-agent-1: ps -ax
5199 ?  Sl  29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ?  Sl  30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.toc3_yks.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.1dsz4bz8:/root/.ssh', '-v', '/home/user/.clearml/apt-cache.2:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '...
ho wait, actually I am wrong
Oh, and also use the colors of the series. That would be a killer feature. Then I would simply need to match the color of the series to the name to check the tags
AgitatedDove14 Good news, I was able to reproduce the bug on the pytorch distributed sample
Here it is > https://github.com/H4dr1en/trains/commit/642c1130ad1f76db10ed9b8e1a4ff0fd7e45b3cc
thanks for your help!
even if I explicitly use previous_task.output_uri = "s3://my_bucket", it is ignored and still saves the JSON file locally
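For context, what I would expect to be the equivalent at init time, as a minimal sketch (project/task names are placeholders):

```python
from clearml import Task

# Assumption: passing output_uri at init time, so artifacts and models are
# uploaded to the bucket instead of being kept on the local file system
task = Task.init(
    project_name="examples",
    task_name="s3-output",
    output_uri="s3://my_bucket",
)
```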
I also tried task.set_initial_iteration(-task.data.last_iteration), hoping it would counteract the bug, but it didn't work
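To be clear, this is roughly how I call it (a sketch; continue_last_task and the names are just assumptions about how the run is resumed):

```python
from clearml import Task

# Sketch of the workaround I tried: resume the previous task and push the
# initial iteration into the negative range to cancel out the offset
task = Task.init(
    project_name="examples",   # placeholder
    task_name="resume-run",    # placeholder
    continue_last_task=True,
)
task.set_initial_iteration(-task.data.last_iteration)
```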
BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far
The task requires this service, so the task starts it on the machine. Then I want to make sure the service is closed by the task upon completion/failure/abortion.
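What I have in mind is roughly this, assuming the service is a plain subprocess started by the task (the command is a placeholder):

```python
import atexit
import subprocess

# The task starts the service it depends on
service = subprocess.Popen(["my_service", "--port", "8080"])  # placeholder command

def _shutdown_service():
    # Runs on normal completion and after unhandled exceptions; a hard kill of
    # the task process (SIGKILL) would still bypass this handler
    service.terminate()
    try:
        service.wait(timeout=30)
    except subprocess.TimeoutExpired:
        service.kill()

atexit.register(_shutdown_service)
```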
AppetizingMouse58 the events_plot.json template is missing the plot_len declaration, could you please give me the definition of this field? (reindexing with dynamic: strict fails with: "mapping set to strict, dynamic introduction of [plot_len] within [_doc] is not allowed")
Not of the ES cluster, I only created a backup of the clearml-server instance disk, I didn't think there could be a problem with ES…
Ok, but that means this cleanup code should live somewhere else than inside the task itself right? Otherwise it won't be executed since the task will be killed
I am actually calling the following later in the start_training function:
with idist.Parallel(backend="nccl") as parallel:
    parallel.run(training_func)
So my backend should be nccl and not gloo, right? Not sure how important it is; I read in https://pytorch.org/docs/stable/distributed.html#which-backend-to-use that nccl should be used for distributed GPU training and gloo for distributed CPU training
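For context, the full call is roughly this (a sketch; nproc_per_node=2 and the body of training_func are assumptions):

```python
import ignite.distributed as idist

def training_func(local_rank):
    # local_rank is injected by idist.Parallel; the real training loop goes here
    print(f"rank {idist.get_rank()} on device {idist.device()}")

# "nccl" for multi-GPU training; "gloo" would be the CPU-only fallback
with idist.Parallel(backend="nccl", nproc_per_node=2) as parallel:
    parallel.run(training_func)
```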
I will try with clearml==1.1.5rc2
mmh it looks like what I was looking for, I will give it a try
Is there any logic on the server side that could change the iteration number?
and the agent says agent.cudnn_version = 0
(Btw the instance listed in the console has no name, is it normal?)