Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.toc3_yks.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.1dsz4bz8:/root/.ssh', '-v', '/home/user/.clearml/apt-cache.2:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '...
I see 3 agents in the "Workers" tab
so that any error that could arise from communication with the server could be tested
Default would be venv, and only use docker if an image is passed. Use case: not having to duplicate all queues to accept both docker and venv agents on the same instances
Ok, in that case it probably doesn't work, because if the default value is 10 secs, it doesn't match what I get in the logs of the experiment: tqdm adds a new line every second
I would probably leave it to the ClearML team to answer you, I am not using the UI app and for me it worked just fine with different regions. Maybe check the permissions of the key/secret?
So probably only the main process (rank=0) should attach the ClearMLLogger?
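Something like this minimal sketch, assuming Ignite's ClearMLLogger and an already-initialized torch.distributed process group:
```
# Attach the ClearMLLogger only on the main process (rank 0);
# the other ranks skip ClearML reporting entirely.
import torch.distributed as dist
from ignite.contrib.handlers.clearml_logger import ClearMLLogger

rank = dist.get_rank() if dist.is_available() and dist.is_initialized() else 0

clearml_logger = None
if rank == 0:
    # project/task names here are placeholders
    clearml_logger = ClearMLLogger(project_name="examples", task_name="ddp-training")
```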
amazon linux
I am looking for a way to gracefully stop the task (clean up artifacts, shutdown backend service) on the agent
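As a rough sketch of what I mean, assuming the agent terminates an aborted task process with SIGTERM (the cleanup helpers below are hypothetical):
```
import signal
import sys

def _graceful_stop(signum, frame):
    # hypothetical cleanup steps:
    # upload_pending_artifacts()
    # shutdown_backend_service()
    sys.exit(0)

# run cleanup when the agent aborts/terminates the task process
signal.signal(signal.SIGTERM, _graceful_stop)
signal.signal(signal.SIGINT, _graceful_stop)
```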
(I didn't have this problem so far because I was using SSH keys globally, but I now want to switch to git auth using a Personal Access Token for security reasons)
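For reference, this is the kind of clearml.conf section I have in mind on the agent machine (assuming HTTPS clone URLs; the PAT goes into git_pass, the credentials shown are placeholders):
```
agent {
    git_user: "my-git-username"
    git_pass: "my-personal-access-token"
}
```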
I ended up dropping omegaconf altogether
Not sure about that, I think you guys solved it with your PipelineController implementation. I would need to test it before giving any feedback 🙂
I will go for lunch actually 🙂 back in ~1h
Will it freeze/crash/break/stop the ongoing experiments?
Ok yes, I get it, this info is also available at the very beginning of the logs, where the agent logs the full docker run command; is this docker_cmd a shorter version?
I checked the commit date, went to all experiments, and scrolled until I found the experiment
AgitatedDove14 Yes, I have xpack security disabled, as in the link you shared (note that it's xpack.security.enabled: "false", with quotes around false), but this command throws:
{"error":{"root_cause":[{"type":"parse_exception","reason":"request body is required"}],"type":"parse_exception","reason":"request body is required"},"status":400}
might be worth documenting 🙂
I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
No space, I will add it and test 🙂
CostlyOstrich36 Were you able to reproduce it? That's rather annoying 🙂
So most likely one hard requirement installs version 2 of pyjwt while setting up the experiment
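If that turns out to be the case, one possible workaround (assuming the version spec is honored by add_requirements) would be to pin it from the task code before Task.init:
```
from clearml import Task

# Force a pyjwt 1.x pin so the agent does not pull pyjwt 2.x
# while setting up the environment; must be called before Task.init()
Task.add_requirements("pyjwt", "<2.0")

task = Task.init(project_name="examples", task_name="pin-pyjwt")
```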
I can probably have a python script that checks if there are any tasks running/pending, and if not, runs docker-compose down to stop the clearml-server, then uses boto3 to trigger the creation of a snapshot of the EBS volume, waits until it is finished, and then restarts the clearml-server. Wdyt?
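Something along these lines, as a rough sketch (the compose path and volume id are placeholders):
```
import subprocess
import boto3
from clearml import Task

COMPOSE_DIR = "/opt/clearml"                 # placeholder: docker-compose.yml location
EBS_VOLUME_ID = "vol-0123456789abcdef0"      # placeholder: clearml-server data volume

# any tasks still running or waiting in a queue?
active = Task.get_tasks(task_filter={"status": ["in_progress", "queued"]})
if not active:
    subprocess.check_call(["docker-compose", "down"], cwd=COMPOSE_DIR)

    ec2 = boto3.client("ec2")
    snap = ec2.create_snapshot(VolumeId=EBS_VOLUME_ID, Description="clearml-server backup")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snap["SnapshotId"]])

    subprocess.check_call(["docker-compose", "up", "-d"], cwd=COMPOSE_DIR)
```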
Also maybe we are not on the same page - by clean up, I mean kill a detached subprocess on the machine executing the agent
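i.e. something like this (the service command is just a placeholder), where the subprocess is started in its own session so the whole process group can be signalled from the cleanup handler:
```
import os
import signal
import subprocess

# start the detached backend service in its own process group
proc = subprocess.Popen(["my-backend-service"], start_new_session=True)  # placeholder command

def kill_detached(p):
    # signal the whole process group so detached children go down with it
    os.killpg(os.getpgid(p.pid), signal.SIGTERM)
```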
AgitatedDove14 any chance you found something interesting? 🙂
I want to make sure that an agent has finished uploading its artifacts before marking itself as complete, so that the controller does not try to access these artifacts while they are not yet available
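From what I can tell, something like this should do it on the task side (sketch; the artifact name and file path are placeholders):
```
from clearml import Task

task = Task.current_task()

# wait_on_upload makes the call block until the artifact is actually stored
task.upload_artifact(name="predictions", artifact_object="predictions.csv", wait_on_upload=True)

# flush any remaining uploads before the task reports itself as completed
task.flush(wait_for_uploads=True)
```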
I am running on bare metal, and cuda seems to be installed at /usr/lib/x86_64-linux-gnu/libcuda.so.460.39