I was able to fix it by applying for a license and registering it
Hi PompousParrot44, you could have a Controller task running in the services queue that periodically schedules the task you want to run
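Something along these lines (rough sketch only; the template task ID and queue name are placeholders, not from this thread):
```python
# Minimal controller task: clone a template task and re-enqueue it periodically.
import time
from clearml import Task

controller = Task.init(project_name="controllers", task_name="periodic scheduler")

TEMPLATE_TASK_ID = "<template-task-id>"  # placeholder: ID of the task to schedule
TARGET_QUEUE = "default"                 # placeholder: queue the clones should run on

while True:
    cloned = Task.clone(source_task=TEMPLATE_TASK_ID, name="scheduled run")
    Task.enqueue(cloned, queue_name=TARGET_QUEUE)
    time.sleep(60 * 60)  # wait an hour before scheduling the next run
```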
Isn't it overkill to run a whole Ubuntu 18.04 just to run a dead simple controller task?
Amazon Linux
I am looking for a way to gracefully stop the task (clean up artifacts, shut down the backend service) on the agent
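Roughly what I have in mind, inside the task code (just a sketch; it assumes the agent delivers SIGTERM on abort, and the two helpers are placeholders):
```python
import signal
import sys

def cleanup_artifacts():
    # placeholder: remove temp files, flush/upload final artifacts, etc.
    pass

def shutdown_backend_service():
    # placeholder: stop the detached backend service the task started
    pass

def _graceful_shutdown(signum, frame):
    cleanup_artifacts()
    shutdown_backend_service()
    sys.exit(0)

# assumption: the agent sends SIGTERM when the task is aborted
signal.signal(signal.SIGTERM, _graceful_shutdown)
```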
(I didn't have this problem so far because I was using SSH keys globally, but now I want to switch to git auth using a Personal Access Token for security reasons)
I ended up dropping omegaconf altogether
Not sure about that, I think you guys solved it with your PipelineController implementation. I would need to test it before giving any feedback 🙂
I will go for lunch actually 🙂 back in ~1h
Will it freeze/crash/break/stop the ongoing experiments?
Ok yes, I get it. This info is also available at the very beginning of the logs, where the agent logs the full docker run command; is this docker_cmd a shorter version?
Bottom line is: trains-server uses the elasticsearch image http://docker.elastic.co/elasticsearch/elasticsearch:5.6.16 which does not have an unlimited license (only a free license that expires after some time). From version 6.3, elasticsearch provides an unlimited free license. Trains should use >=6.3, WDYT?
I checked the commit date, went to all experiments, and scrolled until I found the experiment
AgitatedDove14 Yes, I have the xpack security disabled, as in the link you shared (note that it's xpack.security.enabled: "false" with quotes around false), but this command throws:
{"error":{"root_cause":[{"type":"parse_exception","reason":"request body is required"}],"type":"parse_exception","reason":"request body is required"},"status":400}
might be worth documenting 🙂
I think this is because this API is not available in elastic 5.6
I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
No space, I will add and test 🙂
CostlyOstrich36 Were you able to reproduce it? That's rather annoying 🙂
so most likely one hard requirement installs version 2 of pyjwt while setting up the experiment
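If that is the cause, one possible workaround could be forcing the version before Task.init (sketch only; assumes Task.add_requirements accepts a version specifier like this):
```python
from clearml import Task

# Pin pyjwt below 2.x in the requirements the agent will install for this experiment.
# Must be called before Task.init().
Task.add_requirements("pyjwt", "<2.0")

task = Task.init(project_name="examples", task_name="pyjwt pin")
```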
I can probably have a Python script that checks if there are any tasks running/pending and, if not, runs docker-compose down to stop the clearml-server, then uses boto3 to trigger the creation of a snapshot of the EBS volume, waits until it is finished, and then restarts the clearml-server, wdyt?
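Roughly something like this (just a sketch; the volume ID, compose directory, and status values are assumptions on my side):
```python
import subprocess

import boto3
from clearml import Task

VOLUME_ID = "vol-0123456789abcdef0"   # placeholder: EBS volume backing the clearml-server data
COMPOSE_DIR = "/opt/clearml"          # placeholder: directory containing docker-compose.yml

# 1. Only proceed if nothing is running or queued
active = Task.get_tasks(task_filter={"status": ["in_progress", "queued"]})
if not active:
    # 2. Stop the clearml-server
    subprocess.run(["docker-compose", "down"], cwd=COMPOSE_DIR, check=True)

    # 3. Snapshot the EBS volume and wait until it is finished
    ec2 = boto3.client("ec2")
    snapshot = ec2.create_snapshot(VolumeId=VOLUME_ID, Description="clearml-server backup")
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])

    # 4. Restart the clearml-server
    subprocess.run(["docker-compose", "up", "-d"], cwd=COMPOSE_DIR, check=True)
```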
Also maybe we are not on the same page - by "clean up", I mean killing a detached subprocess on the machine executing the agent
` Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.toc3_yks.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.1dsz4bz8:/root/.ssh', '-v', '/home/user/.clearml/apt-cache.2:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '...
I see 3 agents in the "Workers" tab
AgitatedDove14 Is it fixed with trains-server 0.15.1?
so that any error that could arise from communication with the server could be tested
Default would be venv, and only use docker if an image is passed. Use case: not having to duplicate all queues to accept both docker and venv agents on the same instances