Hi AnxiousSeal95, I hope you had a nice holiday! Thanks for the update! I discovered h2o when looking for ways to deploy dashboards with apps like streamlit. Most likely I will use either streamlit deployed through clearml, or h2o standalone if ClearML won't support deploying apps (which is totally fine, no offense there 🙂 )
Setting it after the training correctly updated the task and I was able to store artifacts remotely
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
` import time
from clearml import Task

controller_task = Task.init(...)
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)

while True:
    periodic_task = Task.clone(source_task=template_task_id)
    # Change parameters of periodic_task here if necessary
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS) `
Hi PompousParrot44, you could have a Controller task running in the services queue that periodically schedules the task you want to run
I also tried setting ebs_device_name = "/dev/sdf" - didn't work
I think waiting for the apt locks to be released with something like this would work:
` startup_bash_script = [
    "#!/bin/bash",
    "while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done",
    "sudo apt-get update",
    ...
Weirdly this throws an error in the autoscaler:
` Spinning new instance type=v100_spot
Error: Failed to start new instance, unexpected '{' in field...
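A minimal sketch of the same wait loop with the lock paths spelled out one by one, in case the bash brace expansion is what the autoscaler's field parser rejects (the variable name startup_bash_script just follows the draft above and is an assumption about your autoscaler config):
` # Sketch: same apt-lock wait as above, written without '{...}' brace expansion
startup_bash_script = [
    "#!/bin/bash",
    "while sudo fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock >/dev/null 2>&1; "
    "do echo 'Waiting for other instances of apt to complete...'; sleep 5; done",
    "sudo apt-get update",
] `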
That's how I would do it, maybe the folks from allegro-ai can come up with a better approach 👍
Very cool! Run two train-agent daemons, one per GPU on the same machine, with default Nvidia/CUDA Docker
This is close to my use case, I just would like to run these two daemons without docker - would that be possible? I should just remove the --docker nvidia/cuda param, right?
` trains-agent daemon --gpus 0 --queue default & trains-agent daemon --gpus 1 --queue default & `
Oh, the object is actually available in previous_task.artifacts
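For completeness, a minimal sketch of pulling the object back out (the task ID and the artifact name "my_object" are placeholders):
` from clearml import Task

# Placeholders: use the real task ID and the name the artifact was uploaded under
previous_task = Task.get_task(task_id="<previous-task-id>")
obj = previous_task.artifacts["my_object"].get()  # deserialized Python object
local_path = previous_task.artifacts["my_object"].get_local_copy()  # or a local file copy `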
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer, and scheduler every epoch.
2. Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in that case, add a custom tag, e.g. TO_RESUME (a rough sketch of this is below).
3. Have a service that periodically pulls failed experiments tagged TO_RESUME, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra parameter.
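A hedged sketch of that monitoring thread (step 2), assuming an AWS spot instance with IMDSv1 enabled; the 60-second polling interval and the constant name are my own choices, while the TO_RESUME tag comes from the recipe above:
` import threading
import time

import requests
from clearml import Task

# AWS exposes the spot interruption notice on the instance metadata endpoint below
# (with IMDSv2 enforced you would also need a session token).
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def watch_for_interruption(poll_secs=60):
    task = Task.current_task()
    while True:
        try:
            resp = requests.get(SPOT_ACTION_URL, timeout=2)
            if resp.status_code == 200:
                # Instance is marked for stop/termination: tag the task so the
                # rescheduling service can pick it up later.
                task.add_tags(["TO_RESUME"])
                return
        except requests.RequestException:
            pass  # metadata endpoint unreachable; try again on the next poll
        time.sleep(poll_secs)

threading.Thread(target=watch_for_interruption, daemon=True).start() `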
you mean “docker” was not installed and it did not throw an error ?
Yes, docker was not installed on the machine
Yes you must make sure the docker can mount a persistent folder for you to work on.
Ok, it would be nice to have a --user-folder-mounted option that does the linking automatically
Alright, how can I then mount a volume from the host disk?
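One possible way (a hedged sketch, assuming a clearml version where Task.set_base_docker accepts docker_arguments; project/task names and paths are placeholders) is to attach the extra -v mount to the task itself so the agent adds it when starting the container:
` from clearml import Task

task = Task.init(project_name="examples", task_name="with persistent mount")
# Placeholder paths: mount a host folder into the container so work on it persists
task.set_base_docker(
    docker_image="nvidia/cuda:10.1-runtime-ubuntu18.04",
    docker_arguments="-v /home/user/persistent:/persistent",
) `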
` Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.toc3_yks.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.1dsz4bz8:/root/.ssh', '-v', '/home/user/.clearml/apt-cache.2:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '...
I get the same error when trying to run the task using clearml-agent services-mode with docker, so weird
interestingly, it works on one machine, but not on another one
Ok, now I get `ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))`
The file /tmp/.clearml_agent_out.j7wo7ltp.txt does not exist
In my CI tests, I want to reproduce a run in an agent, because the environment changes and some things break in agents but not locally
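A minimal sketch of what that CI step could look like (the template task ID and queue name are placeholders, and wait_for_status assumes a reasonably recent clearml version):
` from clearml import Task

# Clone a known-good template task, run it through an agent, and fail the CI job
# if the remote execution does not complete. IDs and queue name are placeholders.
template = Task.get_task(task_id="<template-task-id>")
ci_task = Task.clone(source_task=template, name="CI reproduction run")
Task.enqueue(ci_task, queue_name="default")

# Block until the agent finishes; raises if the task ends up failed/stopped.
ci_task.wait_for_status(
    status=(Task.TaskStatusEnum.completed,),
    raise_on_status=(Task.TaskStatusEnum.failed, Task.TaskStatusEnum.stopped),
    check_interval_sec=30,
) `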
Oh wait, actually I am wrong
CostlyOstrich36, actually this only happens for a single agent. The weird thing is that I have a machine with two GPUs, and I spawn two agents, one per GPU. Both have the same version. For one, I can see all the logs, but not for the other
No, they have different names - I will try to update both agents to the latest version
I am sorry the information I can give is not very precise, but it’s the best I can do - is this bug happening only to me?
I think it comes from the web UI of clearml-server version 1.2.0, because I didn’t change anything else
CostlyOstrich36 Were you able to reproduce it? That’s rather annoying 😅