So I guess the problem is that the following snippet: `from clearml import Task; Task.init()` should be added before the `if __name__ == "__main__":`?
AgitatedDove14 So I copied https://github.com/pytorch-ignite/examples/blob/main/tutorials/intermediate/cifar10-distributed.py from the pytorch-ignite examples and pasted it locally. Then I added a requirements.txt and called clearml-task to run it on one of my agents. I adapted the script a bit (removed python-fire since it's not yet supported by clearml).
UnevenDolphin73 , `task = clearml.Task.get_task(clearml.config.get_remote_task_id())` worked, thanks
I also discovered https://h2oai.github.io/wave/ last week, would be awesome to be able to deploy it in the same manner
Hi AnxiousSeal95 , I hope you had nice holidays! Thanks for the update! I discovered h2o when looking for ways to deploy dashboards with apps like streamlit. Most likely I will use either streamlit deployed through clearml or h2o as standalone if ClearML won't support deploying apps (which is totally fine, no offense there)
Setting it after the training correctly updated the task and I was able to store artifacts remotely
I will let the team answer you on that one
I don't think there is an example for this use case in the repo currently, but the code should be fairly simple (below is a rough draft of what it could look like)
` import time
from clearml import Task

# template_task_id and TRIGGER_TASK_INTERVAL_SECS are placeholders you define yourself
controller_task = Task.init(...)
# Stop the local run and relaunch this controller from the "services" queue
controller_task.execute_remotely(queue_name="services", clone=False, exit_process=True)
while True:
    periodic_task = Task.clone(source_task=template_task_id)
    # Change parameters of {periodic_task} if necessary
    Task.enqueue(periodic_task, queue_name="default")
    time.sleep(TRIGGER_TASK_INTERVAL_SECS) `
Hi PompousParrot44 , you could have a Controller task running in the services queue that periodically schedules the task you want to run
I also tried setting `ebs_device_name = "/dev/sdf"` - didn't work
I think waiting for the apt locks to be released with something like this would work: `startup_bash_script = [ "#!/bin/bash", "while sudo fuser /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock >/dev/null 2>&1; do echo 'Waiting for other instances of apt to complete...'; sleep 5; done", "sudo apt-get update", ...`
Weirdly this throws an error in the autoscaler:
` Spinning new instance type=v100_spot
Error: Failed to start new instance, unexpected '{' in field... `
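One possible workaround, assuming (this is not confirmed in the thread) that the error comes from the `{...}` brace pattern in the script line being parsed as a format field: spell out the three lock files the pattern expands to in bash, so the generated line contains no braces. `build_apt_wait_line` is a hypothetical helper for illustration, not part of ClearML.

```python
# The three paths that /var/{lib/{dpkg,apt/lists},cache/apt/archives}/lock
# expands to under bash brace expansion.
APT_LOCK_FILES = [
    "/var/lib/dpkg/lock",
    "/var/lib/apt/lists/lock",
    "/var/cache/apt/archives/lock",
]


def build_apt_wait_line(lock_files=APT_LOCK_FILES, poll_secs=5):
    """Return a bash one-liner that blocks while any process holds an apt lock."""
    paths = " ".join(lock_files)
    return (
        f"while sudo fuser {paths} >/dev/null 2>&1; do "
        "echo 'Waiting for other instances of apt to complete...'; "
        f"sleep {poll_secs}; done"
    )


# Same structure as the list in the message above, but without any '{' or '}'.
startup_bash_script = [
    "#!/bin/bash",
    build_apt_wait_line(),
    "sudo apt-get update",
]
```

If the error persists with a brace-free script, the cause is elsewhere and this sketch will not help.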
That's how I would do it; maybe the guys from allegro-ai can come up with a better approach
as it's also based on pytorch-ignite!
I'm not sure I understand, what is the link with pytorch-ignite?
We're in the brainstorming phase of what are the best approaches to integrate, we might pick your brain later on
Awesome, I'd be happy to help!
Very cool! Run two trains-agent daemons, one per GPU on the same machine, with default Nvidia/CUDA Docker
This is close to my use case, I would just like to run these two daemons without docker, would that be possible? I should just remove the `--docker nvidia/cuda` param, right?
trains-agent daemon --gpus 0 --queue default & trains-agent daemon --gpus 1 --queue default &
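As a rough sketch of the difference between the two launch modes, the helper below builds the daemon command with or without the `--docker` flag. `daemon_command` is a hypothetical helper; only the `--gpus`, `--queue` and `--docker` flags come from the thread, and whether the non-docker mode suits a given machine (CUDA and Python set up on the host) is an assumption to check.

```python
import shlex


def daemon_command(gpu, queue="default", docker_image=None):
    """Build the trains-agent daemon command line for a single GPU."""
    cmd = ["trains-agent", "daemon", "--gpus", str(gpu), "--queue", queue]
    if docker_image:
        # Only the docker mode adds this flag; leaving it out runs on the host.
        cmd += ["--docker", docker_image]
    return cmd


# One daemon per GPU, without docker, matching the command line above.
for gpu in (0, 1):
    print(shlex.join(daemon_command(gpu)) + " &")
```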
Oh, the object is actually available in `previous_task.artifacts`
Hi there, yes I was able to make it work with some glue code:
1. Save your model, optimizer, scheduler every epoch
2. Have a separate thread that periodically pulls the instance metadata and checks if the instance is marked for stop; in this case, add a custom tag, e.g. TO_RESUME
3. Have a service that periodically pulls failed experiments from the queue with the tag TO_RESUME, force-marks them as stopped instead of failed, and reschedules them with the last checkpoint as an extra param
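The second step could be sketched roughly as below. This is not the original glue code: it assumes an AWS spot instance with the IMDSv1 metadata endpoint reachable, and `is_marked_for_stop` and `watch_for_interruption` are hypothetical names. What to do on interruption (for example tagging the task) is passed in as a callback.

```python
import threading
import urllib.error
import urllib.request

# On AWS spot instances this path returns 200 once a stop/terminate notice
# is pending and 404 otherwise (assumes IMDSv1 access is enabled).
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def is_marked_for_stop(url=SPOT_ACTION_URL, timeout=2.0):
    """Return True if the instance metadata reports a pending interruption."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # 404, timeout, or no metadata service: treat as "not marked".
        return False


def watch_for_interruption(on_interrupt, check=is_marked_for_stop,
                           interval_secs=5.0, stop_event=None):
    """Poll `check` in a daemon thread and call `on_interrupt` once when it fires."""
    stop_event = stop_event or threading.Event()

    def _loop():
        while not stop_event.is_set():
            if check():
                on_interrupt()
                return
            stop_event.wait(interval_secs)

    thread = threading.Thread(target=_loop, daemon=True)
    thread.start()
    return thread, stop_event
```

With a ClearML task object in scope, the callback could be something like `lambda: task.add_tags(["TO_RESUME"])`; setting the returned event stops the watcher when training finishes normally.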
You mean "docker" was not installed and it did not throw an error?
Yes docker was not installed in the machine
Yes you must make sure the docker can mount a persistent folder for you to work on.
Ok, it would be nice to have a `--user-folder-mounted` option that does the linking automatically
So I installed docker, added user to group allowed to run docker (not to have to run with sudo, otherwise it fails), then ran these two commands and it worked
I made some progress TimelyPenguin76 , now the task runs and I get this error from docker: `docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].`
Alright, how can I then mount a volume of the disk?
So that I don't lose what I worked on when stopping the session, and if I need to, I can ssh to the machine and directly access the content inside the user folder
is there a command / file for that?
Alright, I have a follow-up question then: I used the param `--user-folder "~/projects/my-project"`, but any change I make is not reflected in this folder. I guess I am in the docker space, but this folder is not linked to the folder on the machine. Is it possible to do so?
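For reference, docker links a host folder into a container with a `-v host_path:container_path` argument, the same form that appears in the agent's `docker run` command in this thread. The sketch below only builds that argument pair; `docker_volume_args` is a hypothetical helper, the container path is an example, and how the extra arguments are handed to the session or agent is left open here.

```python
import os


def docker_volume_args(host_dir, container_dir):
    """Return the docker arguments that bind-mount host_dir at container_dir."""
    # Docker needs an absolute host path, so expand "~" and relative paths first.
    host_path = os.path.abspath(os.path.expanduser(host_dir))
    return ["-v", f"{host_path}:{container_dir}"]


# Example: expose the project folder from the host inside the container.
print(docker_volume_args("~/projects/my-project", "/root/projects/my-project"))
```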
Awesome! Thanks!
` Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '--network', 'host', '-e', 'CLEARML_WORKER_ID=office:worker-0:docker', '-e', 'CLEARML_DOCKER_IMAGE=nvidia/cuda:10.1-runtime-ubuntu18.04 --network host', '-v', '/home/user/.gitconfig:/root/.gitconfig', '-v', '/tmp/.clearml_agent.toc3_yks.cfg:/root/clearml.conf', '-v', '/tmp/clearml_agent.ssh.1dsz4bz8:/root/.ssh', '-v', '/home/user/.clearml/apt-cache.2:/var/cache/apt/archives', '-v', '/home/user/.clearml/pip-cache:/root/.cache/pip', '...