General infrastructure question:
My company isn't using AWS for training; all of our GPUs are in-house, in our own servers. On one hand we want to give all the GPUs to the clearml-agent (i.e. make them available for tasks), but on the other hand I want to give my developers the chance to develop on GPUs that aren't being used.
The best-case scenario would be that the agent doesn't hand out a GPU that already has a task running on it, but I wasn't able to find out whether that's possible. My workaround is a tool that my developers run when they start debugging: it restarts the daemon with the --gpus argument changed to include only the GPUs they aren't working on (most of our servers have 2-8 GPUs). In theory it should work, but I'm not sure what happens if, say, a task is already running on GPU 0 when a developer takes the daemon down. Will the task keep running? Is there a way to change which GPUs are visible without taking down the daemon?
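
A minimal sketch of the restart tool described above, in Python. The idle check is an assumption, not a ClearML feature: it treats a GPU as free when nvidia-smi reports very little used memory. The 100 MiB threshold and the function names are hypothetical. It only builds the clearml-agent command and does not run it.

```python
import subprocess
from typing import List

# nvidia-smi query that prints one "index, used_MiB" line per GPU.
SMI_QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used",
    "--format=csv,noheader,nounits",
]


def query_gpus() -> str:
    """Run nvidia-smi and return its raw CSV output."""
    result = subprocess.run(SMI_QUERY, check=True, capture_output=True, text=True)
    return result.stdout


def parse_idle_gpus(smi_csv: str, max_used_mib: int = 100) -> List[int]:
    """Return indices of GPUs whose used memory is at or below the threshold.

    Assumption: low memory use means nobody is working on the GPU.
    A value such as "[N/A]" would raise ValueError here.
    """
    idle = []
    for line in smi_csv.strip().splitlines():
        index, used = (field.strip() for field in line.split(","))
        if int(used) <= max_used_mib:
            idle.append(int(index))
    return idle


def daemon_command(gpus: List[int], queue: str) -> List[str]:
    """Build a clearml-agent daemon command limited to the given GPUs."""
    if not gpus:
        raise ValueError("no idle GPUs to hand to the agent")
    gpu_arg = ",".join(str(g) for g in gpus)
    return ["clearml-agent", "daemon", "--queue", queue,
            "--gpus", gpu_arg, "--detached"]
```

For example, daemon_command(parse_idle_gpus(query_gpus()), "default") gives the command to start after the old daemon has been stopped; "default" is a placeholder queue name.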

Posted one year ago

Answers 6

I'm sorry, I'm quite new.
Do you mean the case where the daemon is running inside a docker container, or where the task itself is in a container?
The way I understood it, when I configure the task I set the base docker image and let it run with that.

Posted one year ago

It will stop running

Posted one year ago

Unless you're running in docker mode; in that case I think the task will continue running inside the container. You might need to check that.

Posted one year ago

Hi @ZealousFlamingo93, for remote development on your GPUs you can use clearml-session. Otherwise you would need to spin the daemons up and down.

Posted one year ago

If you run an agent in docker mode (--docker), the agent issues a docker run command and the task is executed inside a container. In that scenario I think that if you kill the daemon, the container will stay up and finish the job (I think so, but I haven't tested it).
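
Since this is untested, one way to check is to stop the daemon while a task is running and then look at what docker still reports. A rough sketch in Python, assuming the docker CLI is on the PATH; the function names are hypothetical, and it only lists containers without deciding which one belongs to the task.

```python
import subprocess
from typing import List, Tuple

# Tab-separated ID, image and status for every running container.
PS_COMMAND = ["docker", "ps", "--format", "{{.ID}}\t{{.Image}}\t{{.Status}}"]


def list_running_containers() -> str:
    """Run docker ps and return its raw output."""
    result = subprocess.run(PS_COMMAND, check=True, capture_output=True, text=True)
    return result.stdout


def parse_running_containers(ps_output: str) -> List[Tuple[str, str, str]]:
    """Turn docker ps output into (container_id, image, status) tuples."""
    containers = []
    for line in ps_output.splitlines():
        if not line.strip():
            continue
        container_id, image, status = line.split("\t", 2)
        containers.append((container_id, image, status))
    return containers
```

If the task's container still shows an "Up ..." status after the daemon is gone, the task survived the restart on that setup.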

Posted one year ago

Thanks for the reply!
I'll look into clearml-session.
But let's say that for now I stick with spinning the daemons up and down: if I take a daemon down while a task is already running, will the task stop, or will it continue to run?

Posted one year ago