This smells like a driver/image issue on the instance VM
What are you getting if you add this inside your code?
import os
os.system('nvidia-smi')
In the Task info tab there is no GPU listed, and PyTorch doesn't see the CUDA device.
Yeah, I'm starting to lean towards the enterprise solution more and more 😁
Thanks!
"Executing: ['docker', 'run', '-t', '--gpus', '"device=0"'" - so the container is executed with --gpus.
However, torch.cuda.is_available() returns False.
It seems that task.set_base_docker must be called with docker_image as well (otherwise docker_arguments doesn't propagate). Not sure if it's a bug or not, but I have a workaround now, thanks!
Is there an easy way to add a docker argument in the python script?
On the task itself in the UI you can edit the docker arguments and add any missing flags
(task.set_base_docker will do the same from code)
You can also edit the configuration and always add this flag:
None
I know about clearml.conf but wanted to avoid ssh-ing through 50 instances to edit it.
LOL yeah, btw: this is exactly the reason the enterprise version has a vault feature, so one could edit the base configuration in the UI and it automatically propagates everywhere
but docker_arguments doesn't propagate if I leave docker_image as None
yeah, that's correct, you have to select a container to be used
task.set_base_docker does the job, but docker_arguments doesn't propagate if I leave docker_image as None (it just uses both the image and the arguments from the agent's clearml.conf). If I explicitly state docker_image and docker_arguments in task.set_base_docker, it works fine.
I'll check the docker command next time this happens, thanks! As for the machines, all of them have GPUs (they are in fact identical/cloned VMs), and if I rerun the task and get the same exact machine again it works, so it's some part of "GPU detection" or something. We'll hopefully know more once it happens again, thanks.
Hi @<1631102016807768064:profile|ZanySealion18>
ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up.
what do you mean by "does not pick up"? Is it that the container is up but was not executed with --gpus, so there is no GPU access?
I'm not sure how to debug it, that would be my first question. So I should first check if docker is executed with --gpus? I'll pay attention to this next time this happens, thanks.
The first line of the Task console log should have the exact docker command that was used; this could be a good start
Also check if there is any chance another agent is listening to this queue; maybe it actually ran somewhere without a GPU at all?
Found this, seems to be exactly this: None
It appears that running docker with --privileged resolves the issue, which is easier for me than editing all of the instances I've already created. Is there an easy way to add a docker argument in the python script?
I've tried task.set_base_docker(docker_arguments="--privileged") right after Task.init but it doesn't seem to work.
Thanks!
Failed to initialize NVML: Unknown Error
yeah, this is a driver issue. I think you need to check whether the drivers in the VM image match the GPU on that machine