Clearml (Remote Execution) Sometimes Doesn'T "Pick-Up" Gpu. After I Rerun The Task It Picks It Up. Seems Random, Doesn'T Happen Too Often (Maybe Once In 30-40 Times) And I Cannot Seem To Detect Any Pattern. Did Anyone Else Notice This? Agents Are Vms On G

Answered

ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up. Seems random, doesn't happen too often (maybe once in 30-40 times) and I cannot seem to detect any pattern. Did anyone else notice this?
Agents are VMs on GCP running in docker, driver installed and everything.

  				
Posted 
	one year ago

					More
				  		
  Report
		
					ZanySealion18
				
					0
					 × 1

Votes Newest

Answers 14

I'm not sure how to debug it, that would be my first question. So I should first check if docker is executed with --gpus? I'll pay attention to this next time this happens, thanks.

The first line of the Task console log should have the exact docker command that was used, this could be a good start
also check if there is any chance there is another agent listening to this queue, maybe it actually runs somewhere without a gpu at all?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I know about clearml.conf but wanted to avoid ssh-ing through 50 instances to edit it.
task.set_base_docker does the job, but docker_arguments doesn't propagate if I leave docker_image as None (it just uses both image and arguments from clearml.conf of the agent). If I explicitly state docker_image and docker_arguments in task.set_base_docker it works fine.

  				
Posted 
	one year ago

					More
				  		
  Report
		
					ZanySealion18
				
					0
					 × 1

I know about clearml.conf but wanted to avoid ssh-ing through 50 instances to edit it.

LOL yeah, btw: this is exactly the reason the enterprise version has a vault feature, so one could edit the base configuration in the UI and it automatically propagates everywhere

but docker_arguments doesn't propagate if I leave docker_image as None

yeah, that's correct, you have to select a container to be used

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

This smells like a driver/image issue on the instance VM
What are you getting if add this inside your code?

os.system('nvidia-smi')

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Failed to initialize NVML: Unknown Error

yeah this is a driver issue. I think you need to check the VM image if the drivers match the GPU on that machine

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

"Executing: ['docker', 'run', '-t', '--gpus', '"device=0"'" - so the container is executed with --gpus.
However, torch.cuda.is_available() returns False.

  				
Posted 
	one year ago

					More
				  		
  Report
		
					ZanySealion18
				
					0
					 × 1

Failed to initialize NVML: Unknown Error

  				
Posted 
	one year ago

					More
				  		
  Report
		
					ZanySealion18
				
					0
					 × 1

I'll check the docker command next time this happens, thanks! For the machines, all of them have GPUs (and are in fact identical/cloned VMs) and if I rerun it and get the same exact machine again it works so it's some part of "GPU detection" or something, we'll know more hopefully once it happens again, thanks.

  				
Posted 
	one year ago

					More
				  		
  Report
		
					ZanySealion18
				
					0
					 × 1

Hi @<1631102016807768064:profile|ZanySealion18>

ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up.

what do you mean by "does not pick up"? is it the container is up but not executed with --gpus , so no GPU access?

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Is there an easy way to add a docker argument in the python script?

On the task it self in the UI you can edit the docker arguments and add any missing flags
(task.set_base_docker will do the same from code)
You can also edit the configuration and always add this flag:
None

  				
Posted 
	one year ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

It seems that task.set_base_docker must be called with docker_image as well (otherwise docker_arguments don't propagate), not sure if it's a bug or not, but I have a workaround now, thanks!

  				
Posted 
	one year ago

					More
				  		
  Report
		
					ZanySealion18
				
					0
					 × 1

In the Task info tab there is no GPU and pytorch doesn't see the cuda device.
I'm not sure how to debug it, that would be my first question. So I should first check if docker is executed with --gpus? I'll pay attention to this next time this happens, thanks.

  				
Posted 
	one year ago

					More
				  		
  Report
		
					ZanySealion18
				
					0
					 × 1

Found this, seems to be exactly this: None
It appears that running docker as --privileged resolves the issue which is easier for me than to edit all of the instances I've already created. Is there an easy way to add a docker argument in the python script?
I've tried task.set_base_docker(docker_arguments="--privileged") right after Task.init but it doesn't seem to work.
Thanks!

  				
Posted 
	one year ago

					More
				  		
  Report
		
					ZanySealion18
				
					0
					 × 1

Yeah, I'm starting to lean towards enterprise solution more and more 😁
Thanks!

  				
Posted 
	one year ago

					More
				  		
  Report
		
					ZanySealion18
				
					0
					 × 1

Write your answer

2K Views

14 Answers

one year ago