Answered
ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up. Seems random, doesn't happen too often (maybe once in 30-40 times) and I cannot seem to detect any pattern. Did anyone else notice this? Agents are VMs on G

ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up. Seems random, doesn't happen too often (maybe once in 30-40 times) and I cannot seem to detect any pattern. Did anyone else notice this?
Agents are VMs on GCP running in docker, driver installed and everything.

  
  
Posted 6 months ago

Answers 14


I'll check the docker command next time this happens, thanks! For the machines, all of them have GPUs (and are in fact identical/cloned VMs) and if I rerun it and get the same exact machine again it works so it's some part of "GPU detection" or something, we'll know more hopefully once it happens again, thanks.

  
  
Posted 6 months ago

I'm not sure how to debug it, that would be my first question. So I should first check if docker is executed with --gpus? I'll pay attention to this next time this happens, thanks.

The first line of the Task console log should show the exact docker command that was used; that would be a good place to start.
Also check whether another agent might be listening on this queue. Maybe the task actually ran somewhere without a GPU at all?

  
  
Posted 6 months ago
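
The check suggested above (looking at the docker command in the console log) can be scripted. The following is a minimal sketch, assuming the log line looks like the "Executing: ['docker', 'run', ...]" line quoted elsewhere in this thread. The function names are hypothetical helpers, not part of ClearML.

```python
import re
from typing import List


def docker_args_from_log_line(line: str) -> List[str]:
    """Return the single-quoted tokens that follow "Executing:" in a log line.

    Assumption: the agent prints the command as a Python-style list of
    single-quoted strings, as in the line quoted in this thread.
    """
    if "Executing:" not in line:
        return []
    payload = line.split("Executing:", 1)[1]
    # Each argument is wrapped in single quotes; collect what is inside them.
    return re.findall(r"'([^']*)'", payload)


def has_gpu_flag(line: str) -> bool:
    """True if the docker command in the log line requests GPUs via --gpus."""
    args = docker_args_from_log_line(line)
    return any(a == "--gpus" or a.startswith("--gpus=") for a in args)


if __name__ == "__main__":
    sample = "Executing: ['docker', 'run', '-t', '--gpus', '\"device=0\"'"
    print(has_gpu_flag(sample))
```

If this returns False for a failed run, the container was started without GPU access. If it returns True and the GPU is still missing, the problem is more likely inside the container or in the driver.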

I know about clearml.conf but wanted to avoid ssh-ing through 50 instances to edit it.

LOL yeah, btw: this is exactly the reason the enterprise version has a vault feature, so one could edit the base configuration in the UI and it automatically propagates everywhere

but docker_arguments doesn't propagate if I leave docker_image as None

yeah, that's correct, you have to select a container to be used

  
  
Posted 4 months ago

Is there an easy way to add a docker argument in the python script?

On the task itself, in the UI, you can edit the docker arguments and add any missing flags
(task.set_base_docker will do the same from code)
You can also edit the configuration and always add this flag:
None

  
  
Posted 4 months ago

Found this, seems to be exactly this: None
It appears that running docker with --privileged resolves the issue, which is easier for me than editing all of the instances I've already created. Is there an easy way to add a docker argument in the python script?
I've tried task.set_base_docker(docker_arguments="--privileged") right after Task.init but it doesn't seem to work.
Thanks!

  
  
Posted 4 months ago

This smells like a driver/image issue on the instance VM
What do you get if you add this inside your code?

import os
os.system('nvidia-smi')  # run nvidia-smi from inside the task
  
  
Posted 4 months ago
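
The nvidia-smi check above can be extended into a small report that also records what PyTorch sees, which makes the rare failing runs easier to compare with the good ones. This is a rough sketch rather than an official ClearML diagnostic. `gpu_diagnostics` is a hypothetical helper, and the torch check is skipped if torch cannot be imported.

```python
import shutil
import subprocess


def gpu_diagnostics() -> dict:
    """Collect basic GPU visibility information from inside the container."""
    report = {
        "nvidia_smi_found": False,     # is the nvidia-smi binary on PATH?
        "nvidia_smi_ok": False,        # did it exit with return code 0?
        "nvidia_smi_output": "",       # combined stdout/stderr, e.g. an NVML error
        "torch_cuda_available": None,  # None means torch could not be imported
    }
    exe = shutil.which("nvidia-smi")
    if exe:
        report["nvidia_smi_found"] = True
        try:
            proc = subprocess.run([exe], capture_output=True, text=True, timeout=30)
            report["nvidia_smi_ok"] = proc.returncode == 0
            report["nvidia_smi_output"] = (proc.stdout + proc.stderr).strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report["nvidia_smi_output"] = repr(exc)
    try:
        import torch  # optional; only checked when it can be imported
        report["torch_cuda_available"] = bool(torch.cuda.is_available())
    except Exception:
        pass
    return report


if __name__ == "__main__":
    print(gpu_diagnostics())
```

Printing the report at the start of the task puts it in the console log, so a run that fails with "Failed to initialize NVML" can be told apart from one where nvidia-smi works but torch does not see the device.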

"Executing: ['docker', 'run', '-t', '--gpus', '"device=0"'" - so the container is executed with --gpus.
However, torch.cuda.is_available() returns False.

  
  
Posted 4 months ago

I know about clearml.conf but wanted to avoid ssh-ing through 50 instances to edit it.
task.set_base_docker does the job, but docker_arguments doesn't propagate if I leave docker_image as None (it just uses both image and arguments from clearml.conf of the agent). If I explicitly state docker_image and docker_arguments in task.set_base_docker it works fine.

  
  
Posted 4 months ago

It seems that task.set_base_docker must be called with docker_image as well (otherwise docker_arguments don't propagate), not sure if it's a bug or not, but I have a workaround now, thanks!

  
  
Posted 4 months ago
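
The workaround described above (always passing docker_image together with docker_arguments) can be wrapped in a small helper so the two are never set separately. This is a sketch under the assumptions reported in this thread. `apply_docker_settings` is a hypothetical name, and the image, project and task names in the usage comment are placeholders.

```python
def apply_docker_settings(task, image: str, arguments: str) -> None:
    """Set the container image and docker arguments on a task in one call.

    Per this thread, docker_arguments passed without docker_image did not
    propagate to the agent, so an empty image is rejected here.
    """
    if not image:
        raise ValueError("docker_image must be set for docker_arguments to apply")
    task.set_base_docker(docker_image=image, docker_arguments=arguments)


# Usage with a real task (requires clearml to be installed and configured):
#   from clearml import Task
#   task = Task.init(project_name="my-project", task_name="gpu-job")  # placeholder names
#   apply_docker_settings(task, "my-registry/my-image:tag", "--privileged")  # placeholder image
```

Rejecting an empty image up front turns the silent fallback to the agent's clearml.conf defaults into an explicit error at launch time.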

Failed to initialize NVML: Unknown Error

Yeah, this is a driver issue. I think you need to check whether the drivers in the VM image match the GPU on that machine.

  
  
Posted 4 months ago

Yeah, I'm starting to lean towards enterprise solution more and more 😁
Thanks!

  
  
Posted 4 months ago

Failed to initialize NVML: Unknown Error

  
  
Posted 4 months ago

In the Task info tab there is no GPU and pytorch doesn't see the cuda device.
I'm not sure how to debug it, that would be my first question. So I should first check if docker is executed with --gpus? I'll pay attention to this next time this happens, thanks.

  
  
Posted 6 months ago

Hi @<1631102016807768064:profile|ZanySealion18>

ClearML (remote execution) sometimes doesn't "pick-up" GPU. After I rerun the task it picks it up.

What do you mean by "does not pick up"? Is the container up but not executed with --gpus, so it has no GPU access?

  
  
Posted 6 months ago