No problem!
Yes, I’m running manual mode and I only see one GPU tracked in the resource monitoring. I’m using trains 0.16.4.
Everything seems to work as it should, but if I run without trains, my process is only visible on the one GPU I made visible with CUDA_VISIBLE_DEVICES. If I run with Trains, it’s “registered” on all the other devices as well when inspected with sudo fuser -v /dev/nvidia*.
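For context, a minimal sketch of the baseline case (no Task.init), assuming GPU 0 is the one made visible; the project layout and tensor size are just illustrative:
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # must be set before CUDA is initialized

import torch

# Allocate a tensor so the process holds GPU memory and shows up in fuser.
x = torch.randn(1024, 1024, device="cuda")
print(f"PID {os.getpid()} is using {torch.cuda.get_device_name(0)}")
input("Run `sudo fuser -v /dev/nvidia*` in another shell, then press Enter to exit...")
```
With this script (and no Task.init anywhere), the PID should only appear under /dev/nvidia0.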
I can confirm that commenting out Task.init(…) fixes it.
You can reproduce it simply by taking the ClearML PyTorch MNIST example https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_mnist.py . To see it happening clearly, it’s easiest to get the GPU allocated before calling task = Task.init(…), and to avoid crashing because the task variable is missing, you can embed() just before and after Task.init(…) using IPython. You also n...
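A minimal sketch of that reproduction, assuming a single visible GPU and using IPython’s embed() as the two breakpoints (the MNIST training loop itself is omitted, and the project/task names are made up):
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from IPython import embed
from clearml import Task  # with trains 0.16.4 this would be `from trains import Task`

# Grab GPU memory *before* Task.init, so the process is already visible
# under /dev/nvidia0 only.
x = torch.randn(1024, 1024, device="cuda")

embed()  # first check: `sudo fuser -v /dev/nvidia*` shows this PID on /dev/nvidia0 only

task = Task.init(project_name="debug", task_name="gpu-registration-repro")  # illustrative names

embed()  # second check: the PID now also appears on the other /dev/nvidia* devices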
Thanks guys 🙏 I was looking for a way to do it that doesn’t require code changes, which I got - Much appreciated!
This is the output of sudo fuser -v /dev/nvidia* for GPUs 0, 1 and 2 when I run a single experiment on GPU 0, a different user is running on GPU 1, and no one is running on GPU 2 (the remaining 7 GPUs are omitted but look like GPU 2). This only happens when Task.init is called; it never happens otherwise.
```
/dev/nvidia0:  jdh  2448  F....  python
/dev/nvidia1:  je    315  F...m  python3
               jdh  2448  F....  python
/dev/nvidia2:  jdh  ...
```
I’ve verified that CUDA_VISIBLE_DEVICES doesn’t get changed during the Task.init call or anywhere else during the script.
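For reference, the kind of check I mean, as a sketch (names are illustrative, assuming the clearml package):
```python
import os
from clearml import Task

before = os.environ.get("CUDA_VISIBLE_DEVICES")
task = Task.init(project_name="debug", task_name="env-check")  # illustrative names
after = os.environ.get("CUDA_VISIBLE_DEVICES")

# In my runs the variable itself is untouched; only the fuser output changes.
print(f"CUDA_VISIBLE_DEVICES before: {before!r}  after: {after!r}")
```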
I’ve created this issue https://github.com/allegroai/clearml/issues/305
Mostly manual execution for now
Yes, thank you AgitatedDove14