Sorry, I missed this message...
I'm assuming you are running in manual mode (i.e. not through the agent); in that case we do not change CUDA_VISIBLE_DEVICES.
What do you see in the resource monitoring? Is it a single GPU or multiple GPUs?
(Check the :monitor:gpu in the Scalars tab under Results.)
Also, what's the Trains/ClearML version you are using, and the OS?
This is the output of sudo fuser -v /dev/nvidia* for GPUs 0, 1, and 2 when I run a single experiment on GPU 0, a different user is running on GPU 1, and no one is running on GPU 2 (the remaining 7 GPUs are omitted but look similar to GPU 2).
This only happens when Task.init is called; it never happens otherwise.
/dev/nvidia0:        jdh    2448 F.... python
/dev/nvidia1:        je      315 F...m python3
                     jdh    2448 F.... python
/dev/nvidia2:        jdh    2448 F.... python
Yes, I'm running manual mode and I only see one GPU tracked in the resource monitoring. I'm using trains 0.16.4.
Everything seems to work as it should, but if I run without Trains, my process is only visible on the one GPU I made visible with CUDA_VISIBLE_DEVICES. If I run with Trains, it's "registered" on all the other devices as well when inspected with sudo fuser -v /dev/nvidia*.
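Roughly what my run looks like, as a minimal sketch (the model code is omitted and the project/task names are placeholders):

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # restrict this process to GPU 0 before CUDA is initialized

import torch
from trains import Task  # trains 0.16.4; with the clearml package it's `from clearml import Task`

# With this line present, sudo fuser -v /dev/nvidia* shows my PID on every device;
# commenting it out leaves the PID only on /dev/nvidia0.
task = Task.init(project_name='examples', task_name='repro')

x = torch.randn(8, 8).cuda()  # allocate on the single visible device
print(x.device)  # cuda:0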
IntriguedRat44 If the monitoring only shows a single GPU (the selected one), it means it reads the correct CUDA_VISIBLE_DEVICES (this is how it knows you are only using the selected GPU and not all of them).
There is nothing else in the code that will change the OS environment.
Could you print os.environ['CUDA_VISIBLE_DEVICES'] while running the code to verify?
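For example, somewhere after Task.init:

import os
# Trains only reads this value, it never writes it, so it should still be what you set
print(os.environ.get('CUDA_VISIBLE_DEVICES'))  # expected: '0' in your case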
I can confirm that commenting out Task.init(…) fixes it.
You can reproduce simply by taking the ClearML PyTorch MNIST example: https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_mnist.py
To see it happening clearly, it's easiest if you get the GPU allocated before calling task = Task.init(…); to avoid crashing because you're missing the task variable, you can embed IPython just before and after. You also need the process ID of the main process to check against sudo fuser -v /dev/nvidia*.
Summarizing: I move task = Task.init(…) to just before the for epoch in range(…) loop and replace it with
import psutil
current_process_pid = psutil.Process().pid
print(current_process_pid)  # e.g. 12971
import IPython; IPython.embed()
task = Task.init(project_name='examples', task_name='pytorch mnist train')
import IPython; IPython.embed()
You can then run the example until it reaches the first embed and check that the printed main-process PID is only visible on your designated device. Then quit the embed to see Task.init cause the problem, after which you are waiting in the second embed. Quit that one to see training work fine.
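(If it's easier, you can run the same check from inside the embed; a sketch using subprocess, noting that fuser -v prints its table to stderr:)

import subprocess
# Equivalent to running `sudo fuser -v /dev/nvidia*` in another shell
result = subprocess.run('sudo fuser -v /dev/nvidia*', shell=True,
                        capture_output=True, text=True)
print(result.stderr)  # fuser -v writes the PID table to stderr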
You can then try the whole thing again without Task.init, but you need to remove the reporting in that case; otherwise you get:
Logger.current_logger().report_scalar(
AttributeError: 'NoneType' object has no attribute 'report_scalar'
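(As a sketch, you could instead guard the reporting so the same script runs with or without Task.init; loss_value and step below are placeholders:)

from trains import Task  # or: from clearml import Task

loss_value, step = 0.123, 1  # placeholders for your real loss and iteration
task = Task.current_task()  # returns None when Task.init was never called
if task is not None:
    task.get_logger().report_scalar(
        title='loss', series='train', value=loss_value, iteration=step)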
I haven't tested on any other versions than trains 0.16.4, so I don't know if it happens in the newer versions.