IntriguedRat44 how do I reproduce it?
Can you confirm that commenting out the Task.init(..) call fixes it?
No problem!
Yes, I’m running manual mode and I only see one GPU tracked in the resource monitoring. I’m using trains 0.16.4.
Everything seems to work as it should, but if I run without Trains, my process is only visible on the one GPU I made visible with CUDA_VISIBLE_DEVICES. If I run with Trains, it’s “registered” on all the other devices as well when inspected with sudo fuser -v /dev/nvidia*.
IntriguedRat44 could I ask you to open a GitHub issue on it?
I really do not want it to slip through our fingers...
(BTW: meanwhile I was not able to reproduce it. What OS and NVIDIA driver version are you using?)
I can confirm that commenting out Task.init(…) fixes it.
You can reproduce simply by taking the ClearML PyTorch MNIST example https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_mnist.py .
To see it happening clearly, it’s easiest to get the GPU allocated before calling task = Task.init(…). To avoid crashing because the task variable is missing, you can drop into an IPython embed just before and just after Task.init(…). You also need the process ID of the main process to check against the output of sudo fuser -v /dev/nvidia*.
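As a complement to sudo fuser, a process can also inspect its own open file descriptors on Linux without root. The sketch below is only an assumption about how one might do that: the helper name open_device_paths and its parameters are made up for illustration, and it only sees open descriptors (the F flag in fuser), not memory-mapped files (the m flag).

```python
import os


def open_device_paths(fd_dir="/proc/self/fd", prefix="/dev/nvidia"):
    """Return the sorted device paths this process currently has open.

    Hypothetical helper: reads the symlinks under /proc/<pid>/fd (Linux only)
    and keeps the targets that start with `prefix`.
    """
    paths = set()
    try:
        entries = os.listdir(fd_dir)
    except OSError:
        # No /proc on this platform, or the process has already exited.
        return []
    for name in entries:
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir and readlink.
            continue
        if target.startswith(prefix):
            paths.add(target)
    return sorted(paths)


if __name__ == "__main__":
    print(open_device_paths())
```

Calling it once before and once after Task.init(…) inside the two embeds would show whether additional /dev/nvidia* entries appear for the main process.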
Summarizing, I move task = Task.init(…) to just before the for epoch in range(…) loop and replace it with:
    import psutil
    current_process_pid = psutil.Process().pid
    print(current_process_pid)  # e.g. 12971
    import IPython; IPython.embed()
    task = Task.init(project_name='examples', task_name='pytorch mnist train')
    import IPython; IPython.embed()
You can then run the example until it reaches the first embed and check that the printed main process is only visible on your designated device. Then you can quit the embed to let Task.init run and trigger the problem, after which you are waiting in the second embed. You can then quit that one to see that training works fine.
You can then try the whole thing again without Task.init, but you need to remove the reporting in that case, otherwise you get:
    Logger.current_logger().report_scalar(
    AttributeError: 'NoneType' object has no attribute 'report_scalar'
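Instead of deleting the reporting calls, one option is to guard them. This is a minimal sketch, assuming the logger object is whatever Logger.current_logger() returns (None when Task.init was never called); the wrapper name report_scalar_safe is hypothetical and not part of trains or clearml.

```python
def report_scalar_safe(logger, title, series, value, iteration):
    """Forward to logger.report_scalar only when a logger exists.

    Hypothetical wrapper: returns True if the scalar was reported and
    False if `logger` is None (e.g. Task.init was commented out).
    """
    if logger is None:
        return False
    logger.report_scalar(title=title, series=series, value=value, iteration=iteration)
    return True
```

In the MNIST example this would be called as report_scalar_safe(Logger.current_logger(), "train", "loss", loss.item(), step), where step stands for whatever iteration counter the script uses, so the same script runs with and without Task.init.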
I haven’t tested on any version other than trains 0.16.4, so I don’t know if it happens in the new clearml package.
This is the output of sudo fuser -v /dev/nvidia* for GPUs 0, 1 and 2 when I run a single experiment on GPU 0, a different user is running on GPU 1, and no one is running on GPU 2 (the remaining 7 GPUs are omitted but look like GPU 2).
This only happens when Task.init is called, never without it.
    /dev/nvidia0:  jdh   2448 F.... python
    /dev/nvidia1:  je     315 F...m python3
                   jdh   2448 F.... python
    /dev/nvidia2:  jdh   2448 F.... python
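To compare runs with and without Task.init, it can help to turn the fuser output above into a PID-to-devices mapping. The parser below is a rough sketch, not an official tool: the function name devices_by_pid is made up, and it assumes each entry is four whitespace-separated fields (user, PID, access flags, command) with no spaces inside the command name.

```python
from collections import defaultdict


def devices_by_pid(fuser_text):
    """Map each PID to the device files it appears under in `fuser -v` output."""
    devices = defaultdict(list)
    current = None
    tokens = fuser_text.split()
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("/dev/") and tok.endswith(":"):
            # A new device section starts, e.g. "/dev/nvidia0:".
            current = tok[:-1]
            i += 1
        elif current is not None and i + 3 < len(tokens) and tokens[i + 1].isdigit():
            # Assumed entry layout: USER PID ACCESS COMMAND.
            devices[int(tokens[i + 1])].append(current)
            i += 4
        else:
            # Header words or anything unexpected: skip one token.
            i += 1
    return dict(devices)


# Sample taken from the output posted in this thread.
SAMPLE = """
/dev/nvidia0: jdh 2448 F.... python
/dev/nvidia1: je 315 F...m python3
              jdh 2448 F.... python
/dev/nvidia2: jdh 2448 F.... python
"""

if __name__ == "__main__":
    for pid, devs in devices_by_pid(SAMPLE).items():
        print(pid, devs)
```

On the sample, PID 2448 maps to all three devices while PID 315 maps only to /dev/nvidia1, which is the symptom being reported.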
IntriguedRat44 If the monitoring only shows a single GPU (the selected one), it means it reads the correct CUDA_VISIBLE_DEVICES (this is how it knows that you are only using the selected GPU and not all of them).
There is nothing else in the code that will change the OS environment.
Could you print os.environ['CUDA_VISIBLE_DEVICES'] while running the code to verify?
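One way to run that check is to read the variable at several points in the script and compare the values. This is a small sketch under that assumption; the helper name visible_devices is hypothetical, and it uses .get so an unset variable gives None instead of raising KeyError.

```python
import os


def visible_devices(environ=None):
    """Return CUDA_VISIBLE_DEVICES as a list of device strings.

    Returns None when the variable is not set at all, and an empty list
    when it is set to an empty string.
    """
    env = os.environ if environ is None else environ
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [d.strip() for d in raw.split(",") if d.strip()]


if __name__ == "__main__":
    print("CUDA_VISIBLE_DEVICES:", visible_devices())
```

Printing visible_devices() right before and right after Task.init(…) shows whether the value changes during the call.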
Hi IntriguedRat44
Sorry, I missed this message...
I'm assuming you are running in manual mode (i.e. not through the agent); in that case we do not change CUDA_VISIBLE_DEVICES.
What do you see in the resource monitoring? Is it a single GPU or multiple GPUs?
(Check the :monitor:gpu graph in the Scalars tab under Results.)
Also, what Trains/ClearML version are you using, and which OS?
Thanks IntriguedRat44 !
I'll follow up on GitHub 🙂
I’ve verified that CUDA_VISIBLE_DEVICES doesn’t get changed during the Task.init call or anywhere else during the script.
I’ve created this issue https://github.com/allegroai/clearml/issues/305