I can confirm that commenting out `Task.init(…)` fixes it.
You can reproduce it simply with the ClearML PyTorch MNIST example: https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_mnist.py
To see it happening clearly, it’s easiest if you get the GPU allocated before calling `task = Task.init(…)`. To avoid crashing because the `task` variable is missing, you can embed just before and after `Task.init(…)` using IPython. You also need the process ID of the main process so you can check it against `sudo fuser -v /dev/nvidia*`.
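In case it helps, this is a minimal sketch of how I force the GPU to be allocated before `Task.init(…)` runs; `cuda:0` is only an example device, use whichever GPU the training is meant to run on:

```python
import torch

# Touch the designated GPU so the CUDA context (and some memory) is
# allocated before Task.init(...) runs. "cuda:0" is just an example;
# pick the device you actually train on.
device = torch.device("cuda:0")
warmup = torch.zeros(1, device=device)        # forces CUDA context creation
print(torch.cuda.memory_allocated(device))    # confirm something is allocated
```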
Summarizing, I move `task = Task.init(…)` to just before the `for epoch in range(…)` loop and replace it with:

```python
import psutil
current_process_pid = psutil.Process().pid
print(current_process_pid)  # e.g. 12971

import IPython; IPython.embed()
task = Task.init(project_name='examples', task_name='pytorch mnist train')
import IPython; IPython.embed()
```
You can then run the example until it reaches the first embed and check that the printed main-process PID is only visible on your designated device. Then you can quit the embed to see `Task.init` cause the problem, after which you are waiting in the second embed. You can then quit that one to see training work fine.
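If you prefer to do that check from inside the IPython embed rather than a second terminal, here is a rough sketch of the same `fuser` check via `subprocess` (assumes `fuser` is available and you have the required permissions):

```python
import subprocess

# Same check as running `sudo fuser -v /dev/nvidia*` in another shell,
# filtered to the main-process PID printed by the snippet above.
pid = str(current_process_pid)
result = subprocess.run(
    "sudo fuser -v /dev/nvidia*",
    shell=True, capture_output=True, text=True,
)
# fuser -v writes its verbose table to stderr
for line in result.stderr.splitlines():
    if pid in line.split():
        print(line)  # devices on which the main process currently shows up
```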
You can then try the whole thing again without `Task.init`, but in that case you need to remove the reporting calls, otherwise you get:

```
    Logger.current_logger().report_scalar(
AttributeError: 'NoneType' object has no attribute 'report_scalar'
```
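If you would rather keep the reporting lines than delete them, a minimal sketch of guarding them instead (my own workaround, not part of the example; `loss` and `iteration` are the variables from the example’s train loop):

```python
from trains import Logger  # `from clearml import Logger` on the new package

logger = Logger.current_logger()
if logger is not None:  # None when Task.init(...) was never called
    logger.report_scalar(
        title="train", series="loss", value=loss.item(), iteration=iteration,
    )
```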
I haven’t tested any versions other than trains 0.16.4, so I don’t know whether it also happens with the new `clearml` package.