When I Run Experiments I Set

I can confirm that commenting out Task.init(…) fixes it.

You can reproduce it simply by taking the ClearML PyTorch MNIST example: https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_mnist.py

To see it happening clearly, it’s easiest to get the GPU allocated before calling task = Task.init(…). To avoid crashing because the task variable is missing, you can embed an IPython shell just before and just after Task.init(…). You also need the process ID of the main process so you can check it against the output of sudo fuser -v /dev/nvidia*.

To summarize, I move task = Task.init(…) to just before the for epoch in range(…) loop and replace it with:

    import psutil
    current_process_pid = psutil.Process().pid
    print(current_process_pid)  # e.g. 12971
    import IPython; IPython.embed()
    task = Task.init(project_name='examples', task_name='pytorch mnist train')
    import IPython; IPython.embed()

You can then run the example until it reaches the first embed and check that the printed main process is only visible on your designated device. Then quit the embed to see Task.init cause the problem, after which you are waiting in the second embed. Quit that one to see training work fine.
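If you would rather do the device check from inside the embed than with fuser, something like the sketch below might work. It assumes Linux, where /proc/<pid>/fd lists a process's open file descriptors; nvidia_devices_for_pid is a name I made up for illustration, and the proc_root parameter only exists so the function can be pointed at a fake directory. It also only reports device files the process currently holds open, so treat it as a rough cross-check and not a replacement for fuser.

```python
import os


def nvidia_devices_for_pid(pid, proc_root="/proc"):
    """Return the sorted /dev/nvidia* paths that process `pid` has open.

    Hypothetical helper, Linux only: it resolves the symlinks in
    <proc_root>/<pid>/fd and keeps the targets under /dev/nvidia*.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    devices = set()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if target.startswith("/dev/nvidia"):
            devices.add(target)
    return sorted(devices)


# Usage from inside the IPython embed (your own process needs no sudo):
# nvidia_devices_for_pid(os.getpid())
```

Calling it in the first embed and again in the second one should show whether extra /dev/nvidia* entries appear after Task.init.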

You can then try the whole thing again without Task.init, but in that case you need to remove the reporting calls, otherwise you get:

    Logger.current_logger().report_scalar(
    AttributeError: 'NoneType' object has no attribute 'report_scalar'

I haven’t tested on any version other than trains 0.16.4, so I don’t know whether it happens in the new clearml package.
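For the run without Task.init, an alternative to deleting the reporting calls is to guard them. The sketch below is not part of the example script; report_scalar_if_logging is a hypothetical helper name, and it relies on Logger.current_logger() returning None when no task has been initialized, which is what the AttributeError above suggests.

```python
def report_scalar_if_logging(logger, title, series, value, iteration):
    """Forward to logger.report_scalar() when a logger exists, else do nothing.

    Hypothetical helper. Returns True if the value was reported and
    False if it was skipped because `logger` is None.
    """
    if logger is None:
        return False
    logger.report_scalar(title=title, series=series, value=value, iteration=iteration)
    return True


# Usage inside the training loop (assumes `from trains import Logger`,
# or `from clearml import Logger` with the newer package):
# report_scalar_if_logging(Logger.current_logger(), "train", "loss", loss.item(), step)
```

That way the same script runs with and without Task.init, which makes the two cases easier to compare.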

  				
Posted 3 years ago
IntriguedBat44
