When I Run Experiments I Set

Answered

When I run experiments, I set CUDA_VISIBLE_DEVICES to a single integer so that only that device is available to the main process (as is common). I can verify that this works with sudo fuser -v /dev/nvidia*, which shows a single process on the single device I chose.

However, I observe that a subsequent call to Task.init() in the Python script somehow overrides this and “registers” the main process on all GPU devices of the node. This can be seen by inspecting sudo fuser -v /dev/nvidia* after the call to Task.init(). The process ID that was originally registered only on the device chosen with CUDA_VISIBLE_DEVICES is now registered on all GPU devices of the node.

I can only see this process on the other devices when using sudo fuser, not with gpustat or nvidia-smi. I also cannot see any memory being allocated on the other devices.

I am slightly worried about this behaviour. Does anyone know something about this?
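For readers who want to run the same check from inside the script without sudo, here is a minimal sketch. It assumes a Linux host with a /proc filesystem; the helper name open_nvidia_devices and its proc_root parameter are hypothetical and not part of Trains/ClearML. It lists only the /dev/nvidiaN files that a process holds open as file descriptors, so it may not match fuser exactly (fuser also reports memory-mapped files).

```python
import os
import re


def open_nvidia_devices(pid="self", proc_root="/proc"):
    """Return the sorted /dev/nvidiaN paths that `pid` holds open.

    Hypothetical helper: reads the symlinks in <proc_root>/<pid>/fd,
    which is Linux-specific. Control devices such as /dev/nvidiactl
    are ignored because they are not tied to a single GPU.
    """
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    devices = set()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor may have been closed while we were iterating.
            continue
        if re.fullmatch(r"/dev/nvidia\d+", target):
            devices.add(target)
    return sorted(devices)
```

Calling print(open_nvidia_devices()) once just before and once just after Task.init() should show whether additional device files were opened by the call.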

  				
Posted 3 years ago by IntriguedBat44


Answers 10

IntriguedRat44 how do I reproduce it?
Can you confirm that commenting out the Task.init(..) call fixes it?

  				
Posted 3 years ago by AgitatedDove14

No problem!
Yes, I’m running in manual mode and I only see one GPU tracked in the resource monitoring. I’m using trains 0.16.4.

Everything seems to work as it should, but if I run without Trains, my process is only visible on the one GPU I made visible with CUDA_VISIBLE_DEVICES. If I run with Trains, it is “registered” on all other devices as well when inspected with sudo fuser -v /dev/nvidia*.

  				
Posted 3 years ago by IntriguedBat44

IntriguedRat44 could I ask you to open a GitHub issue on it?
I really do not want it to slip through our fingers...
(BTW: meanwhile I was not able to reproduce it. Which OS and NVIDIA driver version are you using?)

  				
Posted 3 years ago by AgitatedDove14

I can confirm that commenting out Task.init(…) fixes it.

You can reproduce it simply by taking the ClearML PyTorch MNIST example: https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_mnist.py

To see it happen clearly, it is easiest to get the GPU allocated before calling task = Task.init(…). To avoid crashing because the task variable is missing, you can drop into an IPython shell just before and just after Task.init(…). You also need the process ID of the main process to check against sudo fuser -v /dev/nvidia*.

In summary, I move task = Task.init(…) to just before the for epoch in range(…) loop and replace it with:

    import psutil
    current_process_pid = psutil.Process().pid
    print(current_process_pid)  # e.g. 12971
    import IPython; IPython.embed()  # first embed: check fuser before Task.init
    task = Task.init(project_name='examples', task_name='pytorch mnist train')
    import IPython; IPython.embed()  # second embed: check fuser after Task.init

You can then run the example until it reaches the first embed and check that the printed main process is visible only on your designated device. Then quit the embed so that Task.init runs, which triggers the problem, and you land in the second embed. Quit that one to see training work fine.

You can then try the whole thing again without Task.init, but in that case you need to remove the reporting calls; otherwise you get:

    Logger.current_logger().report_scalar(
    AttributeError: 'NoneType' object has no attribute 'report_scalar'

I haven’t tested any version other than trains 0.16.4, so I don’t know whether it happens in the new clearml package.
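As an alternative to deleting the reporting calls for the run without Task.init, a small guard can skip reporting when no logger exists. This is only a sketch: report_scalar_if_available is a hypothetical helper, not part of Trains/ClearML, and it assumes that Logger.current_logger() returns None when no task was initialised, as the traceback above suggests.

```python
def report_scalar_if_available(logger, title, series, value, iteration):
    """Forward to logger.report_scalar, or do nothing when logger is None.

    Hypothetical helper. Returns True if the value was reported and
    False if reporting was skipped because there is no logger.
    """
    if logger is None:
        return False
    logger.report_scalar(title=title, series=series, value=value, iteration=iteration)
    return True


# Intended use inside the training loop (names other than the helper come
# from the MNIST example and are not defined in this sketch):
#   report_scalar_if_available(Logger.current_logger(), "train", "loss",
#                              loss.item(), iteration)
```

With this in place the same script can be run with and without Task.init, which keeps the two fuser comparisons otherwise identical.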

  				
Posted 3 years ago by IntriguedBat44

This is the output of sudo fuser -v /dev/nvidia* for GPUs 0, 1 and 2 when I run a single experiment on GPU 0, a different user is running on GPU 1 and no-one is running on GPU 2 (remaining 7 GPUs omitted but are similar to 2).

This only happens when Task.init is called. It never happens otherwise.

/dev/nvidia0:  jdh   2448  F....  python
/dev/nvidia1:  je     315  F...m  python3
               jdh   2448  F....  python
/dev/nvidia2:  jdh   2448  F....  python

  				
Posted 3 years ago by IntriguedBat44

IntriguedRat44 If the monitoring only shows a single GPU (the selected one), it means it reads the correct CUDA_VISIBLE_DEVICES (this is how it knows that you are using only the selected GPU and not all of them).
There is nothing else in the code that will change the OS environment.
Could you print os.environ['CUDA_VISIBLE_DEVICES'] while running the code to verify?
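One way to run this check is to read the variable immediately before and after the suspect call. The sketch below uses only the standard library; env_var_around is a hypothetical helper name and not part of Trains/ClearML.

```python
import os


def env_var_around(func, name="CUDA_VISIBLE_DEVICES"):
    """Call func() and return (value_before, value_after, result).

    Hypothetical helper: each value is the environment variable `name`
    as seen by this process, or None if the variable is unset.
    """
    before = os.environ.get(name)
    result = func()
    after = os.environ.get(name)
    return before, after, result


# Intended use in the training script (Task comes from the trains/clearml
# package and is not imported in this sketch):
#   before, after, task = env_var_around(
#       lambda: Task.init(project_name='examples', task_name='pytorch mnist train'))
#   print(before, after)
```

If the two printed values are identical, the call did not modify the variable in the process environment, which points the investigation elsewhere.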

  				
Posted 3 years ago by AgitatedDove14

Hi IntriguedRat44
Sorry, I missed this message...
I'm assuming you are running in manual mode (i.e. not through the agent); in that case we do not change CUDA_VISIBLE_DEVICES.
What do you see in the resource monitoring? Is it a single GPU or multiple GPUs?
(Check :monitor:gpu in the Scalar tab under results.)
Also, which Trains/ClearML version and which OS are you using?

  				
Posted 3 years ago by AgitatedDove14

Thanks IntriguedRat44 !
I'll follow up on GitHub 🙂

  				
Posted 3 years ago by AgitatedDove14

I’ve verified that CUDA_VISIBLE_DEVICES doesn’t get changed during the Task.init call or anywhere else during the script.

  				
Posted 3 years ago by IntriguedBat44

I’ve created this issue: https://github.com/allegroai/clearml/issues/305

  				
Posted 3 years ago by IntriguedBat44
