Hi, I Have A Small Issue About Gpu Monitoring. I Run My Training Inside A Singularity Container And I Set The Cuda_Visible_Devices Variable. However, I Get The Following Message:

Answered

Hi, I have a small issue about GPU monitoring. I run my training inside a Singularity container and I set the CUDA_VISIBLE_DEVICES variable. However, I get the following message: TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring . Any idea about how to fix this ? Thanks !

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Votes Newest

Answers 30

TimelyPenguin76 didn't fix the issue.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

BoredGoat1 where exactly do you think that happens ?
https://github.com/allegroai/trains/blob/master/trains/utilities/gpu/gpustat.py#L316
?
https://github.com/allegroai/trains/blob/master/trains/utilities/gpu/gpustat.py#L202

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

That's work in ipython

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Maybe permissions?!
you can test it manually by installing pynvml
and running:
from pynvml.smi import nvidia_smi nvsmi = nvidia_smi.getInstance() nvsmi.DeviceQuery('memory.free, memory.total')

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

so g.processes is None?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

and I have root permissions

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

I find the issue. In the code we must to add this condition
if self._active_gpus and i not in self._active_gpus: continueto be sure to not go in the for loop after. I propose to add this condition here: https://github.com/allegroai/trains/blob/e7864e6ba866a518ff07ab86da7c4703091fa94a/trains/utilities/resource_monitor.py#L302

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Done here: https://github.com/allegroai/trains/pull/170 🎉

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

It is already in the variable :
echo $LD_LIBRARY_PATH /usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Okay that is odd ...

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Yes that's the part that is supposed to only pull the GPU usage for your process (and sub processes) instead of globally on the entire system

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Awesome, thank you!!

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

hmmm I see...
It seems to miss the fact that your process do uses the GPU.
Maybe it only happens later, that the GPU is used?
Does that make sense ?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

The script works. I tested to check where in the cpde the issue comes from and in the function: _get_gpu_stats(self) , g.processes is empty or None. Moreover, in _last_process_pool I only have cpu and no gpu. I think the issue is because one of the gpu return None instead of empty array. The for loop crash and so no GPU is logged

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Hi AgitatedDove14 , I can run nvidia-smi inside the container. However, I have this warning message

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

For my main GPU (for the training) it is empty array and for my other GPU it is empty

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

In the for loop here. processes is empty or None in my case. None is for my display GPU

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

BoredGoat1
Hmm, that means it should have worked with Trains as well.
Could you run the attached script, see if it works?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

https://github.com/allegroai/trains/blob/e7864e6ba866a518ff07ab86da7c4703091fa94a/trains/utilities/resource_monitor.py#L303

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Yes that's solved the issue. I will do the PR today

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Hi BoredGoat1
from this warning: " TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring " It seems trains failed to load the nvidia .so dll that does the GPU monitoring:
This is based on pynvml, and I think it is trying to access "libnvidia-ml.so.1"

Basically saying, if you can run nvidima-smi from inside the container, it should work.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

My second graphic card is only for display.

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

I tested and I have no more warning messages

if self._active_gpus and i not in self._active_gpus: continueThis solved it?

If so, PR pretty please 🙂

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Yes that is possible. I will try something to be sure

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Yes, that means the nvidia drivers are present (as you mentioned the GPU seems to be detected).
Could you check you have libnvidia-ml.so.1 inside the container ?
For example in /usr/lib/nvidia-XYZ/

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

I tested and I have no more warning messages

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Okay could you test with export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/.singularity.d/libs/

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Thanks !

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

I have the lib in the container ( /.singularity.d/libs/ ) FYI, my driver version is 418.67

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					BoredGoat1
				
					0
					 × 1

Merged 🙂

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Write your answer

1K Views

30 Answers

4 years ago

2 years ago