Answered

Hi, I have a small issue with GPU monitoring. I run my training inside a Singularity container and I set the CUDA_VISIBLE_DEVICES variable. However, I get the following message: "TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring". Any idea how to fix this? Thanks!

  
  
Posted 4 years ago

Answers 30


Thanks!

  
  
Posted 4 years ago

Merged 🙂

  
  
Posted 4 years ago

Done here: https://github.com/allegroai/trains/pull/170 🎉

  
  
Posted 4 years ago

Awesome, thank you!!

  
  
Posted 4 years ago

Yes, that solved the issue. I will do the PR today.

  
  
Posted 4 years ago

I tested and I have no more warning messages

if self._active_gpus and i not in self._active_gpus: continue

This solved it?

If so, PR pretty please 🙂

  
  
Posted 4 years ago

I tested and I have no more warning messages

  
  
Posted 4 years ago

I found the issue. In the code we must add this condition:

if self._active_gpus and i not in self._active_gpus: continue

to be sure we do not go into the for loop afterwards. I propose adding this condition here: https://github.com/allegroai/trains/blob/e7864e6ba866a518ff07ab86da7c4703091fa94a/trains/utilities/resource_monitor.py#L302
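A minimal, self-contained sketch of how that guard behaves (GpuReading and the helper function below are made up for illustration; this is not the actual resource_monitor.py code):

from collections import namedtuple

# Stand-in for the per-GPU reading object; only the .processes attribute matters here.
GpuReading = namedtuple("GpuReading", ["processes"])

def reported_gpu_indices(gpus, active_gpus):
    # Return the indices of GPUs whose processes we would actually report on.
    reported = []
    for i, g in enumerate(gpus):
        # Proposed guard: skip GPUs not listed in CUDA_VISIBLE_DEVICES, so a
        # display GPU whose processes attribute is None never reaches the inner loop.
        if active_gpus and i not in active_gpus:
            continue
        for _proc in g.processes or []:  # "or []" also tolerates a None processes list
            pass
        reported.append(i)
    return reported

# GPU 0: training GPU with an empty process list; GPU 1: display GPU reporting None.
print(reported_gpu_indices([GpuReading([]), GpuReading(None)], active_gpus=[0]))  # -> [0]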

  
  
Posted 4 years ago

Yes, that is possible. I will try something to be sure.

  
  
Posted 4 years ago

Hmmm, I see...
It seems to miss the fact that your process does use the GPU.
Maybe the GPU is only used later in the run?
Does that make sense?

  
  
Posted 4 years ago

For my main GPU (the one used for training) it is an empty array, and for my other (display) GPU it is None.

  
  
Posted 4 years ago

so g.processes is None?

  
  
Posted 4 years ago

Yes, that's the part that is supposed to pull the GPU usage only for your process (and subprocesses) instead of globally for the entire system.

  
  
Posted 4 years ago

In the for loop here, processes is empty or None in my case. None is for my display GPU.

  
  
Posted 4 years ago

My second graphics card is only for display.

  
  
Posted 4 years ago

The script works. I tested to check where in the code the issue comes from: in the function _get_gpu_stats(self), g.processes is empty or None. Moreover, in _last_process_pool I only have cpu and no gpu. I think the issue is that one of the GPUs returns None instead of an empty array. The for loop crashes, and so no GPU is logged.
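To illustrate why a None processes value breaks the loop while an empty array does not (a toy snippet, not the monitor code itself):

for p in []:          # empty array: the loop body is simply skipped
    pass

try:
    for p in None:    # None: raises immediately, so nothing after it gets logged
        pass
except TypeError as e:
    print(e)          # "'NoneType' object is not iterable"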

  
  
Posted 4 years ago

BoredGoat1
Hmm, that means it should have worked with Trains as well.
Could you run the attached script and see if it works?

  
  
Posted 4 years ago

and I have root permissions

  
  
Posted 4 years ago

That works in IPython.

  
  
Posted 4 years ago

Maybe permissions?!
You can test it manually by installing pynvml and running:

from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()
nvsmi.DeviceQuery('memory.free, memory.total')
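If that runs, a slightly fuller check that prints the free/total memory per GPU might look like the sketch below (the result keys follow pynvml.smi's DeviceQuery output and can differ between versions, so the key handling here is defensive):

from pynvml.smi import nvidia_smi

nvsmi = nvidia_smi.getInstance()
result = nvsmi.DeviceQuery('memory.free, memory.total')
# DeviceQuery returns a dict; the 'gpu' entry holds one item per device.
gpus = result.get('gpu') or []
if isinstance(gpus, dict):  # some versions return a single dict instead of a list
    gpus = [gpus]
for gpu in gpus:
    mem = gpu.get('fb_memory_usage', {})
    print(mem.get('free'), mem.get('total'), mem.get('unit'))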

  
  
Posted 4 years ago

Okay, that is odd...

  
  
Posted 4 years ago

It is already in the variable:

echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
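(For completeness, the same check can be done from Python, since that is where the rest of the debugging happens; the path below is taken from the echo output above:)

import os

ld_path = os.environ.get("LD_LIBRARY_PATH", "")
# True if the Singularity libs directory is on the library search path
print("/.singularity.d/libs" in ld_path.split(":"))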

  
  
Posted 4 years ago

Okay, could you test with:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/.singularity.d/libs/

  
  
Posted 4 years ago

I have the lib in the container (/.singularity.d/libs/). FYI, my driver version is 418.67.

  
  
Posted 4 years ago

Yes, that means the NVIDIA drivers are present (as you mentioned, the GPU seems to be detected).
Could you check that you have libnvidia-ml.so.1 inside the container?
For example in /usr/lib/nvidia-XYZ/
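A quick way to check from Python whether that library can actually be found and loaded inside the container (just a diagnostic sketch, not something Trains itself runs):

import ctypes

try:
    # This is the same shared library pynvml relies on for GPU readings.
    ctypes.CDLL("libnvidia-ml.so.1")
    print("libnvidia-ml.so.1 loaded OK")
except OSError as err:
    print("could not load libnvidia-ml.so.1:", err)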

  
  
Posted 4 years ago

Hi AgitatedDove14, I can run nvidia-smi inside the container. However, I still get this warning message.

  
  
Posted 4 years ago

TimelyPenguin76, that didn't fix the issue.

  
  
Posted 4 years ago

Hi BoredGoat1
From this warning: "TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring" it seems Trains failed to load the NVIDIA .so library that does the GPU monitoring.
This is based on pynvml, and I think it is trying to access "libnvidia-ml.so.1".

Basically, if you can run nvidia-smi from inside the container, it should work.
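For reference, a minimal check that pynvml itself can initialize inside the container (assuming the pynvml package is installed; this mirrors what the monitor depends on, it is not Trains' own code):

import pynvml

try:
    pynvml.nvmlInit()
    print("NVML OK, driver:", pynvml.nvmlSystemGetDriverVersion())
    print("GPUs visible:", pynvml.nvmlDeviceGetCount())
    pynvml.nvmlShutdown()
except pynvml.NVMLError as err:
    print("NVML init failed:", err)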

  
  
Posted 4 years ago