Hi BoredGoat1
From this warning: "TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring"
it seems Trains failed to load the NVIDIA shared library (.so) that does the GPU monitoring:
This is based on pynvml, and I think it is trying to access "libnvidia-ml.so.1"
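A quick way to see whether pynvml can load it (just a minimal sketch, assuming pynvml is installed inside the container):

import pynvml

try:
    # nvmlInit() is what fails if libnvidia-ml.so.1 cannot be loaded
    pynvml.nvmlInit()
    print("driver:", pynvml.nvmlSystemGetDriverVersion())
    pynvml.nvmlShutdown()
except pynvml.NVMLError as err:
    print("NVML init failed:", err)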
Basically, if you can run nvidia-smi from inside the container, it should work.
Yes, that solved the issue. I will do the PR today
For my main GPU (the one used for training) it is an empty array, and for my other GPU it is None
I found the issue. In the code we need to add this condition:
if self._active_gpus and i not in self._active_gpus: continue
so that we skip GPUs that are not active instead of entering the for loop body below. I propose adding this condition here: https://github.com/allegroai/trains/blob/e7864e6ba866a518ff07ab86da7c4703091fa94a/trains/utilities/resource_monitor.py#L302
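Roughly, with illustrative names (a sketch, not the exact code from that file), the guarded loop would look like this:

# sketch: skip GPUs outside the active set before touching their stats,
# so a display-only GPU never reaches the loop body
def collect_gpu_stats(gpus, active_gpus):
    stats = {}
    for i, g in enumerate(gpus):
        # the proposed guard
        if active_gpus and i not in active_gpus:
            continue
        stats["gpu_%d_utilization" % i] = g["utilization.gpu"]
        stats["gpu_%d_mem_used" % i] = g["memory.used"]
    return stats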
It is already in the variable:
echo $LD_LIBRARY_PATH
/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs
BoredGoat1
Hmm, that means it should have worked with Trains as well.
Could you run the attached script, see if it works?
Yes, that's the part that is supposed to pull the GPU usage only for your process (and its sub-processes) instead of globally for the entire system
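Roughly (a sketch with illustrative names, not the exact Trains code), that filtering collects your own pid plus all sub-process pids with psutil and keeps only GPUs whose process list mentions one of them:

import os
import psutil

def pids_of_this_process_tree():
    # our pid plus all sub-process pids
    me = psutil.Process(os.getpid())
    return {me.pid} | {child.pid for child in me.children(recursive=True)}

def gpu_used_by_us(gpu_entry, our_pids):
    # gpu_entry.processes is a list of per-process dicts with a "pid" key in gpustat;
    # "or []" covers the case where it is None
    return any(p.get("pid") in our_pids for p in (gpu_entry.processes or []))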
In the for loop here, processes is either empty or None in my case; None is for my display GPU
BoredGoat1 where exactly do you think that happens?
https://github.com/allegroai/trains/blob/master/trains/utilities/gpu/gpustat.py#L316
?
https://github.com/allegroai/trains/blob/master/trains/utilities/gpu/gpustat.py#L202
I have the lib in the container (/.singularity.d/libs/)
FYI, my driver version is 418.67
Hi AgitatedDove14, I can run nvidia-smi inside the container. However, I still get this warning message
hmmm I see...
It seems to miss the fact that your process does use the GPU.
Maybe the GPU is only used later in the run?
Does that make sense?
Maybe permissions?!
you can test it manually by installing pynvml
and running:
from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()
nvsmi.DeviceQuery('memory.free, memory.total')
The script works. I tested to find where in the code the issue comes from: in the function _get_gpu_stats(self), g.processes is either empty or None. Moreover, in _last_process_pool I only have cpu and no gpu. I think the issue is that one of the GPUs returns None instead of an empty array, so the for loop crashes and no GPU is logged
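Just to illustrate the failure mode with hypothetical data (these dicts are mine, not the actual gpustat objects): a None process list breaks the iteration, while an empty list is harmless:

# hypothetical data shaped like the symptom above: one GPU with an empty
# process list and one (the display GPU) reporting None
gpus = [
    {"index": 0, "processes": []},    # training GPU: loop body simply never runs
    {"index": 1, "processes": None},  # display GPU: iterating this raises TypeError
]

for g in gpus:
    # "or []" makes None behave like an empty list, so the loop survives
    for proc in g["processes"] or []:
        print(g["index"], proc)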
Done here: https://github.com/allegroai/trains/pull/170 🎉
Yes, that means the nvidia drivers are present (as you mentioned the GPU seems to be detected).
Could you check that you have libnvidia-ml.so.1 inside the container?
For example in /usr/lib/nvidia-XYZ/
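One quick check from Python (a sketch; whether that path is on the search path depends on your image) is to ask the dynamic linker directly, since that is what pynvml relies on:

import ctypes

try:
    ctypes.CDLL("libnvidia-ml.so.1")
    print("libnvidia-ml.so.1 is visible to the dynamic linker")
except OSError as err:
    print("libnvidia-ml.so.1 not found:", err)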
Okay, could you test with:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/.singularity.d/libs/
Yes that is possible. I will try something to be sure
I tested it, and I have no more warning messages
if self._active_gpus and i not in self._active_gpus: continue
This solved it?
If so, PR pretty please 🙂