I found the issue. In the code we need to add this condition:
if self._active_gpus and i not in self._active_gpus: continue
to make sure we do not enter the for loop that follows. I propose adding this condition here: https://github.com/allegroai/trains/blob/e7864e6ba866a518ff07ab86da7c4703091fa94a/trains/utilities/resource_monitor.py#L302
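For reference, a minimal sketch of where that guard would sit in the per-GPU loop (this is a simplified illustration, not the actual body of resource_monitor.py; it assumes the standalone gpustat package, which exposes the same kind of query object trains reads from, and `active_gpus` is a hypothetical stand-in for self._active_gpus):

```python
import gpustat

active_gpus = [0]  # hypothetical: the GPU indices the monitor was asked to watch

gpu_stat = gpustat.GPUStatCollection.new_query()
readings = []
for i, g in enumerate(gpu_stat.gpus):
    # proposed guard: skip GPUs that are not in the active set,
    # so the per-process loop below is never entered for them
    if active_gpus and i not in active_gpus:
        continue
    # read the per-process information only for the GPUs we monitor
    for p in g.processes:
        readings.append((i, p.get("pid")))
print(readings)
```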
BoredGoat1, where exactly do you think that happens?
From this warning: "TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring" it seems trains failed to load the nvidia .so library that does the GPU monitoring:
This is based on pynvml, and I think it is trying to access "libnvidia-ml.so.1"
Basically, if you can run nvidia-smi from inside the container, it should work.
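If nvidia-smi works inside the container but the warning still appears, one way to check whether libnvidia-ml.so.1 itself can be loaded from Python is to initialize NVML directly. A minimal sketch, assuming the standalone pynvml package is installed:

```python
import pynvml

try:
    pynvml.nvmlInit()  # raises NVMLError if libnvidia-ml.so.1 cannot be loaded
    print("NVML loaded, %d GPU(s) visible" % pynvml.nvmlDeviceGetCount())
    pynvml.nvmlShutdown()
except pynvml.NVMLError as err:
    print("NVML failed to load: %s" % err)
```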
The script works. I tested to find where in the code the issue comes from, and in the function, g.processes is empty or None. Moreover, in _last_process_pool I only have cpu and no gpu. I think the issue is that one of the GPUs returns None instead of an empty list, so the for loop crashes and no GPU is logged.
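To illustrate why a single None entry is enough to lose all GPU readings (a minimal standalone sketch, not the actual resource_monitor.py code; FakeGpu only stands in for a per-GPU stats entry):

```python
# FakeGpu mimics a per-GPU stats entry with an index and a processes list
class FakeGpu:
    def __init__(self, index, processes):
        self.index = index
        self.processes = processes

gpus = [FakeGpu(0, [{"pid": 1234}]), FakeGpu(1, None)]  # GPU 1 reports None

# current behaviour: iterating over None raises TypeError,
# so the whole loop aborts and no GPU at all gets logged
try:
    for g in gpus:
        for p in g.processes:
            print("GPU %d has process %s" % (g.index, p["pid"]))
except TypeError as err:
    print("loop crashed:", err)

# defensive version: treat None as "no processes" and keep the other GPUs
for g in gpus:
    for p in (g.processes or []):
        print("GPU %d has process %s" % (g.index, p["pid"]))
```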