PompousBeetle71 could you try trains-agent 0.15.0rc0 ? What's the OS you are using? Are you running in docker mode, if so, what's the docker version?
AgitatedDove14 I can't try the new agent at the moment, the OS is Ubuntu 18.04 more specifically: amazon/Deep Learning Base AMI (Ubuntu 18.04) Version 22.0
and no docker. Running on the machine.
AgitatedDove14 I'm using that code in the meanwhile
` ### This script checks the number of GPUs, create a list like 0,1,2...
Then adds '--gpus' before that list of GPUs
NUM_GPUS=nvidia-smi -L | wc -l
NUM_GPUS=$(($NUM_GPUS-1))
OUT=()
if [ $NUM_GPUS -ge 0 ]
then
for i in $(seq 0 $NUM_GPUS); do OUT+=( "$i" ); done
echo ${OUT[*]// /|} | tr ' ' ',' | awk '{print "--gpus "$1}'
else
echo ""
fi `
PompousBeetle71 , what you are saying that for some reason the --gpus all will not configure the Nvidia drivers to use all the gpus, when running bare metal (i.e no docker). Did I understand you correctly ?
p.s. any chance you can get me the nvidia driver version? I can't seem to find the one for v22 on amazon
AgitatedDove14 yes, you're right. it was 10.2 or 10.1 if I recall.
PompousBeetle71 , These are cuda versions, I'm looking for the nvidia driver version for example 440.xx or 418.xx .
The reason is, we set an OS environment for the driver, and I remember that old drivers did not support it . Basically they do not support NVIDIA_VISIBLE_DEVICES=all , so I'm trying to see if that's the case, then we could add fix .