Hi @<1576381444509405184:profile|ManiacalLizard2> , sorry for the late response - please do 🙏
@<1523701087100473344:profile|SuccessfulKoala55> Should I raise a GitHub issue?
oh ... maybe the bottleneck is the augmentation running on the CPU!
But is it normal that the agent doesn't detect the GPU count and type properly?
Hi @<1576381444509405184:profile|ManiacalLizard2> , can you check the value of the NVIDIA_VISIBLE_DEVICES environment variable in the agent's process? You can look at /proc/<agent-pid>/environ and see.
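(For reference, a minimal Python sketch for reading it; the PID below is a placeholder, substitute the actual clearml-agent PID:)

```python
# Minimal sketch: dump a process's environment from /proc (Linux).
# The PID is a placeholder - replace it with the clearml-agent's real PID.
agent_pid = 12345  # hypothetical

with open(f"/proc/{agent_pid}/environ", "rb") as f:
    # entries in /proc/<pid>/environ are NUL-separated, not newline-separated
    for entry in f.read().split(b"\x00"):
        if entry.startswith(b"NVIDIA_VISIBLE_DEVICES="):
            print(entry.decode())
```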
@<1523701087100473344:profile|SuccessfulKoala55> it is set to "all":
NV_LIBCUBLAS_VERSION=12.2.5.6-1
NVIDIA_VISIBLE_DEVICES=all
CLRML_API_SERVER_URL=https://<redacted>
HOSTNAME=1b6a5b546a6b
NVIDIA_REQUIRE_CUDA=cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NV_NVTX_VERSION=12.2.140-1
NV_LIBCUSPARSE_VERSION=12.1.2.141-1
NV_LIBNPP_VERSION=12.2.1.4-1
NCCL_VERSION=2.19.3-1
PWD=/
CLRML_FILE_SERVER_URL=<redacted>/clearml
CLRML_SECRET_KEY=<redacted>
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NV_LIBNPP_PACKAGE=libnpp-12-2=12.2.1.4-1
NVIDIA_PRODUCT_NAME=CUDA
CLRML_ACCESS_KEY=TZQ8P5RNJ6IDLIZ5M3C0
NV_CUDA_CUDART_VERSION=12.2.140-1
HOME=/root
CLRML_CONTAINER_NAME=clearml
CUDA_VERSION=12.2.2
NV_LIBCUBLAS_PACKAGE=libcublas-12-2=12.2.5.6-1
CLRML_WEB_SERVER_URL=<redacted>
NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-2
CLRML_GIT_TOKEN=
TERM=xterm
CLRML_DOCKER_IMAGE=<redacted>/agent-image:v6
SHLVL=1
NV_CUDA_LIB_VERSION=12.2.2-1
NVARCH=x86_64
CLRML_ENV=prd
CLRML_STORAGE_ACCOUNT=<redacted>
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/usr/bin/python3.10
NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-2
NV_LIBNCCL_PACKAGE=libnccl2=2.19.3-1+cuda12.2
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CLRML_GIT_USER=
CLEARML_WORKER_NAME=tff-AIOT-Q470EA-IM-A:<redacted>/agent-image:v6
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NV_LIBNCCL_PACKAGE_NAME=libnccl2
CLRML_STORAGE_KEY=<redacted>
NV_LIBNCCL_PACKAGE_VERSION=2.19.3-1
OLDPWD=/tmp/tmp.A3X3CWjlZc
_=/usr/local/bin/clearml-agent
root@1b6a5b546a6b:/proc/68#
the weird thing is that GPU 0 seems to be in use, as reported by nvtop on the host. But it is 50% slower when running through the clearml-agent than when running directly ...
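(As a quick sanity check, a short sketch you could run inside the task itself; it assumes PyTorch is the framework in use and simply reports what the task process actually sees:)

```python
# Sketch: report GPU visibility from inside the task process.
# Assumes PyTorch; adapt to whatever framework the task actually uses.
import os
import torch

print("NVIDIA_VISIBLE_DEVICES:", os.environ.get("NVIDIA_VISIBLE_DEVICES"))
print("CUDA_VISIBLE_DEVICES:", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("cuda available:", torch.cuda.is_available())
print("device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))
```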