Agent running inside an official NVIDIA container sees the GPU driver, but the GPU count is 0

Answered

Hi,
We have an agent running inside an official NVIDIA container. The agent seems to see the GPU driver, but the GPU count it reports is 0.
When I attach to that container, nvidia-smi reports the GPUs correctly. The agent is launched with clearml-agent --gpus 0.
Versions: clearml-agent v1.7.0 and ClearML v1.14.4.

  				
Posted one year ago by ManiacalLizard2
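To cross-check what the question describes, the GPU count and type can be queried from inside the container independently of the agent. The following is a minimal sketch, assuming nvidia-smi is on the PATH in the container; the helper names `parse_gpu_list` and `visible_gpus` are hypothetical, and this is not how clearml-agent itself detects GPUs.

```python
import shutil
import subprocess


def parse_gpu_list(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=index,name --format=csv,noheader`
    into a list of (index, name) tuples."""
    gpus = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        index, name = line.split(",", 1)
        gpus.append((int(index), name.strip()))
    return gpus


def visible_gpus():
    """Return [(index, name), ...] as seen by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    gpus = visible_gpus()
    print(f"GPU count: {len(gpus)}")
    for index, name in gpus:
        print(f"  GPU {index}: {name}")
```

Running this in the same container as the agent shows what the NVIDIA tooling exposes there, which can then be compared with the count the agent reports.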


Answers 6

The weird thing is that GPU 0 seems to be in use, as reported by nvtop on the host, but the job is 50% slower than when running directly rather than through clearml-agent...

  				
Posted one year ago by ManiacalLizard2

Oh... maybe the bottleneck is augmentation on the CPU!
But is it normal that the agent doesn't detect the GPU count and type properly?

  				
Posted one year ago by ManiacalLizard2
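One rough way to test the CPU-augmentation hypothesis is to time how long the loop waits for each batch separately from how long the training step takes. This is only a sketch under assumptions: `profile_steps`, `batches`, and `train_step` are hypothetical names, and `train_step` must block until the GPU work has finished (in PyTorch, for example, by calling torch.cuda.synchronize() at its end), otherwise the split is misleading.

```python
import time


def profile_steps(batches, train_step, max_steps=50):
    """Return (seconds spent waiting for data, seconds spent in train_step, steps run).

    `batches` is any iterable of batches (its iteration cost includes CPU-side
    augmentation); `train_step` is a callable taking one batch.
    """
    load_time = 0.0
    step_time = 0.0
    steps = 0
    iterator = iter(batches)
    while steps < max_steps:
        t0 = time.perf_counter()
        try:
            batch = next(iterator)  # time spent here is data loading + augmentation
        except StopIteration:
            break
        t1 = time.perf_counter()
        train_step(batch)  # assumed to block until the device work is done
        t2 = time.perf_counter()
        load_time += t1 - t0
        step_time += t2 - t1
        steps += 1
    return load_time, step_time, steps
```

If the data-wait time dominates both when running directly and when running through the agent, the slowdown is more likely tied to CPU resources available to the container than to GPU detection.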

SuccessfulKoala55 it is set to "all", as shown in this dump of the agent's environment:

NV_LIBCUBLAS_VERSION=12.2.5.6-1
NVIDIA_VISIBLE_DEVICES=all
CLRML_API_SERVER_URL=https://<redacted>
HOSTNAME=1b6a5b546a6b
NVIDIA_REQUIRE_CUDA=cuda>=12.2 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NV_NVTX_VERSION=12.2.140-1
NV_LIBCUSPARSE_VERSION=12.1.2.141-1
NV_LIBNPP_VERSION=12.2.1.4-1
NCCL_VERSION=2.19.3-1
PWD=/
CLRML_FILE_SERVER_URL=<redacted>/clearml
CLRML_SECRET_KEY=<redacted>
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NV_LIBNPP_PACKAGE=libnpp-12-2=12.2.1.4-1
NVIDIA_PRODUCT_NAME=CUDA
CLRML_ACCESS_KEY=TZQ8P5RNJ6IDLIZ5M3C0
NV_CUDA_CUDART_VERSION=12.2.140-1
HOME=/root
CLRML_CONTAINER_NAME=clearml
CUDA_VERSION=12.2.2
NV_LIBCUBLAS_PACKAGE=libcublas-12-2=12.2.5.6-1
CLRML_WEB_SERVER_URL=<redacted>
NV_LIBCUBLAS_PACKAGE_NAME=libcublas-12-2
CLRML_GIT_TOKEN=
TERM=xterm
CLRML_DOCKER_IMAGE=<redacted>/agent-image:v6
SHLVL=1
NV_CUDA_LIB_VERSION=12.2.2-1
NVARCH=x86_64
CLRML_ENV=prd
CLRML_STORAGE_ACCOUNT=<redacted>
CLEARML_AGENT_SKIP_PIP_VENV_INSTALL=/usr/bin/python3.10
NV_CUDA_COMPAT_PACKAGE=cuda-compat-12-2
NV_LIBNCCL_PACKAGE=libnccl2=2.19.3-1+cuda12.2
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
CLRML_GIT_USER=
CLEARML_WORKER_NAME=tff-AIOT-Q470EA-IM-A:<redacted>/agent-image:v6
PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
NV_LIBNCCL_PACKAGE_NAME=libnccl2
CLRML_STORAGE_KEY=<redacted>
NV_LIBNCCL_PACKAGE_VERSION=2.19.3-1
OLDPWD=/tmp/tmp.A3X3CWjlZc
_=/usr/local/bin/clearml-agent
root@1b6a5b546a6b:/proc/68#

  				
Posted one year ago by ManiacalLizard2

SuccessfulKoala55, should I raise a GitHub issue?

  				
Posted one year ago by ManiacalLizard2

Hi ManiacalLizard2, sorry for the late response - please do 🙏

  				
Posted one year ago by SuccessfulKoala55

Hi ManiacalLizard2, can you check the value of the NVIDIA_VISIBLE_DEVICES environment variable in the agent's process? You can read it from /proc/<agent-pid>/environ.

  				
Posted one year ago by SuccessfulKoala55
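The entries in /proc/<pid>/environ are separated by NUL bytes, so printing the file directly runs all variables together on one line. A small sketch of reading it variable by variable is shown below; the helper names `parse_environ` and `read_process_env` are hypothetical, it works on Linux only, and the file reflects the environment the process was started with, not later changes made by the process itself.

```python
import sys
from pathlib import Path


def parse_environ(raw):
    """Split the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if not entry:
            continue
        key, _, value = entry.partition(b"=")
        env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_process_env(pid):
    """Read the environment a process was started with (Linux only)."""
    return parse_environ(Path(f"/proc/{pid}/environ").read_bytes())


if __name__ == "__main__":
    # Pass the agent's PID as the first argument; "self" inspects this script.
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    try:
        env = read_process_env(pid)
    except OSError as exc:
        print(f"Could not read environment of {pid}: {exc}")
    else:
        for name in ("NVIDIA_VISIBLE_DEVICES", "NVIDIA_DRIVER_CAPABILITIES"):
            print(f"{name}={env.get(name, '<unset>')}")
```

Reading another user's process requires matching permissions, so inside the container this is typically run as the same user as the agent or as root.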
