@PompousSpider11 I think you're missing the drivers installation, as described in the thread @AgitatedDove14 pointed to
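A minimal sketch of what that driver installation could look like, assuming a GCP Deep Learning VM image: the install-nvidia-driver metadata flag and the /opt/deeplearning/install-driver.sh path come from GCP's Deep Learning VM images and should be verified against your image version; the instance name and accelerator type below are placeholders.

# Option 1: ask the Deep Learning VM image to install the NVIDIA driver at first boot via instance metadata
gcloud compute instances create my-autoscaler-worker \
  --image=projects/ml-images/global/images/c0-deeplearning-common-cu113-v20230807-debian-10 \
  --maintenance-policy=TERMINATE \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --metadata=install-nvidia-driver=True

# Option 2: run the image's bundled driver installer from the autoscaler's init/startup script
# (check that this path exists on your image by SSHing into a manually created instance first)
sudo /opt/deeplearning/install-driver.sh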
Hi community! I'm trying to set up a GCP Autoscaler using the following machine image / docker container:
- machine image : projects/ml-images/global/images/c0-deeplearning-common-cu113-v20230807-debian-10
- docker image : nvidia/cuda:12.2.0-devel-ubuntu20.04
When the experiment is spun up, I get the following error when starting the Docker container:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
I've tried Docker images where the CUDA version matches that of the machine image (CUDA 11.3), but I still get the same error. If I've understood it correctly, the error occurs when the container is started, meaning that libnvidia-ml.so.1 is missing from the machine image. Does anyone in this channel have suggestions regarding which image to use, or do I have to build one myself?
If I SSH into the worker instance in GCP, I can find libnvidia-ml.so:
sudo find / -iname 'libnvidia-ml.so*'
/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
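Worth noting: that file lives under the CUDA toolkit's stubs/ directory, which holds link-time stubs rather than the runtime library shipped with the NVIDIA driver, so the nvidia-container hook can't use it. A quick check on the host (paths assume a Debian-based image; the library location may differ):

# a working driver install provides nvidia-smi and the runtime libnvidia-ml.so.1
nvidia-smi
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
# if both are missing, the driver was never installed on the host, which matches the container error above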