
Good day,

We have been using ClearML for project monitoring and task management. Recently, we decided to try the Google Cloud Platform Autoscaler to automate our existing GCP VM creation pipelines, task queuing, and processing.

However, we've encountered several challenges while setting up the GCP autoscaler. Let me briefly introduce our use case.
We are running deep learning training workloads and aim to spin up GPU-enabled instances, start experiments, and monitor their progress. While we were able to achieve this using our local server as the ClearML agent, the autoscaler setup has proven problematic.

What we would like to do is:

  • Use a base machine image for GPU-enabled VMs that is optimized for deep learning (so that it includes NVIDIA drivers, CUDA, etc.),
  • Run a custom Docker image with our project's dependencies, gcsfuse, and gcloud, in order to mount a GCP bucket containing our datasets,
  • Set appropriate Docker runtime arguments, such as enabling the NVIDIA runtime and increasing shared memory to prevent data loader crashes (a rough sketch follows this list).
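
For illustration, the container launch we are aiming for is roughly equivalent to the following docker run sketch (the registry, image name, bucket, mount path, and shared-memory size are placeholders, not our actual configuration):

# Rough equivalent of the container the agent should start (all names are placeholders).
# --gpus all exposes the GPUs through the NVIDIA runtime, --shm-size enlarges shared
# memory for the PyTorch data loaders, and the FUSE capability/device let gcsfuse
# mount the bucket inside the container.
docker run --rm \
  --gpus all \
  --shm-size=8g \
  --cap-add SYS_ADMIN --device /dev/fuse \
  our-registry/training-image:latest \
  bash -c "mkdir -p /mnt/datasets && gcsfuse our-datasets-bucket /mnt/datasets && python train.py"
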
We configured the autoscaler accordingly. Note that the CPU version works properly: VMs are successfully spun up, collect tasks, and execute them as expected. For GPU, we selected the following machine image:
projects/deeplearning-platform-release/global/images/family/pytorch-latest-cu113

This image comes with preinstalled NVIDIA drivers, CUDA, and PyTorch.
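
For reference, the concrete image this family resolves to can be inspected with gcloud, using the project and family from the path above:

# Show the latest image behind the family the autoscaler points at.
gcloud compute images describe-from-family pytorch-latest-cu113 \
  --project deeplearning-platform-release
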
However, when using this machine image, we encounter the following error:
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

This suggests one of the following issues:

  • The NVIDIA driver is not installed,
  • The Docker container cannot access the driver,
  • The NVIDIA container toolkit is misconfigured.
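
Host-level checks along these lines help tell the three cases apart (a sketch, assuming the standard NVIDIA tooling is installed in its usual locations):

# Run on the VM itself, not inside a container.
nvidia-smi                              # does the driver respond at all?
ldconfig -p | grep libnvidia-ml.so.1    # is the library named in the error on the loader path?
nvidia-container-cli info               # can the container toolkit see the driver?
cat /etc/docker/daemon.json             # is the nvidia runtime registered with Docker?
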
We made extensive efforts to resolve this:
  • Verified that the machine image includes the NVIDIA driver (and also tested reinstallation),
  • Tried using a base image without drivers and installing them manually (no success),
  • Installed the full CUDA toolkit and NVIDIA container toolkit, and configured Docker to use the NVIDIA runtime, all via the init script (roughly sketched after this list),
  • Added a reboot step after installation, which led to dangling VMs that stayed active but failed to pick up tasks.
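
The toolkit part of that init script looked roughly like this (a sketch rather than the exact script; adding NVIDIA's apt repository for the container toolkit is omitted and follows NVIDIA's installation docs):

# Init-script excerpt (sketch): install the container toolkit and point Docker at it.
apt-get update
apt-get install -y nvidia-container-toolkit
nvidia-ctk runtime configure --runtime=docker   # registers the nvidia runtime in /etc/docker/daemon.json
systemctl restart docker
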
Despite these steps, and after confirming that the host had working drivers and CUDA, Docker continued to throw the same error. We also verified that nvidia-smi works on the host, but containers still failed to initialize.
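
A minimal in-container check along these lines reproduces the failure independently of our training image (the CUDA image tag is only an example chosen to match the host's CUDA 11.3):

# Expected to print the same nvidia-smi table as on the host; on the failing VMs
# a command like this hits the same hook error as above.
docker run --rm --gpus all nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi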

For sanity checking, we reverted to CPU mode. With that, the autoscaler works perfectly: VMs are spun up and tasks run without issue.
We would like to kindly ask for your assistance. We've been stuck on this issue for quite some time and cannot identify the root cause. Similar setups have worked for us on GCP in the past, but this one continues to fail despite our efforts.

Thank you in advance, we look forward to hearing from you.

  
  
Posted 3 months ago

Answers 6


Hi @<1826791494376230912:profile|CornyLobster42>, it looks like there might be an issue with the image. Have you tried other images? From what I see here - None

Many people report various issues with this image, along with various suggested solutions.

What if you try with this image? projects/ml-images/global/images/c0-deeplearning-common-cu121-v20231209-debian-11

  
  
Posted 3 months ago

@<1529271085315395584:profile|AmusedCat74> , wow, that's an impressive find! Did you stumble on it being mentioned somewhere, or did you figure it out yourself?

  
  
Posted 3 months ago

It's not immediately obvious from the GCP documentation, and you don't need to do this on AWS or Azure, so it can catch you out. For what it's worth, the image I originally used was from the same family Marko referenced above.

  
  
Posted 3 months ago

I run GPU workloads successfully using the GCP Autoscaler. Have you included this line in the autoscaler's init script? This was a gotcha for me...

/opt/deeplearning/install-driver.sh
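
A minimal autoscaler init script built around that line would look roughly like this (a sketch, assuming nothing else is needed before the agent starts):

#!/bin/bash
# Build/install the NVIDIA driver shipped with the Deep Learning VM image
# before any containers are started.
/opt/deeplearning/install-driver.sh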
  
  
Posted 3 months ago

Given that nvidia-smi is working, you may have already done that. In that case, depending on your Ubuntu version, you may have another problem: Ubuntu 22+ has this issue, which has a workaround. This also caught me out...

None

  
  
Posted 3 months ago

Downgrading Ubuntu to 20.04 appears to have solved the issue! Thank you so much.

  
  
Posted 3 months ago