Good day,
We have been using ClearML for project monitoring and task management. Recently, we decided to try the GCP Autoscaler to automate our existing pipeline of VM creation, task queuing, and task processing on Google Cloud Platform.
However, we've encountered several challenges while setting up the GCP autoscaler. Let me briefly introduce our use case.
We are running deep learning training workloads and aim to spin up GPU-enabled instances, start experiments, and monitor their progress. While we were able to achieve this using our local server as the ClearML agent, the autoscaler setup has proven problematic.
What we would like to do is:
- Use a base machine image for GPU-enabled VMs that is optimized for deep learning (so that it includes NVIDIA drivers, CUDA, etc.),
- Run a custom Docker image with our project's dependencies, gcsfuse, and gcloud, in order to mount a GCP bucket containing our datasets,
- Set appropriate Docker runtime arguments, such as enabling the NVIDIA runtime and increasing shared memory to prevent data-loader crashes (see the sketch after this list).
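For concreteness, the kind of extra container arguments we have in mind looks roughly like this (the image name and shared-memory size are placeholders, and the choice between --gpus and --runtime=nvidia depends on the Docker/toolkit versions):

```bash
# Illustrative extra Docker arguments, not our exact invocation.
# --gpus all      expose the GPUs (or --runtime=nvidia on older Docker/toolkit setups)
# --shm-size=8g   placeholder size; avoids PyTorch DataLoader shared-memory crashes
# gcsfuse inside the container may additionally need --cap-add SYS_ADMIN and --device /dev/fuse
docker run --rm --gpus all --shm-size=8g \
    --cap-add SYS_ADMIN --device /dev/fuse \
    our-project-image:latest
```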
We configured the autoscaler accordingly. Note that the CPU-only configuration works properly: VMs are spun up, pick up tasks, and execute them as expected. For the GPU configuration, we selected the following machine image:
projects/deeplearning-platform-release/global/images/family/pytorch-latest-cu113.
This image comes with preinstalled NVIDIA drivers, CUDA, and PyTorch.
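For reference, the same image family can also be brought up manually outside the autoscaler with something along these lines (instance name, zone, machine type, and accelerator type are placeholders):

```bash
# Manual sanity-check VM from the same image family; all names/zones/types are placeholders
gcloud compute instances create gpu-worker-test \
    --zone=us-central1-a \
    --machine-type=n1-standard-8 \
    --accelerator=type=nvidia-tesla-t4,count=1 \
    --maintenance-policy=TERMINATE \
    --image-family=pytorch-latest-cu113 \
    --image-project=deeplearning-platform-release \
    --metadata=install-nvidia-driver=True
```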
However, when using this machine image, we encounter the following error:
Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
This suggests one of the following issues (basic checks for each are sketched after the list):
- The NVIDIA driver is not installed,
- The Docker container cannot access the driver,
- The NVIDIA container toolkit is misconfigured.
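To distinguish between these, we relied on host-level checks along the following lines (a sketch, not our exact commands):

```bash
# Run on the VM itself, outside any container
nvidia-smi                         # is the driver loaded and the GPU visible on the host?
ldconfig -p | grep libnvidia-ml    # is libnvidia-ml.so.1 on the loader path?
docker info | grep -i runtime      # does Docker list an "nvidia" runtime?
cat /etc/docker/daemon.json        # is the nvidia runtime registered (or set as default)?
```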
We made extensive efforts to resolve this:
- Verified that the machine image includes the NVIDIA driver (and also tested reinstallation),
- Tried using a base image without drivers and installing them manually (no success),
- Installed the full CUDA toolkit and NVIDIA container toolkit, and configured Docker to use the NVIDIA runtime, all via the init script (a sketch of these steps follows this list),
- Added a reboot step after installation, which led to dangling VMs that stayed active but failed to pick up tasks.
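For clarity, the container-toolkit portion of that init script amounted to the usual steps, roughly as follows (Debian/Ubuntu image assumed, NVIDIA's apt repository for the toolkit already configured; this is a sketch rather than our exact script):

```bash
# Install the NVIDIA container toolkit and register the nvidia runtime with Docker
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```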
Despite these steps, and despite confirming that the host had working drivers and CUDA, Docker continued to throw the same error. We also verified that nvidia-smi works on the host, but containers still failed to initialize.
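In other words, the failure reproduces with a minimal container while the host itself looks healthy (the CUDA image tag below is only an example):

```bash
# Works on the host:
nvidia-smi
# Fails inside a minimal GPU container with the libnvidia-ml.so.1 error quoted above:
docker run --rm --gpus all nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
```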
For sanity checking, we reverted to CPU mode. With that, the autoscaler works perfectly: VMs are spun up and tasks run without issue.
We would like to kindly ask for your assistance. We've been stuck on this issue for quite some time and cannot identify the root cause. Similar setups have worked for us on GCP in the past, but this one continues to fail despite our efforts.
Thank you in advance; we look forward to hearing from you.