@PompousSpider11 I think you're missing the drivers installation, as described in the thread @AgitatedDove14 pointed to
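A minimal sketch of what that driver installation could look like, assuming a GCP Deep Learning VM image: the install-nvidia-driver metadata flag and the /opt/deeplearning/install-driver.sh path come from GCP's Deep Learning VM images and should be verified against your image version; the instance name and accelerator type below are placeholders.

# Option 1: ask the Deep Learning VM image to install the NVIDIA driver at first boot via instance metadata
gcloud compute instances create my-autoscaler-worker \
  --image=projects/ml-images/global/images/c0-deeplearning-common-cu113-v20230807-debian-10 \
  --maintenance-policy=TERMINATE \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --metadata=install-nvidia-driver=True

# Option 2: run the image's bundled driver installer from the autoscaler's init/startup script
# (check that this path exists on your image by SSHing into a manually created instance first)
sudo /opt/deeplearning/install-driver.sh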
Hi community! I'm trying to set up a GCP Autoscaler using the following machine image / docker container:
- machine image : projects/ml-images/global/images/c0-deeplearning-common-cu113-v20230807-debian-10
- docker image : nvidia/cuda:12.2.0-devel-ubuntu20.04
When the experiment is spun up, I get the following error when starting the Docker container:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
I've tried Docker images where the CUDA version matches that of the machine image (CUDA 11.3), but I still get the same error. If I've understood it correctly, the error occurs when the container is started, meaning that libnvidia-ml.so.1 is missing from the machine image. Does anyone in this channel have suggestions regarding which image to use, or do I have to build one myself?
If I SSH into the worker instance in GCP, I can find libnvidia-ml.so:
sudo find / -iname 'libnvidia-ml.so*'
/usr/local/cuda-11.3/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
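Worth noting: that file lives under the CUDA toolkit's stubs/ directory, which holds link-time stubs rather than the runtime library shipped with the NVIDIA driver, so the nvidia-container hook can't use it. A quick check on the host (paths assume a Debian-based image; the library location may differ):

# a working driver install provides nvidia-smi and the runtime libnvidia-ml.so.1
nvidia-smi
ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
# if both are missing, the driver was never installed on the host, which matches the container error above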