so..
I restarted the autoscaler with this configuration object, specifying the python:3.9-bullseye base image:
[{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 2, "queue_name": "services", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}]
The autoscaler seems to be running relatively OK, although the log has some errors such as:
2022-07-13 19:00:18,583 - clearml.Auto-Scaler - ERROR - Error: SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2635)'), retrying in 15 seconds
Currently three VMs are running in GCP Compute Engine.
I then launched a new pipeline from https://clearml.slack.com/files/U03JT5JNS9M/F03PX2FSTK2/pipe_script.py (instead of cloning).
The (failed) pipeline task's console log is attached. It is still failing with: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
Presumably this is because it executed docker run with --gpus all.
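One thing that might be worth ruling out: both entries in the config above have "cpu_only": true but also "gpu_count": 1. I don't know whether the autoscaler ignores gpu_count when cpu_only is set, so this is only a guess at a possible cause, not a confirmed one. A minimal sketch that flags such entries (the helper name find_gpu_mismatches is hypothetical, and the config is trimmed to the relevant fields):

```python
import json

# Config copied from the message above, trimmed to the fields relevant here.
CONFIG_JSON = """
[
  {"resource_name": "cpu_default", "cpu_only": true, "gpu_type": null,
   "gpu_count": 1, "queue_name": "default"},
  {"resource_name": "cpu_services", "cpu_only": true, "gpu_type": null,
   "gpu_count": 1, "queue_name": "services"}
]
"""


def find_gpu_mismatches(resources):
    """Return names of resources marked cpu_only that still request GPUs.

    Hypothetical helper: it only inspects the config, it does not know
    how the autoscaler actually interprets these fields.
    """
    return [
        r["resource_name"]
        for r in resources
        if r.get("cpu_only") and (r.get("gpu_count") or 0) > 0
    ]


suspicious = find_gpu_mismatches(json.loads(CONFIG_JSON))
print(suspicious)  # ['cpu_default', 'cpu_services']
```

If both resources show up here, setting gpu_count to 0 and restarting the autoscaler might be a cheap experiment to see whether --gpus all disappears from the docker run command.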