Hi. I'd like to try the GCP Autoscaler.

Unanswered

so..
I restarted the autoscaler with this configuration object:
[
{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100},
{"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 2, "queue_name": "services", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}
]
specifying the python:3.9-bullseye base image.
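One thing that might be worth checking in that configuration: both entries set "cpu_only": true but also "gpu_count": 1. I don't know whether the autoscaler ignores gpu_count when cpu_only is set, so this is only a sanity check and not a diagnosis. A minimal sketch (the helper name find_gpu_mismatches is made up for illustration, and the config is trimmed to the fields being checked):

```python
import json


def find_gpu_mismatches(resource_configs):
    """Return names of resources marked cpu_only that still request GPUs."""
    mismatched = []
    for entry in resource_configs:
        cpu_only = entry.get("cpu_only", False)
        # Treat a missing or null gpu_count as zero.
        gpu_count = entry.get("gpu_count") or 0
        if cpu_only and gpu_count > 0:
            mismatched.append(entry.get("resource_name"))
    return mismatched


# Trimmed copy of the configuration above (only the fields being checked).
config_text = """[
  {"resource_name": "cpu_default", "cpu_only": true, "gpu_type": null, "gpu_count": 1},
  {"resource_name": "cpu_services", "cpu_only": true, "gpu_type": null, "gpu_count": 1}
]"""

print(find_gpu_mismatches(json.loads(config_text)))
```

With the configuration above this lists both cpu_default and cpu_services. Whether that mismatch has anything to do with the --gpus flag is an open question.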
The autoscaler seems to be running relatively OK, although the log has some errors such as:
2022-07-13 19:00:18,583 - clearml.Auto-Scaler - ERROR - Error: SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2635)'), retrying in 15 seconds
Currently three VMs are running in GCP Compute Engine.

I then launched a new pipeline from https://clearml.slack.com/files/U03JT5JNS9M/F03PX2FSTK2/pipe_script.py (instead of cloning).
The (failed) pipeline task's console log is attached. It is still failing with:
Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
presumably because it executed docker run with --gpus all.

  				
Posted 2 years ago by PanickyMoth78
				