Hi. I'd like to try the GCP autoscaler.

Answered

Hi. I'd like to try the GCP autoscaler.
What permissions does the service account that I provide to ClearML need, and which GCP APIs should I enable in the GCP project? In the "GCP credentials" text box, do I provide the full contents of a credentials JSON file?

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

Answers 30

On the bright side, we started off with agents failing to run on VMs so this is progress 🙂

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

As mentioned before, they're incompatible with each other.

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

I think so, yes. You need a machine with a GPU - this is assuming I'm correct about the n1-standard-1 machine

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

Fair point

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

So...
I restarted the autoscaler with this configuration object, specifying the python:3.9-bullseye base image:
[{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 2, "queue_name": "services", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}]
The autoscaler seems to be running relatively OK, although the log has some errors such as:
2022-07-13 19:00:18,583 - clearml.Auto-Scaler - ERROR - Error: SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2635)'), retrying in 15 seconds
Currently three VMs are running in GCP Compute Engine.

I then launched a new pipeline from https://clearml.slack.com/files/U03JT5JNS9M/F03PX2FSTK2/pipe_script.py (instead of cloning).
The (failed) pipeline task's console log is attached. It is still failing with:
Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
presumably because it executed docker run with --gpus all.

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

that's strange because, opening the currently running autoscaler config I see this:

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

Hi TimelyPenguin76
Thanks for working on this. The ClearML GCP autoscaler is a major feature for us to have. I can't really evaluate ClearML without some means of instantiating multiple agents on GCP machines, and I'd really prefer not to have to set up a k8s cluster with agents and manage scaling it myself.

I tried the settings above with two resources, one for the default queue and one for the services queue (making sure I use the image you suggested above for both).
The autoscaler started up without instantiating VM nodes, and the UI was updated to show zero VMs in use, as there were no tasks pending.

I then enqueued the pipeline from our conversation above (pipeline from decorators) by re-running it from the ClearML pipelines UI, and the autoscaler crashed trying to spin up a node. Attached is the full log file.

Looking at GCP, I see two VMs were started. Looking at the logs of one of the worker VMs, things don't seem to have gone right on that end either. I've attached the log.

This is the resource_configurations that the autoscaler task shows I used:
[{"resource_name": "gcp_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 3, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "gcp_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 2, "queue_name": "services", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}]
The extra_vm_bash_script matches what you suggested I use:
sudo apt install apt-transport-https ca-certificates curl software-properties-common -y
curl -fsSL | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] lsb_release -cstest"
sudo apt update
sudo apt install docker-ce -y

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

I'll do a clean relaunch of everything (scaler and pipeline)

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

So to run with your current setup - you need to change the default docker image to python:3.9-bullseye as mentioned.

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

Hi PanickyMoth78, I noticed something: you're running in GPU mode but the default docker image is a CUDA-dependent one. This might be causing the failures. Please try python:3.9-bullseye as the default docker image for the autoscaler.
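
The mismatch described here (a CUDA-based default docker image paired with resources that have no GPU) can be checked before launching. The following is only a rough sketch: `find_image_mismatches` is a hypothetical helper, not part of ClearML, and the rule "an image name starting with nvidia/ or containing cuda needs a GPU" is an assumed heuristic. The resource entries are trimmed to the relevant fields from the configuration posted in this thread.

```python
import json


def find_image_mismatches(resource_configurations, default_docker_image):
    """Return names of cpu_only resources that would run a GPU-style image.

    Hypothetical helper, not a ClearML API.
    """
    image = default_docker_image.lower()
    # Assumed heuristic: NVIDIA/CUDA image names imply a GPU requirement.
    looks_like_gpu_image = image.startswith("nvidia/") or "cuda" in image
    if not looks_like_gpu_image:
        return []
    return [
        resource["resource_name"]
        for resource in resource_configurations
        if resource.get("cpu_only")
    ]


# Trimmed copy of the resource configuration quoted in this thread.
resources = json.loads("""
[
  {"resource_name": "cpu_default", "machine_type": "n1-standard-1",
   "cpu_only": true, "gpu_type": null, "queue_name": "default"},
  {"resource_name": "cpu_services", "machine_type": "n1-standard-1",
   "cpu_only": true, "gpu_type": null, "queue_name": "services"}
]
""")

for image_name in ("nvidia/cuda:10.2-runtime-ubuntu18.04", "python:3.9-bullseye"):
    mismatches = find_image_mismatches(resources, image_name)
    print(image_name, "->", mismatches or "no mismatch found")
```

With the CUDA image both CPU-only resources are flagged; with python:3.9-bullseye none are. Note that an image set on the task itself could still override the autoscaler default, which this sketch does not cover.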

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

Will check now and will send you the machine image + full configuration I used

Machine image: projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131

Extra vm bash script:
sudo apt install apt-transport-https ca-certificates curl software-properties-common -y
curl -fsSL | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] lsb_release -cstest"
sudo apt update
sudo apt install docker-ce -y

  				
Posted 
	3 years ago

					TimelyPenguin76
				
					0
					 Administrator

Trying to switch to resources using GPU-enabled VMs failed with the same error as above.
Looking at the spawned VMs, the autoscaler created them without a GPU, even though I checked that my settings (n1-standard-1, nvidia-tesla-t4, and the https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to create VM instances, and my GCP autoscaler configuration seems proper:
[{"resource_name": "gpu_default3", "machine_type": "n1-standard-1", "cpu_only": false, "gpu_type": "nvidia-tesla-t4", "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10", "disk_size_gb": 100}, {"resource_name": "gpu_services3", "machine_type": "n1-standard-1", "cpu_only": false, "gpu_type": "nvidia-tesla-t4", "gpu_count": 1, "preemptible": false, "num_instances": 1, "queue_name": "services", "source_image": "projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10", "disk_size_gb": 100}]

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

Hi PanickyMoth78, thanks for the logs. I think I know the issue and I'm trying to reproduce it on my side. I'll keep you updated.

  				
Posted 
	3 years ago

					TimelyPenguin76
				
					0
					 Administrator

TimelyPenguin76, CostlyOstrich36, thanks again for trying to work through this.

How about we change approach to make things easier?

Can you give me instructions on how to start a GCP autoscaler of your choice that would work with a ClearML pipeline example such as the one I shared earlier, https://clearml.slack.com/files/U03JT5JNS9M/F03PX2FSTK2/pipe_script.py ?

At this point, I just want to see an autoscaler that actually works. I'd need resources for the two queues that pipelines use, default and services, and I don't mind whether or not they have GPUs at this stage.

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

I'll try a more carefully checked run a bit later but I know it's getting a bit late in your time zone

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

I believe n1-standard-8 would work for that. I initially just tried going with the autoscaler defaults, which have GPU on but n1-standard-1 specified as the machine type.

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

Is there any chance the experiment itself has a docker image specified? This might be overriding. Otherwise I would suggest shutting this one down and re-launching it 🙂

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

I noticed that the base docker image does not appear in the autoscaler task's configuration_object, which is:
[{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": "", "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": "", "gpu_count": 1, "preemptible": false, "num_instances": 1, "queue_name": "services", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}]

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

Switching the base image seems to have failed with the following error:
2022-07-13 14:31:12 Unable to find image 'nvidia/cuda:10.2-runtime-ubuntu18.04' locally
Attached is a pipeline task log file.

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

Can you try relaunching it just to make sure?

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

Did you mean that I was running in CPU mode? I've tried both, but I'll try CPU mode with that base docker image.

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

I'll give it a try.
And if I wanted to support GPU in the default queue, are you saying that I'd need a different machine from the n1-standard-1 ?

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

I would also be interested in a GCP autoscaler, I did not know it was possible/available yet.

  				
Posted 
	3 years ago

					HurtWoodpecker30
				
					0
					 × 1

I can try switching to GPU-enabled machines just to see if that path can be made to work, but the services queue shouldn't need a GPU, so I hope we can figure out how to run the pipeline task on CPU nodes.

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

You're still using both n1-standard-1 and nvidia/cuda:10.2-runtime-ubuntu18.04

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

I noticed that the base docker image does not appear in the autoscaler task's configuration_object

It should appear in the General section

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

Here are screenshots of a VM I started with a GPU and one started by the autoscaler with the settings above but whose GPU is missing (both in the same GCP zone, us-central1-f). I may have misconfigured something, or perhaps the autoscaler is failing to specify the GPU requirement correctly. :shrug:

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1

`Status: Downloaded newer image for nvidia/cuda:10.2-runtime-ubuntu18.04
1657737108941 dynamic_aws:cpu_services:n1-standard-1:4834718519308496943 DEBUG docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
time="2022-07-13T18:31:45Z" level=error msg="error waiting for container: context canceled"`
As can be seen here 🙂
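
For anyone checking their own console logs for the same failure, a small scan like the one below can help. It is only a sketch: `lines_with_gpu_driver_error` is a hypothetical helper name, and the sample text is the log excerpt quoted above.

```python
import re

# Docker's message when a container requests GPUs but no GPU driver is available.
GPU_DRIVER_ERROR = re.compile(
    r'could not select device driver "" with capabilities: \[\[gpu\]\]'
)


def lines_with_gpu_driver_error(log_text):
    """Return the log lines that contain the docker GPU driver error."""
    return [line for line in log_text.splitlines() if GPU_DRIVER_ERROR.search(line)]


sample_log = '''Status: Downloaded newer image for nvidia/cuda:10.2-runtime-ubuntu18.04
1657737108941 dynamic_aws:cpu_services:n1-standard-1:4834718519308496943 DEBUG docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
time="2022-07-13T18:31:45Z" level=error msg="error waiting for container: context canceled"'''

for line in lines_with_gpu_driver_error(sample_log):
    print(line)
```

In the excerpt the matching line names the cpu_services resource on n1-standard-1, which is the resource that tried to run the CUDA image.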

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

From the screenshots provided you ticked 'cpu' mode AND I think the machine that you're using n1-standard-1 is a cpu only machine, if I'm not mistaken.

  				
Posted 
	3 years ago

					CostlyOstrich36
				
					0

Is there any chance the experiment itself has a docker image specified?

It does not as far as I know. The decorators do not have docker fields specified

  				
Posted 
	3 years ago

					PanickyMoth78
				
					0
					 × 1
