Answered
Hi. I'd like to try the GCP autoscaler.
What permissions does the service account that I provide to ClearML need, and which GCP APIs should I enable in the GCP project? In the "GCP credentials" text box, do I provide the full contents of a credentials JSON file?

  
  
Posted 2 years ago

Answers 30


Hi TimelyPenguin76
Thanks for working on this. The clearml gcp autoscaler is a major feature for us to have. I can't really evaluate clearml without some means of instantiating multiple agents on GCP machines and I'd really prefer not to have to set up a k8 cluster with agents and manage scaling it myself.

I tried the settings above with two resources, one for default queue and one for the services queue (making sure I use that image you suggested above for both).
The autoscaler started up without instantiating VM nodes and the UI was updated showing zero VMs in use as there were no tasks pending.

I then enqueued the pipeline from our conversation above (pipeline from decorators) by re-running it from the clearml pipelines UI, and the autoscaler crashed trying to spin up a node. Attached is the full log file.

looking at GCP, I see two VMs were started. Looking at the logs of one of the worker VMs, things don't seem to have gone right on that end either. I've attached the log.

This is the resource_configurations that the autoscaler task shows I used:
[{"resource_name": "gcp_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 3, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "gcp_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 2, "queue_name": "services", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}]
The extra_vm_bash_script matches what you suggested I use:
sudo apt install apt-transport-https ca-certificates curl software-properties-common -y
curl -fsSL | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] lsb_release -cstest"
sudo apt update
sudo apt install docker-ce -y

  
  
Posted 2 years ago

So to run with your current setup - you need to change the default docker image to python:3.9-bullseye as mentioned.

  
  
Posted 2 years ago

Fair point

  
  
Posted 2 years ago

I'll try a more carefully checked run a bit later but I know it's getting a bit late in your time zone

  
  
Posted 2 years ago

You're still using both n1-standard-1 and nvidia/cuda:10.2-runtime-ubuntu18.04

  
  
Posted 2 years ago

Did you mean that I was running in CPU mode? I tried both, but I'll try CPU mode with that base docker image.

  
  
Posted 2 years ago

Is there any chance the experiment itself has a docker image specified? This might be overriding. Otherwise I would suggest shutting this one down and re-launching it 🙂

  
  
Posted 2 years ago

I'll do a clean relaunch of everything (scaler and pipeline)

  
  
Posted 2 years ago

Can you try relaunching it just to make sure?

  
  
Posted 2 years ago

I noticed that the base docker image does not appear in the autoscaler task's configuration_object

It should appear in the General section

  
  
Posted 2 years ago

I can try switching to gpu-enabled machines just to see if that path can be made to work but the services queue shouldn't need gpu so I hope we figure out running the pipeline task on cpu nodes

  
  
Posted 2 years ago

From the screenshots provided you ticked 'cpu' mode AND I think the machine that you're using n1-standard-1 is a cpu only machine, if I'm not mistaken.

  
  
Posted 2 years ago

so..
I restarted the autoscaler with this configuration object:
[{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": null, "gpu_count": 1, "preemptible": false, "num_instances": 2, "queue_name": "services", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}]
specifying the python:3.9-bullseye base image.
The autoscaler seems to be running relatively ok (The log has some errors such as 2022-07-13 19:00:18,583 - clearml.Auto-Scaler - ERROR - Error: SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2635)'), retrying in 15 seconds )
and currently three VMs are running in GCP compute engine

I then launched a new pipeline from https://clearml.slack.com/files/U03JT5JNS9M/F03PX2FSTK2/pipe_script.py (instead of cloning).
The (failed) pipeline task's console log is attached. It is still failing with:
Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
presumably because it executed docker run with --gpus all
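
For readers hitting the same message: Docker reports this error when a container is started with --gpus all on a host that has no NVIDIA container runtime installed. The sketch below only illustrates the idea of gating that flag on the resource's cpu_only setting. It is not ClearML agent code, and build_docker_command and its parameters are hypothetical names made up for this example.

```python
from typing import List


def build_docker_command(image: str, cpu_only: bool) -> List[str]:
    """Build a `docker run` argument list (hypothetical helper).

    GPUs are requested only when the resource is not CPU-only, so a
    CPU-only VM never receives `--gpus all`.
    """
    cmd = ["docker", "run", "--rm"]
    if not cpu_only:
        # Requires the NVIDIA container runtime on the host.
        cmd += ["--gpus", "all"]
    cmd.append(image)
    return cmd


cpu_cmd = build_docker_command("python:3.9-bullseye", cpu_only=True)
gpu_cmd = build_docker_command("nvidia/cuda:10.2-runtime-ubuntu18.04", cpu_only=False)
print(" ".join(cpu_cmd))
print(" ".join(gpu_cmd))
```

The first command carries no GPU request and should start on a plain CPU VM; the second one only works where the GPU runtime is present.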

  
  
Posted 2 years ago

TimelyPenguin76 , CostlyOstrich36 thanks again for trying to work through this.

How about we change approach to make things easier?

Can you give me instructions on how to start a GCP Autoscaler of your choice that would work with the clearml pipeline example such as the one I shared earlier https://clearml.slack.com/files/U03JT5JNS9M/F03PX2FSTK2/pipe_script.py ?

At this point, I just want to see an autoscaler that actually works. I'd need resources for the two queues that pipelines use, default and services, and I don't mind whether or not they have GPUs at this stage.

  
  
Posted 2 years ago

I noticed that the base docker image does not appear in the autoscaler task's configuration_object
which is:
[{"resource_name": "cpu_default", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": "", "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}, {"resource_name": "cpu_services", "machine_type": "n1-standard-1", "cpu_only": true, "gpu_type": "", "gpu_count": 1, "preemptible": false, "num_instances": 1, "queue_name": "services", "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131", "disk_size_gb": 100}]

  
  
Posted 2 years ago

Will check now and will send you the machine image + full configuration I used

Machine image: projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131

Extra vm bash script:
sudo apt install apt-transport-https ca-certificates curl software-properties-common -y
curl -fsSL | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] lsb_release -cstest"
sudo apt update
sudo apt install docker-ce -y

  
  
Posted 2 years ago

Hi PanickyMoth78 , thanks for the logs. I think I know the issue and I'm trying to reproduce it on my side. I'll keep you updated.

  
  
Posted 2 years ago

I'll give it a try.
And if I wanted to support GPU in the default queue, are you saying that I'd need a different machine from the n1-standard-1 ?

  
  
Posted 2 years ago

Hi PanickyMoth78 , I noticed something - you're running in GPU mode but the default docker is a Cuda dependent docker. This might be causing the failures. Please try with python:3.9-bullseye docker as the default docker for the autoscaler.
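
A quick way to spot this kind of mismatch before launching is to scan the resource configuration for entries marked cpu_only while the default docker image looks CUDA-based. The sketch below is only an illustration: find_cpu_cuda_mismatches is a hypothetical helper, the "cuda" substring test is a rough heuristic assumed for this example, and the config is trimmed to the fields the check reads.

```python
import json
from typing import Dict, List


def find_cpu_cuda_mismatches(resources: List[Dict], default_docker: str) -> List[str]:
    """Return names of cpu_only resources paired with a CUDA-looking image.

    Heuristic (assumption): an image is treated as CUDA-dependent when
    its name contains the substring "cuda".
    """
    looks_cuda = "cuda" in default_docker.lower()
    return [r["resource_name"] for r in resources if r.get("cpu_only") and looks_cuda]


# Trimmed version of the resource configuration posted in this thread.
config = json.loads('''[
  {"resource_name": "cpu_default", "machine_type": "n1-standard-1",
   "cpu_only": true, "queue_name": "default"},
  {"resource_name": "cpu_services", "machine_type": "n1-standard-1",
   "cpu_only": true, "queue_name": "services"}
]''')

print(find_cpu_cuda_mismatches(config, "nvidia/cuda:10.2-runtime-ubuntu18.04"))
print(find_cpu_cuda_mismatches(config, "python:3.9-bullseye"))
```

With the CUDA image both resources are flagged; with python:3.9-bullseye the list comes back empty.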

  
  
Posted 2 years ago

Status: Downloaded newer image for nvidia/cuda:10.2-runtime-ubuntu18.04

1657737108941 dynamic_aws:cpu_services:n1-standard-1:4834718519308496943 DEBUG docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
time="2022-07-13T18:31:45Z" level=error msg="error waiting for container: context canceled"
As can be seen here 🙂

  
  
Posted 2 years ago

that's strange because, opening the currently running autoscaler config I see this:

  
  
Posted 2 years ago

Here are screenshots of a VM I started with a GPU and one started by the autoscaler with the settings above but whose GPU is missing (both in the same GCP zone, us-central1-f ). I may have misconfigured something or perhaps the autoscaler is failing to specify the GPU requirement correctly. :shrug:

  
  
Posted 2 years ago

I believe n1-standard-8 would work for that. I initially just tried going with the autoscaler defaults, which have GPU on but n1-standard-1 specified as the machine type.

  
  
Posted 2 years ago

I think so, yes. You need a machine with a GPU - this is assuming I'm correct about the n1-standard-1 machine

  
  
Posted 2 years ago

On the bright side, we started off with agents failing to run on VMs so this is progress 🙂

  
  
Posted 2 years ago

Is there any chance the experiment itself has a docker image specified?

It does not as far as I know. The decorators do not have docker fields specified

  
  
Posted 2 years ago

switching the base image seems to have failed with the following error :
2022-07-13 14:31:12 Unable to find image 'nvidia/cuda:10.2-runtime-ubuntu18.04' locally
Attached is a pipeline task log file.

  
  
Posted 2 years ago

I would also be interested in a GCP autoscaler, I did not know it was possible/available yet.

  
  
Posted 2 years ago

Trying to switch to resources using GPU-enabled VMs failed with that same error above.
Looking at spawned VMs, they were spawned by the autoscaler without gpu even though I checked that my settings ( n1-standard-1 and nvidia-tesla-t4 and https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to make vm instances and my gcp autoscaler configuration seems proper:
[{"resource_name": "gpu_default3", "machine_type": "n1-standard-1", "cpu_only": false, "gpu_type": "nvidia-tesla-t4", "gpu_count": 1, "preemptible": false, "num_instances": 5, "queue_name": "default", "source_image": "projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10", "disk_size_gb": 100}, {"resource_name": "gpu_services3", "machine_type": "n1-standard-1", "cpu_only": false, "gpu_type": "nvidia-tesla-t4", "gpu_count": 1, "preemptible": false, "num_instances": 1, "queue_name": "services", "source_image": "projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10", "disk_size_gb": 100}]
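
As a sanity check on a configuration like the one above, the GPU-related fields can be tested for internal consistency before blaming the VM creation step. This is a sketch under assumptions: gpu_field_problems is a hypothetical helper, and the rule it applies (a resource with cpu_only set to false needs a non-empty gpu_type and a gpu_count of at least 1) is my reading of the field names, not documented autoscaler behaviour.

```python
import json
from typing import Dict, List


def gpu_field_problems(resource: Dict) -> List[str]:
    """List inconsistencies between cpu_only, gpu_type and gpu_count."""
    problems = []
    if not resource.get("cpu_only", False):
        if not resource.get("gpu_type"):
            problems.append("cpu_only is false but gpu_type is empty")
        if int(resource.get("gpu_count") or 0) < 1:
            problems.append("cpu_only is false but gpu_count is below 1")
    return problems


# GPU-related fields copied from the configuration posted above.
resources = json.loads('''[
  {"resource_name": "gpu_default3", "machine_type": "n1-standard-1",
   "cpu_only": false, "gpu_type": "nvidia-tesla-t4", "gpu_count": 1},
  {"resource_name": "gpu_services3", "machine_type": "n1-standard-1",
   "cpu_only": false, "gpu_type": "nvidia-tesla-t4", "gpu_count": 1}
]''')

for resource in resources:
    print(resource["resource_name"], gpu_field_problems(resource) or "ok")
```

Both entries pass this check, which fits the observation that the configuration itself looks proper and the GPU is being lost somewhere later.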

  
  
Posted 2 years ago

They're incompatible together as mentioned before

  
  
Posted 2 years ago