Hi TimelyPenguin76
Thanks for working on this. The ClearML GCP autoscaler is a major feature for us. I can't really evaluate ClearML without some means of instantiating multiple agents on GCP machines, and I'd really prefer not to have to set up a k8s cluster with agents and manage scaling it myself.
I tried the settings above with two resources, one for the default queue and one for the services queue (making sure I use the image you suggested above for both).
The autoscaler started up without instantiating VM nodes and the UI was updated showing zero VMs in use as there were no tasks pending.
I then enqueued the pipeline from our conversation above (pipeline from decorators) by re-running it from the ClearML pipelines UI, and the autoscaler crashed trying to spin up a node. Attached is the full log file.
Looking at GCP, I see two VMs were started. Looking at the logs of one of the worker VMs, things don't seem to have gone right on that end either. I've attached the log.
This is the resource_configurations that the autoscaler task shows I used:

[
  {
    "resource_name": "gcp_default",
    "machine_type": "n1-standard-1",
    "cpu_only": true,
    "gpu_type": null,
    "gpu_count": 1,
    "preemptible": false,
    "num_instances": 3,
    "queue_name": "default",
    "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131",
    "disk_size_gb": 100
  },
  {
    "resource_name": "gcp_services",
    "machine_type": "n1-standard-1",
    "cpu_only": true,
    "gpu_type": null,
    "gpu_count": 1,
    "preemptible": false,
    "num_instances": 2,
    "queue_name": "services",
    "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131",
    "disk_size_gb": 100
  }
]
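One thing that stands out in this config: both entries set "cpu_only": true yet still carry "gpu_count": 1. That may be benign (the autoscaler may ignore gpu_count when cpu_only is set), but it's the kind of inconsistency that is cheap to sanity-check before launching. A throwaway sketch of such a check (my own helper, not part of ClearML):

```python
import json

# Trimmed-down version of the resource_configurations above
config = json.loads("""[
  {"resource_name": "gcp_default", "cpu_only": true, "gpu_type": null,
   "gpu_count": 1, "queue_name": "default"},
  {"resource_name": "gcp_services", "cpu_only": true, "gpu_type": null,
   "gpu_count": 1, "queue_name": "services"}
]""")

def config_warnings(resources):
    """Flag entries whose CPU/GPU fields contradict each other."""
    warnings = []
    for r in resources:
        if r.get("cpu_only") and r.get("gpu_count", 0) > 0:
            warnings.append(
                f"{r['resource_name']}: cpu_only=true but gpu_count={r['gpu_count']}")
        if not r.get("cpu_only") and not r.get("gpu_type"):
            warnings.append(
                f"{r['resource_name']}: GPU mode but gpu_type is empty")
    return warnings

for w in config_warnings(config):
    print("WARNING:", w)
```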
The extra_vm_bash_script matches what you suggested I use (the standard Docker CE install for Ubuntu; Slack stripped the download.docker.com URLs from the original message):

sudo apt install apt-transport-https ca-certificates curl software-properties-common -y
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) test"
sudo apt update
sudo apt install docker-ce -y
So to run with your current setup, you need to change the default docker image to python:3.9-bullseye, as mentioned.
I'll try a more carefully checked run a bit later, but I know it's getting a bit late in your time zone.
You're still using both n1-standard-1 and nvidia/cuda:10.2-runtime-ubuntu18.04
Did you mean that I was running in CPU mode? I tried both, but I'll try CPU mode with that base docker image.
Is there any chance the experiment itself has a docker image specified? This might be overriding it. Otherwise I would suggest shutting this one down and re-launching it 🙂
I'll do a clean relaunch of everything (scaler and pipeline)
Can you try relaunching it just to make sure?
I noticed that the base docker image does not appear in the autoscaler task's configuration_object
It should appear in the General section
I can try switching to GPU-enabled machines just to see if that path can be made to work, but the services queue shouldn't need a GPU, so I hope we can figure out running the pipeline task on CPU nodes.
From the screenshots provided, you ticked 'cpu' mode, and I think the machine you're using, n1-standard-1, is a CPU-only machine, if I'm not mistaken.
so..
I restarted the autoscaler with this configuration object, specifying the python:3.9-bullseye base image:

[
  {
    "resource_name": "cpu_default",
    "machine_type": "n1-standard-1",
    "cpu_only": true,
    "gpu_type": null,
    "gpu_count": 1,
    "preemptible": false,
    "num_instances": 5,
    "queue_name": "default",
    "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131",
    "disk_size_gb": 100
  },
  {
    "resource_name": "cpu_services",
    "machine_type": "n1-standard-1",
    "cpu_only": true,
    "gpu_type": null,
    "gpu_count": 1,
    "preemptible": false,
    "num_instances": 2,
    "queue_name": "services",
    "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131",
    "disk_size_gb": 100
  }
]
The autoscaler seems to be running relatively OK (the log has some errors, such as 2022-07-13 19:00:18,583 - clearml.Auto-Scaler - ERROR - Error: SSLError(1, '[SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:2635)'), retrying in 15 seconds), and currently three VMs are running in GCP Compute Engine.
I then launched a new pipeline from https://clearml.slack.com/files/U03JT5JNS9M/F03PX2FSTK2/pipe_script.py (instead of cloning).
The (failed) pipeline task's console log is attached. It is still failing with:
Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
presumably because it executed docker run with --gpus all
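For context, that daemon error is what docker run produces whenever it is passed --gpus all on a host that has no NVIDIA container runtime installed, so conceptually a launcher has to gate that flag on the resource's cpu_only setting. A minimal sketch of that gating (the build_docker_cmd helper and the resource dict shape are my own illustration, not ClearML's actual code):

```python
def build_docker_cmd(image, resource):
    """Build a `docker run` argv, adding --gpus only for GPU resources.

    `resource` mimics one entry of the autoscaler's resource_configurations.
    """
    cmd = ["docker", "run", "--rm"]
    # On a cpu_only resource the host has no NVIDIA runtime, so passing
    # `--gpus all` fails with: could not select device driver "" with
    # capabilities: [[gpu]]
    if not resource.get("cpu_only", True) and resource.get("gpu_count", 0) > 0:
        cmd += ["--gpus", "all"]
    cmd.append(image)
    return cmd

cpu_res = {"resource_name": "cpu_default", "cpu_only": True, "gpu_count": 1}
gpu_res = {"resource_name": "gpu_default", "cpu_only": False, "gpu_count": 1}

print(build_docker_cmd("python:3.9-bullseye", cpu_res))
print(build_docker_cmd("nvidia/cuda:10.2-runtime-ubuntu18.04", gpu_res))
```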
TimelyPenguin76 , CostlyOstrich36 thanks again for trying to work through this.
How about we change approach to make things easier?
Can you give me instructions on how to start a GCP autoscaler of your choice that would work with the ClearML pipeline example, such as the one I shared earlier https://clearml.slack.com/files/U03JT5JNS9M/F03PX2FSTK2/pipe_script.py ?
At this point, I just want to see an autoscaler that actually works (I'd need resources for the two queues, default and services, that pipelines use, and I don't mind whether or not they have GPUs at this stage).
I noticed that the base docker image does not appear in the autoscaler task's configuration_object
which is:

[
  {
    "resource_name": "cpu_default",
    "machine_type": "n1-standard-1",
    "cpu_only": true,
    "gpu_type": "",
    "gpu_count": 1,
    "preemptible": false,
    "num_instances": 5,
    "queue_name": "default",
    "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131",
    "disk_size_gb": 100
  },
  {
    "resource_name": "cpu_services",
    "machine_type": "n1-standard-1",
    "cpu_only": true,
    "gpu_type": "",
    "gpu_count": 1,
    "preemptible": false,
    "num_instances": 1,
    "queue_name": "services",
    "source_image": "projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131",
    "disk_size_gb": 100
  }
]
Will check now and will send you the machine image + full configuration I used
Machine image: projects/ubuntu-os-cloud/global/images/ubuntu-1804-bionic-v20220131
Extra vm bash script (the standard Docker CE install for Ubuntu; Slack stripped the download.docker.com URLs from the original message):

sudo apt install apt-transport-https ca-certificates curl software-properties-common -y
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) test"
sudo apt update
sudo apt install docker-ce -y
Hi PanickyMoth78, thanks for the logs. I think I know the issue; I'm trying to reproduce it on my side and will keep you updated.
I'll give it a try.
And if I wanted to support GPU in the default queue, are you saying that I'd need a different machine from the n1-standard-1?
Hi PanickyMoth78, I noticed something - you're running in GPU mode but the default docker image is CUDA-dependent. This might be causing the failures. Please try with python:3.9-bullseye as the default docker image for the autoscaler.
` Status: Downloaded newer image for nvidia/cuda:10.2-runtime-ubuntu18.04
1657737108941 dynamic_aws:cpu_services:n1-standard-1:4834718519308496943 DEBUG docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]].
time="2022-07-13T18:31:45Z" level=error msg="error waiting for container: context canceled" `
As can be seen here:
That's strange because, opening the currently running autoscaler config, I see this:
Here are screenshots of a VM I started with a GPU and one started by the autoscaler with the settings above but whose GPU is missing (both in the same GCP zone, us-central1-f). I may have misconfigured something, or perhaps the autoscaler is failing to specify the GPU requirement correctly. :shrug:
I believe n1-standard-8 would work for that. I initially just tried going with the autoscaler defaults, which have GPU on but with that n1-standard-1 specified as the machine.
I think so, yes. You need a machine with a GPU - this is assuming I'm correct about the n1-standard-1 machine.
On the bright side, we started off with agents failing to run on VMs, so this is progress 🙂
Is there any chance the experiment itself has a docker image specified?
It does not as far as I know. The decorators do not have docker fields specified
Switching the base image seems to have failed with the following error:
2022-07-13 14:31:12 Unable to find image 'nvidia/cuda:10.2-runtime-ubuntu18.04' locally
Attached is a pipeline task log file.
I would also be interested in a GCP autoscaler, I did not know it was possible/available yet.
Trying to switch to resources using GPU-enabled VMs failed with that same error above.
Looking at the spawned VMs, they were spawned by the autoscaler without a GPU, even though I checked that my settings (n1-standard-1, nvidia-tesla-t4, and the https://console.cloud.google.com/compute/imagesDetail/projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10?project=ml-tooling-test-external image for the VM) can be used to make VM instances manually, and my GCP autoscaler configuration seems proper:

[
  {
    "resource_name": "gpu_default3",
    "machine_type": "n1-standard-1",
    "cpu_only": false,
    "gpu_type": "nvidia-tesla-t4",
    "gpu_count": 1,
    "preemptible": false,
    "num_instances": 5,
    "queue_name": "default",
    "source_image": "projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10",
    "disk_size_gb": 100
  },
  {
    "resource_name": "gpu_services3",
    "machine_type": "n1-standard-1",
    "cpu_only": false,
    "gpu_type": "nvidia-tesla-t4",
    "gpu_count": 1,
    "preemptible": false,
    "num_instances": 1,
    "queue_name": "services",
    "source_image": "projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10",
    "disk_size_gb": 100
  }
]
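If the autoscaler builds its instance request from this config, the GPU would have to land in the Compute Engine guestAccelerators field of the instances.insert body (GPU instances also need onHostMaintenance set to TERMINATE, since they can't live-migrate). A rough sketch of that mapping, just to show what to look for (the function name and exact mapping are my illustration, not ClearML's code; on a spawned VM, `gcloud compute instances describe` should show a guestAccelerators entry if the GPU was actually requested):

```python
def instance_body_from_resource(resource, zone):
    """Sketch: translate one resource_configurations entry into the
    GPU-relevant part of a GCE instances.insert request body."""
    body = {
        "machineType": f"zones/{zone}/machineTypes/{resource['machine_type']}",
        "disks": [{
            "boot": True,
            "initializeParams": {
                "sourceImage": resource["source_image"],
                "diskSizeGb": resource["disk_size_gb"],
            },
        }],
    }
    if not resource["cpu_only"] and resource.get("gpu_type"):
        body["guestAccelerators"] = [{
            "acceleratorType": f"zones/{zone}/acceleratorTypes/{resource['gpu_type']}",
            "acceleratorCount": resource["gpu_count"],
        }]
        # GCE rejects GPU instances that allow live migration
        body["scheduling"] = {"onHostMaintenance": "TERMINATE"}
    return body

res = {"resource_name": "gpu_default3", "machine_type": "n1-standard-1",
       "cpu_only": False, "gpu_type": "nvidia-tesla-t4", "gpu_count": 1,
       "source_image": "projects/ml-images/global/images/c0-deeplearning-common-cu113-v20220701-debian-10",
       "disk_size_gb": 100}
print(instance_body_from_resource(res, "us-central1-f")["guestAccelerators"])
```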
They're incompatible, as mentioned before.