Hey, I am able to spin up the GCP instance using the autoscaler. I wanted to confirm one thing: does the autoscaler spin up the agent automatically in the VM, or do I need to add a script for that to the bash script?
I did provide the credentials. Also, I am running the autoscaler for the first time, so no, it hasn't worked before
Also @<1523701087100473344:profile|SuccessfulKoala55> when the autoscaler spins up my GCP instance and I look inside it, I am not able to find the clearml.conf file. Does it not install clearml automatically when it spins up the VM?
Also, I was facing another issue: the task is not able to clone the GitHub repo. It shows an authentication error even though I have passed my git credentials
Hi @<1610083503607648256:profile|DiminutiveToad80> , apologies for the delay - is it possible that a T4 is not available in the zone you're configuring?
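One way to answer the availability question is to ask the Compute Engine API which accelerator types exist in the zone. Below is a minimal sketch, assuming `compute` is a client built with `googleapiclient.discovery.build("compute", "v1")` and already authorized; the helper name `list_accelerator_names` is hypothetical and not part of ClearML or the Google client.

```python
def list_accelerator_names(compute, project, zone):
    """Return the accelerator type names offered in one zone.

    `compute` is assumed to be a Compute Engine v1 client, e.g. the
    result of googleapiclient.discovery.build("compute", "v1").
    """
    names = []
    request = compute.acceleratorTypes().list(project=project, zone=zone)
    while request is not None:
        response = request.execute()
        # Each item describes one accelerator type; "name" is the
        # identifier expected in an instance launch spec.
        names.extend(item["name"] for item in response.get("items", []))
        # list_next returns None once there are no more pages.
        request = compute.acceleratorTypes().list_next(
            previous_request=request, previous_response=response
        )
    return names
```

If the name entered in the autoscaler wizard is not in the returned list for that zone, a 404 like the one in the log would be expected.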
Well, the VM is running in the default docker nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04, but it's not spinning up the agent when the VM is initialized
@<1610083503607648256:profile|DiminutiveToad80> how is this section of the autoscaler wizard configured?
I was able to set up a GCP VM manually earlier, like without the autoscaler
Thanks!
Hmm from here : None
Could it be that you do not have privileges for the resource, or that you did not provide credentials?
Did that autoscaler work before?
While creating GCP credentials using None
What values should I insert in the following step so that the autoscaler has access? As of now I have left this field blank
Good to hear - what did you change?
Regarding your question, the autoscaler will automatically inject a startup script to do that for you, but you will need to make sure the VM contains docker
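To check the docker requirement on a machine image before pointing the autoscaler at it, a quick test can be run inside a VM booted from that image. This is only a sketch: the helper name `executable_available` is hypothetical, and it checks only that a `docker` binary is on the `PATH`, not that the daemon is running.

```python
import shutil


def executable_available(name="docker"):
    """Return True if an executable called `name` is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if executable_available("docker"):
        print("docker found on PATH")
    else:
        print("docker not found; the injected startup script may fail")
```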
Funny thing: I was making a typo when writing the GPU type. I was writing NVIDIA T4 instead of nvidia-tesla-t4
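A typo like this can be caught before launching anything. The sketch below assumes that accelerator type names follow the usual Compute Engine resource-name shape (lowercase letters, digits and hyphens, starting with a letter); the function name `looks_like_resource_name` is hypothetical. Passing this check does not prove the type exists in the zone, only that the string is plausibly formatted.

```python
import re

# Assumed pattern: lowercase letter first, then lowercase letters,
# digits or hyphens, not ending in a hyphen.
_RESOURCE_NAME_RE = re.compile(r"[a-z]([-a-z0-9]*[a-z0-9])?")


def looks_like_resource_name(value):
    """Return True if `value` is shaped like a GCP resource name."""
    return _RESOURCE_NAME_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("NVIDIA T4", "nvidia-tesla-t4"):
        print(candidate, looks_like_resource_name(candidate))
```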
Hi @<1610083503607648256:profile|DiminutiveToad80>
I think we will need more context for the log...
but I think there is something wrong with the GCP resource configuration of your autoscaler
Can you send the full autoscaler log and the configuration ?
2023-10-03 20:46:07,100 - clearml.Auto-Scaler - INFO - Spinning new instance resource='clearml-autoscaler-vm', prefix='dynamic_gcp', queue='default'
2023-10-03 20:46:07,107 - googleapiclient.discovery_cache - INFO - file_cache is only supported with oauth2client<4.0.0
2023-10-03 20:46:07,122 - clearml.Auto-Scaler - INFO - Creating regular instance for resource clearml-autoscaler-vm
2023-10-03 20:46:07,264 - clearml.Auto-Scaler - INFO - --- Cloud instances (0):
2023-10-03 20:46:07,482 - clearml.Auto-Scaler - INFO - stopping
2023-10-03 20:46:07,482 - clearml.Auto-Scaler - INFO - state change: State.RUNNING -> State.STOPPED
2023-10-03 20:46:07,482 - clearml.Auto-Scaler - INFO - Autoscaler exits
2023-10-03 20:46:07,556 - clearml.Auto-Scaler - ERROR - Failed to start new instance (resource 'clearml-autoscaler-vm'), Error: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/../zones/ returned "The resource 'projects/../zones/us-central1-a/ was not found". Details: "[{'message': "The resource 'projects/../zones/ was not found", 'domain': 'global', 'reason': 'notFound'}]">
Traceback (most recent call last):
File "/root/.clearml/venvs-builds/3/task_repository/clearml-apps.git/apps/auto_scaler/auto_scaler.py", line 742, in launch_one
instance_id = self.driver.spin_up_worker(resource_conf, worker_prefix, queue, task_id=task_id)
File "/root/.clearml/venvs-builds/3/task_repository/clearml-apps.git/apps/auto_scaler/cloud_driver.py", line 281, in spin_up_worker
instance_id, region = self._spin_up_worker(resource_conf, worker_prefix, queue_name, task_id)
File "/root/.clearml/venvs-builds/3/task_repository/clearml-apps.git/apps/auto_scaler/gcp_driver.py", line 194, in _spin_up_worker
exc, response = f(*args)
File "/root/.clearml/venvs-builds/3/task_repository/clearml-apps.git/apps/auto_scaler/networking.py", line 18, in wrapper
return func(obj_instance, *args, **kwargs)
File "/root/.clearml/venvs-builds/3/task_repository/clearml-apps.git/apps/auto_scaler/gcp_driver.py", line 137, in attempt_launch
spin_up_client.instances().insert(project=self.gcp_project_id, zone=zone, body=launch_spec).execute()
File "/root/venv/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
return wrapped(*args, **kwargs)
File "/root/venv/lib/python3.8/site-packages/googleapiclient/http.py", line 937, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/zones/../ returned "The resource 'projects/../zones/' was not found". Details: "[{'message': "The resource 'projects/../zones/us-central1-a/acceleratorTypes/NVIDIA T4' was not found", 'domain': 'global', 'reason': 'notFound'}]">
1696365971074 apps-agent-i-08bf8b26b6175ea1f-1:service:8d816e475307473885aaa87b52a5c526 DEBUG Process aborted by user
@<1610083503607648256:profile|DiminutiveToad80> if you're using GCP, there's some machine image you should be specifying for the machine - the docker image is only used later by the agent, when the agent is running. Can you please elaborate on exactly what is starting inside the instance, and share logs to show it?
So, I was able to resolve the above issues