Answered

Hello, I have a ClearML autoscaler setup. Previously, when a new task came up, an already running worker (if there was one) would take it, apply the new commit, and run the task. Now I get an error, so I can't run a task on an already running worker; it has to start a new worker in order to work. The error is:

fatal: could not read Username for '': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository. 
1) Make sure you pushed the requested commit:
(repository='git@github.com:iCardioAI/ai_dev.git', branch='mergetest', commit_id='47896212f9db6efa505add8cee04fc5d8e7daa6f', tag='', docker_cmd='722044220531.dkr.ecr.us-west-2.amazonaws.com/models:training_image_tf2_13', entry_point='task.py', working_dir='clearml_tasks/pipeline_evaluation')
2) Check if remote-worker has valid credentials [see worker configuration file]
  
  
Posted 10 months ago

Answers 25


Hi @<1523704674534821888:profile|SourLion48> , making sure I understand - you push a job into a queue that an autoscaler is listening to. A machine is spun up by the autoscaler, takes the job, and runs it. Afterwards, during the idle time, you push another job to the same queue; it is picked up by the machine that was already spun up by the autoscaler, and that one fails?

  
  
Posted 10 months ago

It is an autoscaler for GCP. I think there are some unnecessary configs left over from AWS.

  
  
Posted 10 months ago

@<1523701070390366208:profile|CostlyOstrich36>

  
  
Posted 10 months ago

I'm not sure I understand this config. Is this an autoscaler for GCP or AWS?

  
  
Posted 10 months ago

clearml==1.14.1
That is the version.

  
  
Posted 10 months ago

The worker machines are on gcp

  
  
Posted 10 months ago

@<1523701070390366208:profile|CostlyOstrich36> I have been exploring. The problem seems to occur when the docker container uses the cached repository directory:

Using cached repository in "/root/.clearml/vcs-cache/****.git.0081a6bc4d7afe6adde369e6aeab9406/****.git"

When the agent is inside that directory and tries to fetch, git asks for credentials; when it clones fresh, it doesn't.

cloning: git@github.com:****/****.git
Using user/pass credentials - replacing ssh url 'git@github.com:****/****.git' with https url ''

It uses the cached repo when the task starts on an already running worker, and clones when it spins up a new worker. The cloning and fetching happen inside the docker container in both cases.
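The "replacing ssh url" log line above suggests the agent rewrites the ssh URL to an https URL with embedded credentials before cloning, which would explain why a fresh clone succeeds without a prompt while a fetch in the cached repo (whose origin lacks credentials) fails. A rough sketch of that rewrite; the function name, regex, and example values are mine, not the agent's:

```python
import re

def ssh_to_https(url: str, user: str, token: str) -> str:
    # Mirror the rewrite shown in the agent log: git@host:org/repo.git
    # becomes https://user:token@host/org/repo.git
    m = re.match(r"git@([^:]+):(.+)", url)
    if not m:
        return url  # not an ssh-style url; leave it untouched
    host, path = m.groups()
    return f"https://{user}:{token}@{host}/{path}"

print(ssh_to_https("git@github.com:example-org/example.git", "me", "TOKEN"))
# https://me:TOKEN@github.com/example-org/example.git
```

A clone from such a URL never needs a terminal prompt, whereas `git fetch` inside the cached repo falls back to asking for a username, which is exactly what `terminal prompts disabled` then blocks.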

The question is: how can I disable the cache so that it always clones the repo?
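If the goal is to always clone, clearml-agent's default configuration has a `vcs_cache` section that can be switched off in `clearml.conf`; disabling it should force a fresh clone every time (section and key names are taken from the agent's sample config, so worth verifying against your agent version):

```hocon
agent {
    vcs_cache {
        # when false, the agent clones the repository fresh instead of
        # fetching into ~/.clearml/vcs-cache
        enabled: false
    }
}
```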

  
  
Posted 10 months ago

I don't think I am on the Pro version. Is that a paid one?

  
  
Posted 10 months ago

configurations:
  extra_clearml_conf: 'sdk.aws.s3.region="us-west-2"
                       agent.extra_docker_arguments=["--shm-size=90g"]
                       agent.extra_docker_shell_script=["git config --global credential.helper ''cache --timeout=604800''"]'
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    gcp-v100:
      - - gcp-v100
        - 4
    gcp-l4:
      - - gcp-l4
        - 4
    gcp-cpu:
      - - gcp-cpu
        - 4
  resource_configurations:
    gcp-v100:
      disk_size: 300
      instance_type: n1-highmem-8
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      accelerator_type: nvidia-tesla-v100
    gcp-l4:
      disk_size: 300
      instance_type: g2-standard-12
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      accelerator_type: nvidia-l4
    gcp-cpu:
      disk_size: 4000
      instance_type: c2-standard-4
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      cpu_only: True
hyper_params:
  gcp_project: xxxxxxxxxxxxxxxxxx
  region: 'us-central1'
  zone: 'us-central1-a'
  cloud_credentials_key: xxxxxxxxxxx
  cloud_credentials_region: xxxxxxxxx
  cloud_credentials_secret: xxxxxxxxxxxxxxxxxxxxxx
  use_credentials_chain: false
  cloud_provider: ''
  default_docker_image: xxxxxxxxxxxxxxxxx
  git_pass: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  git_user: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  max_idle_time_min: 30
  max_spin_up_time_min: 30
  polling_interval_time_min: 0.5
  workers_prefix: 'gcp'
  iam_arn: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  
  
Posted 10 months ago

They are different tasks. I start a new task, but it can sometimes be at the same commit.

  
  
Posted 10 months ago

import yaml

from clearml.automation.auto_scaler import AutoScaler, ScalerConfig
from gcp_driver import GCPDriver

# Load the autoscaler configuration
with open('gcp_autoscaler.yaml') as f:
    conf = yaml.load(f, Loader=yaml.SafeLoader)

# Build the GCP driver and the scaler config from the same yaml
driver = GCPDriver.from_config(conf)
scaler_conf = ScalerConfig.from_config(conf)

autoscaler = AutoScaler(scaler_conf, driver)
autoscaler.start()

That is the Python code.

  
  
Posted 10 months ago

Where did you get this autoscaler? I don't think a GCP autoscaler was released in the open source version.

  
  
Posted 10 months ago

That is the configuration yaml.

  
  
Posted 10 months ago

@<1523704674534821888:profile|SourLion48> , what versions of clearml, the autoscaler (latest commit?), and the server are you using? Also, the autoscaler configuration would be helpful.

  
  
Posted 10 months ago

created virtual environment CPython3.10.13.final.0-64 in 511ms
  creator CPython3Posix(dest=/root/.clearml/venvs-builds/3.10, clear=False, no_vcs_ignore=False, global=True)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.3.1, setuptools==69.0.2, wheel==0.42.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
Using cached repository in "/root/.clearml/vcs-cache/ai_dev.git.0081a6bc4d7afe6adde369e6aeab9406/ai_dev.git"
fatal: could not read Username for '': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository. 
1) Make sure you pushed the requested commit:
(repository='git@github.com:iCardioAI/ai_dev.git', branch='mergetest', commit_id='b73e928267d027937f89ca80d21ad62357bacee5', tag='', docker_cmd='722044220531.dkr.ecr.us-west-2.amazonaws.com/models:training_image_tf2_13', entry_point='model_train_task.py', working_dir='clearml_tasks/model_training')
2) Check if remote-worker has valid credentials [see worker configuration file]

Does it get the previously cloned code and then git pull from there?

  
  
Posted 10 months ago

Or can I put something like the following in clearml.conf?

cat << EOF >> ~/clearml.conf
agent.enable_git_ask_pass=true
agent.git_user="{GIT_USER}"
agent.git_pass="{GIT_PASSWORD}"
EOF

The

agent.git_user="{GIT_USER}"
agent.git_pass="{GIT_PASSWORD}"

lines already existed. I only added

agent.enable_git_ask_pass=true
  
  
Posted 10 months ago

@<1523701070390366208:profile|CostlyOstrich36> I don't get that.

  
  
Posted 10 months ago

I didn't write this conf, but it works.

  
  
Posted 10 months ago

Try setting agent.enable_git_ask_pass: true for the agent running inside the container; perhaps that will help.
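Since in this setup the agent configuration is injected through the autoscaler yaml, one place to put that flag might be the same extra_clearml_conf field used earlier in the thread (this mirrors the pattern shown in the posted config; whether the flag takes effect there should be verified):

```yaml
configurations:
  extra_clearml_conf: 'agent.enable_git_ask_pass=true'
```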

  
  
Posted 10 months ago

And are they the same tasks?

I mean if you were to run the 'failing' task first, it would run, correct?

  
  
Posted 10 months ago

You are right. My colleague wrote it, I think, starting from the AWS autoscaler.

  
  
Posted 10 months ago

I'm not saying I'm an expert in this, but could the autoscaler be on a different version than clearml?

  
  
Posted 10 months ago

And are they the same tasks?

Are you using the OS autoscaler or the PRO version?

  
  
Posted 10 months ago

I started the autoscaler with Python, @<1523701070390366208:profile|CostlyOstrich36> .

  
  
Posted 10 months ago

Exactly, @<1523701070390366208:profile|CostlyOstrich36>

  
  
Posted 10 months ago