Answered

Hello, I have a ClearML autoscaler setup. Previously, when a new task came up, an already running worker (if there was one) would take it, apply the new commit, and run the task. Now I get an error, so I can't run a task on an already running worker; it has to start a new worker in order to work. The error is:

fatal: could not read Username for '': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository. 
1) Make sure you pushed the requested commit:
(repository='git@github.com:iCardioAI/ai_dev.git', branch='mergetest', commit_id='47896212f9db6efa505add8cee04fc5d8e7daa6f', tag='', docker_cmd='722044220531.dkr.ecr.us-west-2.amazonaws.com/models:training_image_tf2_13', entry_point='task.py', working_dir='clearml_tasks/pipeline_evaluation')
2) Check if remote-worker has valid credentials [see worker configuration file]
  
  
Posted 10 months ago

Answers 25


Hi @<1523704674534821888:profile|SourLion48> , making sure I understand - you push a job into a queue that an autoscaler is listening to. A machine is spun up by the autoscaler, takes the job, and runs it. Afterwards, during the idle time, you push another job to the same queue; it is picked up by the machine that was already spun up by the autoscaler, and that one fails?

  
  
Posted 10 months ago

It is an autoscaler for GCP. I think there are some unnecessary configs left over from AWS.

  
  
Posted 10 months ago

@<1523701070390366208:profile|CostlyOstrich36>

  
  
Posted 10 months ago

I'm not sure I understand this config. Is this an autoscaler for GCP or AWS?

  
  
Posted 10 months ago

clearml==1.14.1
That is the version.

  
  
Posted 10 months ago

The worker machines are on gcp

  
  
Posted 10 months ago

@<1523701070390366208:profile|CostlyOstrich36> I have been exploring. The problem seems to occur when the docker container uses the cached repository directory:

Using cached repository in "/root/.clearml/vcs-cache/****.git.0081a6bc4d7afe6adde369e6aeab9406/****.git"

When the agent is inside that directory and tries to fetch, git asks for credentials; when it clones fresh, it doesn't.

cloning: git@github.com:****/****.git
Using user/pass credentials - replacing ssh url 'git@github.com:****/****.git' with https url ''

It uses the cached repo when the task starts on an already running worker, and clones when it spins up a new worker. The cloning and fetching happen inside the docker container in both cases.
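The "replacing ssh url" log line above suggests the agent rewrites the ssh URL to an https URL with embedded credentials before cloning, which would explain why a fresh clone succeeds without a prompt while a fetch in the cached repo (whose origin lacks credentials) fails. A rough sketch of that rewrite; the function name, regex, and example values are mine, not the agent's:

```python
import re

def ssh_to_https(url: str, user: str, token: str) -> str:
    # Mirror the rewrite shown in the agent log: git@host:org/repo.git
    # becomes https://user:token@host/org/repo.git
    m = re.match(r"git@([^:]+):(.+)", url)
    if not m:
        return url  # not an ssh-style url; leave it untouched
    host, path = m.groups()
    return f"https://{user}:{token}@{host}/{path}"

print(ssh_to_https("git@github.com:example-org/example.git", "me", "TOKEN"))
# https://me:TOKEN@github.com/example-org/example.git
```

A clone from such a URL never needs a terminal prompt, whereas `git fetch` inside the cached repo falls back to asking for a username, which is exactly what `terminal prompts disabled` then blocks.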

The question is: how can I disable the cache so that it always clones the repo?
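If the goal is to always clone, clearml-agent's default configuration has a `vcs_cache` section that can be switched off in `clearml.conf`; disabling it should force a fresh clone every time (section and key names are taken from the agent's sample config, so worth verifying against your agent version):

```hocon
agent {
    vcs_cache {
        # when false, the agent clones the repository fresh instead of
        # fetching into ~/.clearml/vcs-cache
        enabled: false
    }
}
```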

  
  
Posted 10 months ago

I don't think I am on the Pro version. Is that a paid one?

  
  
Posted 10 months ago

configurations:
  extra_clearml_conf: 'sdk.aws.s3.region="us-west-2"
                       agent.extra_docker_arguments=["--shm-size=90g"]
                       agent.extra_docker_shell_script=["git config --global credential.helper ''cache --timeout=604800''"]'
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    gcp-v100:
      - - gcp-v100
        - 4
    gcp-l4:
      - - gcp-l4
        - 4
    gcp-cpu:
      - - gcp-cpu
        - 4
  resource_configurations:
    gcp-v100:
      disk_size: 300
      instance_type: n1-highmem-8
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      accelerator_type: nvidia-tesla-v100
    gcp-l4:
      disk_size: 300
      instance_type: g2-standard-12
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      accelerator_type: nvidia-l4
    gcp-cpu:
      disk_size: 4000
      instance_type: c2-standard-4
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      cpu_only: True
hyper_params:
  gcp_project: xxxxxxxxxxxxxxxxxx
  region: 'us-central1'
  zone: 'us-central1-a'
  cloud_credentials_key: xxxxxxxxxxx
  cloud_credentials_region: xxxxxxxxx
  cloud_credentials_secret: xxxxxxxxxxxxxxxxxxxxxx
  use_credentials_chain: false
  cloud_provider: ''
  default_docker_image: xxxxxxxxxxxxxxxxx
  git_pass: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  git_user: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  max_idle_time_min: 30
  max_spin_up_time_min: 30
  polling_interval_time_min: 0.5
  workers_prefix: 'gcp'
  iam_arn: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  
  
Posted 10 months ago

They are different tasks. I start a new task, but it can sometimes be at the same commit.

  
  
Posted 10 months ago

import yaml

from clearml.automation.auto_scaler import AutoScaler, ScalerConfig
from gcp_driver import GCPDriver

# Load the autoscaler configuration
with open('gcp_autoscaler.yaml') as f:
    conf = yaml.load(f, Loader=yaml.SafeLoader)

# Build the GCP driver and the scaler config from the same yaml
driver = GCPDriver.from_config(conf)
scaler_conf = ScalerConfig.from_config(conf)

autoscaler = AutoScaler(scaler_conf, driver)
autoscaler.start()

That is the Python code.

  
  
Posted 10 months ago

Where did you get this autoscaler? I don't think a GCP autoscaler was released in the open source version.

  
  
Posted 10 months ago

That is the configuration yaml.

  
  
Posted 10 months ago

@<1523704674534821888:profile|SourLion48> , what versions of clearml, the autoscaler (latest commit?), and the server are you using? Also, the autoscaler configuration would be helpful.

  
  
Posted 10 months ago

created virtual environment CPython3.10.13.final.0-64 in 511ms
  creator CPython3Posix(dest=/root/.clearml/venvs-builds/3.10, clear=False, no_vcs_ignore=False, global=True)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.3.1, setuptools==69.0.2, wheel==0.42.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
Using cached repository in "/root/.clearml/vcs-cache/ai_dev.git.0081a6bc4d7afe6adde369e6aeab9406/ai_dev.git"
fatal: could not read Username for '': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository. 
1) Make sure you pushed the requested commit:
(repository='git@github.com:iCardioAI/ai_dev.git', branch='mergetest', commit_id='b73e928267d027937f89ca80d21ad62357bacee5', tag='', docker_cmd='722044220531.dkr.ecr.us-west-2.amazonaws.com/models:training_image_tf2_13', entry_point='model_train_task.py', working_dir='clearml_tasks/model_training')
2) Check if remote-worker has valid credentials [see worker configuration file]

Does it get the previously cloned code and then git pull from there?

  
  
Posted 10 months ago

Or can I put something like the following in clearml.conf?

cat << EOF >> ~/clearml.conf
agent.enable_git_ask_pass=true
agent.git_user="{GIT_USER}"
agent.git_pass="{GIT_PASSWORD}"
EOF

The

agent.git_user="{GIT_USER}"
agent.git_pass="{GIT_PASSWORD}"

lines already existed. I only added

agent.enable_git_ask_pass=true
  
  
Posted 10 months ago

@<1523701070390366208:profile|CostlyOstrich36> I don't get that.

  
  
Posted 10 months ago

I didn't write this conf, but it works.

  
  
Posted 10 months ago

Try setting agent.enable_git_ask_pass: true for the agent running inside the container; perhaps that will help.
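Since in this setup the agent configuration is injected through the autoscaler yaml, one place to put that flag might be the same extra_clearml_conf field used earlier in the thread (this mirrors the pattern shown in the posted config; whether the flag takes effect there should be verified):

```yaml
configurations:
  extra_clearml_conf: 'agent.enable_git_ask_pass=true'
```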

  
  
Posted 10 months ago

And are they the same tasks?

I mean if you were to run the 'failing' task first, it would run, correct?

  
  
Posted 10 months ago

You are right. My colleague wrote it, I think, starting from the AWS autoscaler.

  
  
Posted 10 months ago

I'm not saying I'm an expert in this, but could the autoscaler be on a different version than clearml?

  
  
Posted 10 months ago

And are they the same tasks?

Are you using the OS autoscaler or the PRO version?

  
  
Posted 10 months ago

I started the autoscaler with Python, @<1523701070390366208:profile|CostlyOstrich36> .

  
  
Posted 10 months ago

Exactly, @<1523701070390366208:profile|CostlyOstrich36>

  
  
Posted 10 months ago