You are right. My colleague wrote it, I think starting from the AWS autoscaler.
I won't claim to be an expert on this, but could the autoscaler be on a different version than ClearML?
And are they the same tasks?
Are you using the OS autoscaler or the PRO version?
I started the autoscaler with a Python script @<1523701070390366208:profile|CostlyOstrich36>
created virtual environment CPython3.10.13.final.0-64 in 511ms
creator CPython3Posix(dest=/root/.clearml/venvs-builds/3.10, clear=False, no_vcs_ignore=False, global=True)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
added seed packages: pip==23.3.1, setuptools==69.0.2, wheel==0.42.0
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
Using cached repository in "/root/.clearml/vcs-cache/ai_dev.git.0081a6bc4d7afe6adde369e6aeab9406/ai_dev.git"
fatal: could not read Username for '': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository='git@github.com:iCardioAI/ai_dev.git', branch='mergetest', commit_id='b73e928267d027937f89ca80d21ad62357bacee5', tag='', docker_cmd='722044220531.dkr.ecr.us-west-2.amazonaws.com/models:training_image_tf2_13', entry_point='model_train_task.py', working_dir='clearml_tasks/model_training')
2) Check if remote-worker has valid credentials [see worker configuration file]
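For context on point 2: the agent reads its git credentials from clearml.conf. A minimal sketch of the relevant section (placeholder values, not the actual credentials; field names per the clearml-agent config):

agent {
    # Used when the agent clones/fetches the task's repository over HTTPS;
    # values below are placeholders
    git_user: "your-git-username"
    git_pass: "your-personal-access-token"
}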
Does it get the previously cloned code and then git pull from there?
I'm not sure I understand this config, is this an autoscaler for GCP or AWS?
@<1523704674534821888:profile|SourLion48> , what versions of clearml, the autoscaler (latest commit?) and server are you using? Also, the configuration of the autoscaler would be helpful.
And are they the same tasks?
I mean if you were to run the 'failing' task first, it would run, correct?
They are different tasks. I start a new task, but it can be the same commit sometimes.
configurations:
  extra_clearml_conf: 'sdk.aws.s3.region="us-west-2"
    agent.extra_docker_arguments=["--shm-size=90g"]
    agent.extra_docker_shell_script=["git config --global credential.helper ''cache --timeout=604800''",]'
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    gcp-v100:
      - - gcp-v100
        - 4
    gcp-l4:
      - - gcp-l4
        - 4
    gcp-cpu:
      - - gcp-cpu
        - 4
  resource_configurations:
    gcp-v100:
      disk_size: 300
      instance_type: n1-highmem-8
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      accelerator_type: nvidia-tesla-v100
    gcp-l4:
      disk_size: 300
      instance_type: g2-standard-12
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      accelerator_type: nvidia-l4
    gcp-cpu:
      disk_size: 4000
      instance_type: c2-standard-4
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      cpu_only: True
hyper_params:
  gcp_project: xxxxxxxxxxxxxxxxxx
  region: 'us-central1'
  zone: 'us-central1-a'
  cloud_credentials_key: xxxxxxxxxxx
  cloud_credentials_region: xxxxxxxxx
  cloud_credentials_secret: xxxxxxxxxxxxxxxxxxxxxx
  use_credentials_chain: false
  cloud_provider: ''
  default_docker_image: xxxxxxxxxxxxxxxxx
  git_pass: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  git_user: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  max_idle_time_min: 30
  max_spin_up_time_min: 30
  polling_interval_time_min: 0.5
  workers_prefix: 'gcp'
  iam_arn: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
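A note on the queues block in that config: my reading of the format (worth verifying against the autoscaler docs) is that each queue name maps to a list of [resource_configuration, max_instances] pairs, which is what the nested-list YAML encodes. Building the parsed structure by hand to illustrate:

```python
# What the "queues" mapping in the YAML above parses to (built by hand here):
queues = {
    "gcp-v100": [["gcp-v100", 4]],
    "gcp-l4": [["gcp-l4", 4]],
    "gcp-cpu": [["gcp-cpu", 4]],
}

# Each pair reads as: which resource_configurations entry to spin up,
# and how many instances at most may serve that queue.
resource, max_instances = queues["gcp-v100"][0]
```

So gcp-v100 here would allow up to 4 n1-highmem-8 machines, if that reading is right.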
Hi @<1523704674534821888:profile|SourLion48> , making sure I understand: you push a job into a queue that an autoscaler is listening to, the autoscaler spins up a machine, and the job runs. Afterwards, during the idle time, you push another job to the same queue; it is picked up by the machine the autoscaler already spun up, and that one fails?
Where did you get this autoscaler? I don't think a GCP autoscaler was released in the open source version.
@<1523701070390366208:profile|CostlyOstrich36>
Exactly, @<1523701070390366208:profile|CostlyOstrich36>
@<1523701070390366208:profile|CostlyOstrich36> I don't get that.
import yaml

from clearml.automation.auto_scaler import AutoScaler, ScalerConfig
from gcp_driver import GCPDriver

with open('gcp_autoscaler.yaml') as f:
    conf = yaml.load(f, Loader=yaml.SafeLoader)

driver = GCPDriver.from_config(conf)
conf = ScalerConfig.from_config(conf)
autoscaler = AutoScaler(conf, driver)
autoscaler.start()
That is the python code.
or can I put something like the following in clearml.conf?
cat << EOF >> ~/clearml.conf
agent.enable_git_ask_pass=true
agent.git_user="{GIT_USER}"
agent.git_pass="{GIT_PASSWORD}"
EOF
The
agent.git_user="{GIT_USER}"
agent.git_pass="{GIT_PASSWORD}"
lines already existed. I only added
agent.enable_git_ask_pass=true
It is an autoscaler for GCP. I think there are unnecessary configs left over from the AWS version.
Try to set agent.enable_git_ask_pass: true for the agent running inside the container; perhaps that will help.
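If the flag needs to reach the agent inside the spun-up machines, it could presumably go through the autoscaler's extra_clearml_conf field (a sketch, assuming that field is appended to the remote agent's clearml.conf):

configurations:
  extra_clearml_conf: 'agent.enable_git_ask_pass=true'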
@<1523701070390366208:profile|CostlyOstrich36> I have been exploring. The problem seems to happen when the docker container uses the cached repository directory.
Using cached repository in "/root/.clearml/vcs-cache/****.git.0081a6bc4d7afe6adde369e6aeab9406/****.git"
When it is inside that directory and tries to fetch, it asks for credentials; when it clones, it doesn't.
cloning: git@github.com:****/****.git
Using user/pass credentials - replacing ssh url 'git@github.com:****/****.git' with https url ''
It uses the cached repo when I start a task on an already-running worker, and it clones when it spins up a new worker. The cloning and fetching happen inside the docker container in both cases.
The question is: how can I disable the cache so that it always clones the repo?
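One option worth checking: clearml-agent exposes a vcs_cache section in clearml.conf, and disabling it should force a fresh clone every time (verify the key against your agent version's default config):

agent {
    vcs_cache {
        enabled: false  # don't reuse /root/.clearml/vcs-cache; always clone
    }
}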
I don't think I am on the PRO version. Is that a paid one?