@<1523701070390366208:profile|CostlyOstrich36> I have been exploring. The problem seems to occur when the Docker container uses the cached repository dir.
Using cached repository in "/root/.clearml/vcs-cache/****.git.0081a6bc4d7afe6adde369e6aeab9406/****.git"
When it is inside that directory and tries to fetch, it asks for credentials. When it clones, it doesn't.
cloning: git@github.com:****/****.git
Using user/pass credentials - replacing ssh url 'git@github.com:****/****.git' with https url ''
It uses the cached repo when it starts on an already-running worker, and it clones when it spins up a new worker. In both cases the cloning and fetching happen inside the Docker container.
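Roughly, this is what seems to happen inside the container (my reconstruction from the logs above, not the agent's exact commands; the user/pass placeholders stand in for the real credentials):
# fresh worker: full clone, with the ssh url rewritten to https with embedded credentials - this works
git clone https://<git_user>:<git_pass>@github.com/****/****.git
# warm worker: reuse the vcs-cache copy and fetch - this is where it asks for credentials
cd /root/.clearml/vcs-cache/****.git.0081a6bc4d7afe6adde369e6aeab9406/****.git
git fetch --all --recurse-submodules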
The question is: how can I disable the cache so that it always clones the repo?
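For example, is there a cache switch like this for clearml.conf? (I'm guessing at the key name here, so treat it as unverified:)
agent.vcs_cache.enabled=false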
Or can I put something like the following in clearml.conf?
cat << EOF >> ~/clearml.conf
agent.enable_git_ask_pass=true
agent.git_user="{GIT_USER}"
agent.git_pass="{GIT_PASSWORD}"
EOF
The
agent.git_user="{GIT_USER}"
agent.git_pass="{GIT_PASSWORD}"
already existed. I only added
`agent.enable_git_ask_pass=true`
created virtual environment CPython3.10.13.final.0-64 in 511ms
creator CPython3Posix(dest=/root/.clearml/venvs-builds/3.10, clear=False, no_vcs_ignore=False, global=True)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
added seed packages: pip==23.3.1, setuptools==69.0.2, wheel==0.42.0
activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
Using cached repository in "/root/.clearml/vcs-cache/ai_dev.git.0081a6bc4d7afe6adde369e6aeab9406/ai_dev.git"
fatal: could not read Username for '': terminal prompts disabled
error: Could not fetch origin
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
clearml_agent: ERROR: Failed cloning repository.
1) Make sure you pushed the requested commit:
(repository='git@github.com:iCardioAI/ai_dev.git', branch='mergetest', commit_id='b73e928267d027937f89ca80d21ad62357bacee5', tag='', docker_cmd='722044220531.dkr.ecr.us-west-2.amazonaws.com/models:training_image_tf2_13', entry_point='model_train_task.py', working_dir='clearml_tasks/model_training')
2) Check if remote-worker has valid credentials [see worker configuration file]
Does it get the previously cloned code and then git pull from there?
@<1523701070390366208:profile|CostlyOstrich36> I don't get that.
Try to set agent.enable_git_ask_pass: true
for the agent running inside the container, perhaps that will help
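If the agent is brought up by the autoscaler, one way might be to inject it through the extra_clearml_conf field of the autoscaler config (a sketch - the same mechanism as the agent settings you already pass there):
extra_clearml_conf: 'agent.enable_git_ask_pass=true'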
import yaml

from clearml.automation.auto_scaler import AutoScaler, ScalerConfig
from gcp_driver import GCPDriver  # in-house driver, adapted from the AWS one

# Load the autoscaler configuration from YAML
with open('gcp_autoscaler.yaml') as f:
    conf = yaml.load(f, Loader=yaml.SafeLoader)

# Build the cloud driver and scaler config from the same file, then start polling the queues
driver = GCPDriver.from_config(conf)
scaler_conf = ScalerConfig.from_config(conf)
autoscaler = AutoScaler(scaler_conf, driver)
autoscaler.start()
That is the Python code.
You are right. My colleague wrote it, I think starting from the AWS autoscaler.
Where did you get this autoscaler? I don't think a GCP autoscaler was released in the open source.
It is an autoscaler for GCP. I think there are unnecessary configs in there that were carried over from AWS.
I'm not sure I understand this config - is this an autoscaler for GCP or AWS?
@<1523701070390366208:profile|CostlyOstrich36>
I don't claim to be an expert on this, but is the autoscaler versioned separately from ClearML itself?
configurations:
  extra_clearml_conf: 'sdk.aws.s3.region="us-west-2"
    agent.extra_docker_arguments=["--shm-size=90g"]
    agent.extra_docker_shell_script=["git config --global credential.helper ''cache --timeout=604800''",]'
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    gcp-v100:
      - - gcp-v100
        - 4
    gcp-l4:
      - - gcp-l4
        - 4
    gcp-cpu:
      - - gcp-cpu
        - 4
  resource_configurations:
    gcp-v100:
      disk_size: 300
      instance_type: n1-highmem-8
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      accelerator_type: nvidia-tesla-v100
    gcp-l4:
      disk_size: 300
      instance_type: g2-standard-12
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      accelerator_type: nvidia-l4
    gcp-cpu:
      disk_size: 4000
      instance_type: c2-standard-4
      source_image: projects/ml-images/global/images/c0-deeplearning-common-gpu-v20231105-debian-11-py310
      cpu_only: True
hyper_params:
  gcp_project: xxxxxxxxxxxxxxxxxx
  region: 'us-central1'
  zone: 'us-central1-a'
  cloud_credentials_key: xxxxxxxxxxx
  cloud_credentials_region: xxxxxxxxx
  cloud_credentials_secret: xxxxxxxxxxxxxxxxxxxxxx
  use_credentials_chain: false
  cloud_provider: ''
  default_docker_image: xxxxxxxxxxxxxxxxx
  git_pass: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  git_user: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  max_idle_time_min: 30
  max_spin_up_time_min: 30
  polling_interval_time_min: 0.5
  workers_prefix: 'gcp'
  iam_arn: xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
@<1523704674534821888:profile|SourLion48> , what versions of clearml, the autoscaler (latest commit?), and the server are you using? Also, the autoscaler configuration would be helpful.
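You can check those with something like (assuming the autoscaler runs from a git checkout - adjust the path):
pip show clearml clearml-agent        # installed package versions
git -C /path/to/autoscaler log -1     # latest autoscaler commit
The server version should be visible at the bottom of the Web UI settings page.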
I started the autoscaler with Python. @<1523701070390366208:profile|CostlyOstrich36>
I don't think I am on the PRO version - is that a paid one?
They are different tasks. I start a new task, but sometimes it can be the same commit.
And are they the same tasks?
I mean if you were to run the 'failing' task first, it would run, correct?
Are you using the OS autoscaler or the PRO version?
Exactly, @<1523701070390366208:profile|CostlyOstrich36>
Hi @<1523704674534821888:profile|SourLion48> , making sure I understand: you push a job into a queue that an autoscaler is listening to. The autoscaler spins up a machine, which picks up the job and runs it. Afterwards, during the idle time, you push another job to the same queue; it is picked up by the machine the autoscaler spun up, and that one fails?