Hi everyone,
I have some questions regarding the ClearML aws_autoscaler.py script.
First one:
On the AWS machine, the agent runs with this command:
python -m clearml_agent --config-file /root/clearml.conf daemon --queue aws4gpu --docker nvidia/cuda:12.2.0-runtime-ubuntu22.04
However, the container spawned is:
805e06f198e8   nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04   "bash -c 'echo 'Bina…'"   20 seconds ago   Up Less than a second   sweet_carson
That's the image I used for the previous run of this task. I've since changed it in the configuration, but the change doesn't seem to apply.
My /root/clearml.conf configuration is:
agent.git_user = "username"
agent.git_pass = "ghp_***"
sdk {
    aws {
        s3 {
            key: "***"
            secret: "***"
        }
    }
    agent {
        default_docker: {
            image: "nvidia/cuda:12.2.0-runtime-ubuntu22.04"
        }
    }
}
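In case the nesting is the problem: as far as I understand (and I may be wrong here), agent is supposed to be a top-level section in clearml.conf rather than nested inside sdk, i.e. something like:

agent {
    default_docker: {
        image: "nvidia/cuda:12.2.0-runtime-ubuntu22.04"
    }
}

Should the image also be picked up from where I have it now, or does it have to be moved out of sdk?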
Second one:
For some reason, my AWS credentials aren't being passed into the task container. I get this error while trying to download weights from S3:
"Error downloading *bucket*/*path*.pth Reason: Unable to locate credentials."
However, the ClearML configuration inside the container (both ~/default_clearml.conf and /tmp/clearml.conf) contains this section:
"aws": {
"s3": {
"key": "***",
"secret": "***",
"region": "",
"credentials": []
}
}
If I exec into the container and create ~/.aws/credentials manually, it works fine, but that isn't persistent.
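A workaround I'm considering (assuming agent.extra_docker_arguments is the right key for passing extra docker run flags, and that boto3 will pick up the standard AWS environment variables) is to inject the credentials into the task container through extra_clearml_conf:

agent {
    # Assumption: these flags are forwarded to docker run, so the standard
    # AWS env vars become visible inside the task container.
    extra_docker_arguments: [
        "-e", "AWS_ACCESS_KEY_ID=***",
        "-e", "AWS_SECRET_ACCESS_KEY=***",
        "-e", "AWS_DEFAULT_REGION=eu-west-1"
    ]
}

But I'd prefer the sdk.aws.s3 credentials to be used directly; am I missing something there?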
Third one:
Regarding running aws_autoscaler as a service, I'm encountering an error upon launching:
2023-10-12 18:56:53,223 - clearml.auto_scaler - INFO - Up machines: defaultdict(<class 'int'>, {})
2023-10-12 18:56:53,223 - clearml.auto_scaler - INFO - Idle for 60.00 seconds
ClearML Monitor: GPU monitoring failed to get GPU reading, switching off GPU monitoring
Process terminated by the user
clearml_agent: ERROR: [Errno 2] No such file or directory: '/tmp/.clearmlagent_1_kk3f8gxg.tmp'
I've checked permissions and everything seems fine; temp files do get created in /tmp, but something still seems to be missing.
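For context, I launch the autoscaler roughly like this (from memory, so the exact flags may be slightly off):

python aws_autoscaler.py --config-file aws_autoscaler.yaml --remote

with the idea that --remote enqueues it to the services queue.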
My aws_autoscaler.yaml looks like this:
configurations:
  extra_clearml_conf: |
    sdk {
      aws {
        s3 {
          key: "***"
          secret: "***"
        }
      }
    }
    agent {
      default_docker: {
        image: "nvidia/cuda:12.2.0-runtime-ubuntu22.04"
      }
    }
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    aws4gpu:
    - - aws4gpu
      - 3
  resource_configurations:
    aws4gpu:
      ami_id: ami-***
      availability_zone: eu-west-1a
      ebs_device_name: /dev/sda1
      ebs_volume_size: 100
      ebs_volume_type: gp3
      instance_type: g4dn.4xlarge
      is_spot: true
      key_name: ***
      security_group_ids:
      - sg-***
hyper_params:
  cloud_credentials_key: ***
  cloud_credentials_region: eu-west-1
  cloud_credentials_secret: ***
  cloud_provider: ''
  default_docker_image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
  git_pass: ghp_***
  git_user: ***
  max_idle_time_min: 2
  max_spin_up_time_min: 30
  polling_interval_time_min: 1
  workers_prefix: dynamic_worker
Any ideas? Thanks.