Answered
Hi everyone, I have some questions regarding ClearML aws_autoscaler.py

Hi everyone,
I have some questions regarding clearml aws_autoscaler.py.

First one:

On the AWS machine, the agent runs with this command:
python -m clearml_agent --config-file /root/clearml.conf daemon --queue aws4gpu --docker nvidia/cuda:12.2.0-runtime-ubuntu22.04

However, the container spawned is:
805e06f198e8 nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 "bash -c 'echo 'Bina…'" 20 seconds ago Up Less than a second sweet_carson

It's the previous container I used for the task. I've since changed it in the configuration, but the change doesn't seem to apply.

My /root/clearml.conf configuration is:

agent.git_user = "username"
agent.git_pass = "ghp_***"

sdk {
  aws {
    s3 {
      key: "***"
      secret: "***"
    }
  }
  agent {
    default_docker: {
      image: "nvidia/cuda:12.2.0-runtime-ubuntu22.04"
    }
  }
}

Second one:

For some reason, I'm not getting my AWS credentials imported inside the task container. I'm getting this error while trying to download weights from S3:

"Error downloading *bucket*/*path*.pth Reason: Unable to locate credentials."

However, my ClearML conf inside the container (~/default_clearml.conf and /tmp/clearml.conf) contains this section:

"aws": {
  "s3": {
    "key": "***",
    "secret": "***",
    "region": "",
    "credentials": []
  }
}

If I exec into the container and create ~/.aws/credentials manually, it works fine, but that's not persistent.
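One workaround I haven't tried yet (just a sketch — I'm assuming the agent honors `agent.extra_docker_arguments` from its clearml.conf, and the `***` values are placeholders) would be injecting the keys as environment variables into every task container, since boto3's default credential chain reads them:

```
# Hedged clearml.conf fragment (agent side) -- placeholder values
agent {
    extra_docker_arguments: [
        "-e", "AWS_ACCESS_KEY_ID=***",
        "-e", "AWS_SECRET_ACCESS_KEY=***",
    ]
}
```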

Third one:

When running aws_autoscaler as a service, I'm getting this error on launch:

2023-10-12 18:56:53,223 - clearml.auto_scaler - INFO - Up machines: defaultdict(<class 'int'>, {})
2023-10-12 18:56:53,223 - clearml.auto_scaler - INFO - Idle for 60.00 seconds
ClearML Monitor: GPU monitoring failed to get GPU reading, switching off GPU monitoring
Process terminated by the user
clearml_agent: ERROR: [Errno 2] No such file or directory: '/tmp/.clearmlagent_1_kk3f8gxg.tmp'

I've checked permissions - everything seems fine, and temp files are created inside the directory. But something seems to be missing.

My aws_autoscaler.yaml looks like this:

configurations:
  extra_clearml_conf: |
    sdk {
      aws {
        s3 {
          key: "***"
          secret: "***"
        }
      }
    }
    agent {
      default_docker: {
        image: "nvidia/cuda:12.2.0-runtime-ubuntu22.04"
      }
    }
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    aws4gpu:
    - - aws4gpu
      - 3
  resource_configurations:
    aws4gpu:
      ami_id: ami-***
      availability_zone: eu-west-1a
      ebs_device_name: /dev/sda1
      ebs_volume_size: 100
      ebs_volume_type: gp3
      instance_type: g4dn.4xlarge
      is_spot: true
      key_name: ***
      security_group_ids:
      - sg-***
hyper_params:
  cloud_credentials_key: ***
  cloud_credentials_region: eu-west-1
  cloud_credentials_secret: ***
  cloud_provider: ''
  default_docker_image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
  git_pass: ghp_***
  git_user: ***
  max_idle_time_min: 2
  max_spin_up_time_min: 30
  polling_interval_time_min: 1
  workers_prefix: dynamic_worker

Any ideas? Thanks.

  
  
Posted one year ago

Answers 5


Hi @FrustratingBee69

It's the previous container I've used for the task.

Notice that what you are configuring is the default container, i.e. the one the agent uses only if the Task does not request a specific container.
On the Task itself (see the Execution tab, under Container Image) you set the specific container for that Task. After a Task runs on an agent, the agent stores the container it actually used back on the Task. So if you cloned a previous Task that ran with the CUDA container, the clone will use that same container, regardless of a different default container.
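The selection logic can be sketched like this (my simplification in plain Python, not the agent's actual code):

```python
# Simplified model of how the agent picks the image (illustration only):
# the Task's own container wins; agent.default_docker is just a fallback.
def resolve_container(task_container, default_docker):
    """Return the docker image the agent will use for a task."""
    return task_container or default_docker

# A fresh task with no container set falls back to the default:
print(resolve_container(None, "nvidia/cuda:12.2.0-runtime-ubuntu22.04"))
# A cloned task keeps the container recorded on the original task,
# so the new default_docker is ignored:
print(resolve_container("nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
                        "nvidia/cuda:12.2.0-runtime-ubuntu22.04"))
```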

  
  
Posted one year ago

Oh, thank you! That's definitely the case.

I'll try setting some of the SDK parameters you mentioned in my other thread as well :)

  
  
Posted one year ago


@AgitatedDove14

Hi!

I guess all the problems are resolved.

First one: Task parameters, for sure.

Second one: I looked into our ML engineer's code, and they were using boto3 to download the S3 data rather than the ClearML API. I suspect that's the problem, but I'll only know for sure tomorrow.
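If it is the boto3 path, one quick fix to try (a sketch with placeholder values, relying on boto3's default credential chain, which reads these environment variables) is exporting the same keys before the download code runs:

```python
import os

# boto3's default credential chain checks these environment variables,
# so setting them lets existing boto3 code find credentials without an
# ~/.aws/credentials file. The values here are placeholders.
os.environ["AWS_ACCESS_KEY_ID"] = "***"
os.environ["AWS_SECRET_ACCESS_KEY"] = "***"
os.environ["AWS_DEFAULT_REGION"] = "eu-west-1"
```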

Third one: I've copied the configuration in the UI, and for some reason, it worked 🤷‍♀️

Thank you for your involvement!

  
  
Posted one year ago

Glad to hear! 🎉

  
  
Posted one year ago