Answered
Hi everyone, I have some questions regarding ClearML aws_autoscaler.py.

Hi everyone,
I have some questions regarding the ClearML aws_autoscaler.py.

First one:

On the AWS machine, the agent runs with this command:
python -m clearml_agent --config-file /root/clearml.conf daemon --queue aws4gpu --docker nvidia/cuda:12.2.0-runtime-ubuntu22.04

However, the container spawned is:
805e06f198e8 nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04 "bash -c 'echo 'Bina…'" 20 seconds ago Up Less than a second sweet_carson

It's the previous container I used for the task. I have since changed it in the configuration, but the change does not seem to apply.

My /root/clearml.conf configuration is:

agent.git_user = "username"
agent.git_pass = "ghp_***"

agent.git_user = "username"
agent.git_pass = "ghp_***"

agent.git_user = "username"
agent.git_pass = "ghp_***"

sdk {
  aws {
    s3 {
      key: "***"
      secret: "***"
    }
  }
  agent {
    default_docker: {
      image: "nvidia/cuda:12.2.0-runtime-ubuntu22.04"
    }
  }
}

Second one:

For some reason, I'm not getting my AWS credentials imported inside the task container. I'm getting this error while trying to download weights from S3:

"Error downloading *bucket*/*path*.pth Reason: Unable to locate credentials."

However, the ClearML conf inside the container (~/default_clearml.conf and /tmp/clearml.conf) contains this section:

"aws": {
  "s3": {
    "key": "***",
    "secret": "***",
    "region": "",
    "credentials": []
  }
}

If I exec into the container and create ~/.aws/credentials manually, it works fine. But that is not persistent.

Third one:

Regarding running aws_autoscaler as a service, I'm encountering an error upon launching:

2023-10-12 18:56:53,223 - clearml.auto_scaler - INFO - Up machines: defaultdict(<class 'int'>, {})
2023-10-12 18:56:53,223 - clearml.auto_scaler - INFO - Idle for 60.00 seconds
ClearML Monitor: GPU monitoring failed to get GPU reading, switching off GPU monitoring
Process terminated by the user
clearml_agent: ERROR: [Errno 2] No such file or directory: '/tmp/.clearmlagent_1_kk3f8gxg.tmp'

I've checked permissions - everything seems fine, and temp files are created inside the directory. But something seems to be missing.

My aws_autoscaler.yaml looks like this:

configurations:
  extra_clearml_conf: |
    sdk {
      aws {
        s3 {
          key: "***"
          secret: "***"
        }
      }
    }
    agent {
      default_docker: {
        image: "nvidia/cuda:12.2.0-runtime-ubuntu22.04"
      }
    }
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    aws4gpu:
    - - aws4gpu
      - 3
  resource_configurations:
    aws4gpu:
      ami_id: ami-***
      availability_zone: eu-west-1a
      ebs_device_name: /dev/sda1
      ebs_volume_size: 100
      ebs_volume_type: gp3
      instance_type: g4dn.4xlarge
      is_spot: true
      key_name: ***
      security_group_ids:
      - sg-***
hyper_params:
  cloud_credentials_key: ***
  cloud_credentials_region: eu-west-1
  cloud_credentials_secret: ***
  cloud_provider: ''
  default_docker_image: nvidia/cuda:12.2.0-runtime-ubuntu22.04
  git_pass: ghp_***
  git_user: ***
  max_idle_time_min: 2
  max_spin_up_time_min: 30
  polling_interval_time_min: 1
  workers_prefix: dynamic_worker

Any ideas? Thanks.

  
  
Posted 8 months ago

Answers 5


Hi @<1564785037834981376:profile|FrustratingBee69>

It's the previous container I've used for the task.

Note that what you are configuring is the default container: if the Task does not "request" a specific container, this is what the agent will use.
On the Task itself (see the Execution tab, under Container Image) you set the specific container for the Task. After you execute the Task on an agent, the agent records there the container it ended up using. This means that if you cloned a previous Task that was using the previous CUDA container, the cloned Task will use the same one, regardless of a different default container.
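
As a rough illustration of that precedence, here is a minimal Python sketch. `resolve_container` is a hypothetical helper that only models the rule described above; it is not the agent's actual code. `clone_with_new_image` assumes a recent `clearml` SDK in which `Task.set_base_docker` accepts a `docker_image` argument, and a configured ClearML environment.

```python
from typing import Optional


def resolve_container(task_image: Optional[str], agent_default: str) -> str:
    """Hypothetical model of the rule above (not ClearML's implementation).

    The image stored on the Task wins; the agent's default image is used
    only when the Task does not request one.
    """
    return task_image if task_image else agent_default


def clone_with_new_image(source_task_id: str, image: str, queue: str):
    """Clone a Task, override the container stored on the clone, enqueue it."""
    # Imported lazily: needs the clearml package and valid credentials.
    from clearml import Task

    cloned = Task.clone(source_task=source_task_id)
    # Assumption: this SDK version supports the docker_image keyword.
    cloned.set_base_docker(docker_image=image)
    Task.enqueue(cloned, queue_name=queue)
    return cloned
```

With the images from the question, `resolve_container("nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04", "nvidia/cuda:12.2.0-runtime-ubuntu22.04")` returns the 11.0.3 image, which matches the behavior seen with a cloned Task. Editing the Container Image field of the cloned Task in the UI before enqueuing should have the same effect as the second function.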

  
  
Posted 8 months ago

Oh, thank you! That's definitely the case.

I'll try to set some SDK parameters you mentioned in my other thread as well :)

  
  
Posted 8 months ago


@<1523701205467926528:profile|AgitatedDove14>

Hi!

I guess all the problems are resolved.

First one: Task parameters, for sure.

Second one: I've looked into our ML engineer's code, and they were using boto3 to download S3 data, not the ClearML API. I guess that's the problem, but I'll find out for sure only tomorrow.

Third one: I've copied the configuration in the UI, and for some reason, it worked 🤷‍♀️
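
On the second point, if boto3 does turn out to be the cause: boto3 resolves credentials through its own chain (environment variables, ~/.aws/credentials, instance role) and does not read clearml.conf. Below is a minimal sketch of passing the values explicitly. Both function names are hypothetical, and `s3_section` is assumed to be the `"aws" -> "s3"` dictionary shown in the question, already loaded by some other means.

```python
from typing import Any, Dict


def boto3_kwargs_from_clearml_s3(s3_section: Dict[str, Any]) -> Dict[str, str]:
    """Map the "aws.s3" section onto boto3 client keyword arguments.

    Empty values (such as region: "") are skipped so that boto3 can fall
    back to its own defaults for them.
    """
    mapping = {
        "key": "aws_access_key_id",
        "secret": "aws_secret_access_key",
        "region": "region_name",
    }
    return {
        boto_name: s3_section[conf_name]
        for conf_name, boto_name in mapping.items()
        if s3_section.get(conf_name)
    }


def download_with_boto3(s3_section: Dict[str, Any], bucket: str, key: str, dest: str) -> None:
    """Download one object using credentials taken from the ClearML section."""
    import boto3  # imported lazily: only needed when actually downloading

    client = boto3.client("s3", **boto3_kwargs_from_clearml_s3(s3_section))
    client.download_file(bucket, key, dest)
```

Another option would be to download through ClearML itself, for example `StorageManager.get_local_copy(remote_url="s3://...")`, which uses the credentials from clearml.conf.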

Thank you for your involvement!

  
  
Posted 8 months ago

Glad to hear! 🎉

  
  
Posted 8 months ago