TimelyPenguin76 That sounds amazing! Will there be a fallback mechanism as well? p3.2xlarge instances are often in short supply, so it would be nice to define one resource requirement as the first choice (e.g. p3.2xlarge) -> if not available -> fall back to another resource requirement (e.g. g4dn)
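Purely as an illustration of the behaviour being asked for here (the entry names below are made up, and the autoscaler is not documented to try resources in order of preference like this), the idea would be two resource entries, preferred one first:

"resource_configurations": {
    "p3_first_choice": { "instance_type": "p3.2xlarge", "availability_zone": "us-east-1a" },
    "g4dn_fallback": { "instance_type": "g4dn.xlarge", "availability_zone": "us-east-1a" }
}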
TimelyPenguin76 , no, I've only set the sdk.aws.s3.region = eu-central-1 param
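For reference, a minimal sketch of how that would look in clearml.conf (all other fields omitted; no key/secret set, relying on the instance's IAM profile instead):

sdk {
    aws {
        s3 {
            # only the region is configured, no explicit credentials
            region: "eu-central-1"
        }
    }
}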
There is no need to add creds on the machine, since the EC2 instance has an attached IAM profile that grants access to s3. Boto3 is able to retrieve the files from the s3 bucket
Why are the credentials required in the case where boto3 can figure them out itself within the EC2 instance?
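As a sanity check, this is roughly what already works on the instance with no explicit credentials, since boto3 falls back to its default credential chain (env vars, credentials file, and finally the IAM role via the instance metadata service). Bucket and key names are placeholders:

import boto3

# No access key / secret passed: boto3 resolves credentials on its own,
# including via the EC2 instance's attached IAM role.
s3 = boto3.client("s3", region_name="eu-central-1")

# "my-bucket" and "path/to/artifact.zip" are placeholder names.
s3.download_file("my-bucket", "path/to/artifact.zip", "/tmp/artifact.zip")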
SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?
Yea I really need that feature, I need to move away from keys/secrets to IAM roles
I am confused now because I see in the master branch, the clearml.conf file has the following section:
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
So it states that IAM role using metadata service should be supported, right?
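Presumably (not verified here against the installed server/SDK version), enabling it would look like this in clearml.conf:

sdk {
    aws {
        s3 {
            # let Boto3 resolve credentials itself: env vars,
            # credentials file, or the EC2 IAM role via the metadata service
            use_credentials_chain: true
        }
    }
}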
I will go for lunch actually, back in ~1h
Configuration:
{
    "resource_configurations": {
        "v100": {
            "instance_type": "g4dn.2xlarge",
            "availability_zone": "us-east-1a",
            "ami_id": "ami-05e329519be512f1b",
            "ebs_device_name": "/dev/sda1",
            "ebs_volume_size": 100,
            "ebs_volume_type": "gp3",
            "key_name": "key.name",
            "security_group_ids": [
                "sg-asd"
            ],
            "is_spot": false,
            "extra_configura...
Nevermind, I was able to make it work, but no idea how
with 1.1.1 I get: User aborted: stopping task (3)
no, at least not in clearml-server version 1.1.1-135 • 1.1.1 • 2.14
no, I think I could reproduce with multiple queues
AnxiousSeal95 The main reason for me to not use clearml-serving triton is the lack of documentation tbh. I am not sure how to make my PyTorch model run there
Hi CostlyOstrich36 ! no, I am running in venv mode
to pass secrets to each experiment
automatically promote models to be served from within clearml
Yes!
I am also interested in the clearml-serving part
Hi DilapidatedDucks58 , I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as:
Update the experiment status to stopped (if it is failed, you won't be able to re-enqueue it)
Set a parameter of that task to point to the latest checkpoint and load it (you can also infer it directly: I simply add a resume tag to the task, and check at runtime if this tag exists; if yes, I fetch the latest checkpoint of the task)
Use https://clea...
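A rough sketch of how those steps could look with the ClearML SDK (the task ID, queue name and tag name are placeholders, and the calls are from memory, so worth double-checking against the SDK docs):

from clearml import Task

# Controller side: prepare a stopped task for resumption and re-enqueue it.
prev = Task.get_task(task_id="<task-id>")    # placeholder task ID
prev.mark_stopped()                          # failed tasks cannot be re-enqueued
prev.add_tags(["resume"])
Task.enqueue(prev, queue_name="default")     # placeholder queue name

# Inside the experiment: resume from the latest checkpoint if tagged.
task = Task.current_task()
if "resume" in (task.get_tags() or []):
    output_models = task.models["output"]
    if output_models:
        checkpoint_path = output_models[-1].get_local_copy()
        # ... load the checkpoint with your training framework ...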
yes but they are in plain text and I would like to avoid that
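One generic workaround (not a built-in ClearML feature, and the variable name is a placeholder): keep the secret out of the task's plain-text parameters entirely and read it from an environment variable that is set on the worker/agent machine:

import os

# The secret never appears in the task's stored (plain-text) parameters;
# it is injected into the worker's environment instead.
api_token = os.environ.get("MY_API_TOKEN")   # placeholder variable name
if api_token is None:
    raise RuntimeError("MY_API_TOKEN is not set on this worker")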
That would be amazing!