GCP AutoScaler limits not working correctly?

Hi there,

I have encountered some unexpected behaviour with the GCP Autoscaler.

The AutoScaler does not appear to be respecting the limit I set (a maximum of 12 instances spun up at one time). Please see the attached screenshot.

The spin-up of the instances was triggered by me adding 21 tasks to the gcp-cpu-e2-highmem-4-ondemand queue at the same time. I have attached the relevant logging from this morning in a file.

Has anyone experienced anything similar in the past? Is there any way I can prevent this on my side, or is this a bug in the ClearML Autoscaler?
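For reference, a rough sketch of how the live worker count could be checked against the limit (the `count_workers_with_prefix` helper is just illustrative, and the `APIClient` usage is an assumption about the ClearML SDK; the `dynamic_gcp_cpu` prefix is the `workers_prefix` from my autoscaler configuration):

```python
# Rough sketch: count ClearML workers whose id starts with the autoscaler's
# workers_prefix ("dynamic_gcp_cpu" in my configuration).

def count_workers_with_prefix(worker_ids, prefix):
    """Return how many worker ids start with the given autoscaler prefix."""
    return sum(1 for wid in worker_ids if wid.startswith(prefix))

if __name__ == "__main__":
    # Assumed SDK usage; needs a clearml.conf pointing at the server.
    from clearml.backend_api.session.client import APIClient

    client = APIClient()
    ids = [w.id for w in client.workers.get_all()]
    print(count_workers_with_prefix(ids, "dynamic_gcp_cpu"))  # expect <= 12
```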

Cheers,
James
[screenshot attached]

  
  
Posted one year ago

Answers 7


Apologies for the delay.

I have obfuscated the private information with XXX. Let me know if you think any of it is relevant.

{
  "gcp_project_id": "XXX",
  "gcp_zone": "XXX",
  "subnetwork": "XXX",
  "gcp_credentials": "{\n  \"type\": \"service_account\",\n  \"project_id\": \"XXX\",\n  \"private_key_id\": \"XXX\",\n  \"private_key\": \"XXX\",\n  \"client_id\": \"XXX\",\n  \"auth_uri\": \"XXX\",\n  \"token_uri\": \"XXX\",\n  \"auth_provider_x509_cert_url\": \"XXX\",\n  \"client_x509_cert_url\": \"XXX\",\n  \"universe_domain\": \"XXX\"\n}",
  "git_user": "XXX",
  "git_pass": "XXX",
  "default_docker_image": "XXX",
  "instance_queue_list": [
    {
      "resource_name": "gcp-cpu-e2-highmem-4-ondemand",
      "machine_type": "e2-highmem-4",
      "cpu_only": true,
      "gpu_type": "nvidia-tesla-a100",
      "gpu_count": 0,
      "preemptible": false,
      "regular_instance_rollback": false,
      "regular_instance_rollback_timeout": 10,
      "spot_instance_blackout_period": 0,
      "num_instances": 12,
      "queue_name": "gcp-cpu-e2-highmem-4-ondemand",
      "source_image": "projects/deeplearning-platform-release/global/images/common-cpu-v20231105-ubuntu-2004-py310",
      "disk_size_gb": 100,
      "service_account_email": "default"
    },
    {
      "resource_name": "gcp-cpu-e2-medium-ondemand",
      "machine_type": "e2-medium",
      "cpu_only": true,
      "gpu_type": null,
      "gpu_count": 0,
      "preemptible": false,
      "regular_instance_rollback": false,
      "regular_instance_rollback_timeout": 10,
      "spot_instance_blackout_period": 0,
      "num_instances": 10,
      "queue_name": "gcp-cpu-e2-medium-ondemand",
      "source_image": "projects/deeplearning-platform-release/global/images/common-cpu-v20231105-ubuntu-2004-py310",
      "disk_size_gb": 50,
      "service_account_email": "default"
    }
  ],
  "name": "CPU Autoscaler",
  "max_idle_time_min": 60,
  "workers_prefix": "dynamic_gcp_cpu",
  "polling_interval_time_min": "1",
  "alert_on_multiple_workers_per_task": true,
  "exclude_bashrc": false,
  "custom_script": "XXX",
  "extra_clearml_conf": "agent.extra_docker_arguments: [\"--ipc=host\", ]\n\nsdk.development.log_os_environments: [\"AWS_\"]\n\nagent.apply_environment: true\n\nenvironment {\n    XXX\n    XXX\n}\n\n\nsdk {\n    aws {\n        s3 {\n            credentials: [\n                {\n                    bucket: \"XXX\"\n                    key: \"XXX\"\n                    secret: \"XXX\"\n                }\n            ]\n        }\n        boto3 {\n            pool_connections: 512\n            max_multipart_concurrency: 16\n        }\n    }\n \n    development {\n        worker {\n            report_event_flush_threshold: 1000\n        }\n    }\n}\n\nagent {\n    default_docker: {\n        arguments: [\"--shm-size\", \"12G\", \"-p\", \"5000:5000\"]\n    }\n}"
}
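For completeness, a small sketch that pulls the per-resource num_instances caps out of an export like the one above (the EXPORT sample is abbreviated and illustrative, not the full file):

```python
import json

def instance_caps(config_json):
    """Map each resource_name to its num_instances cap from an autoscaler export."""
    config = json.loads(config_json)
    return {r["resource_name"]: r["num_instances"] for r in config["instance_queue_list"]}

# Abbreviated version of the export above (credentials and unrelated keys omitted).
EXPORT = json.dumps({
    "instance_queue_list": [
        {"resource_name": "gcp-cpu-e2-highmem-4-ondemand", "num_instances": 12},
        {"resource_name": "gcp-cpu-e2-medium-ondemand", "num_instances": 10},
    ]
})

caps = instance_caps(EXPORT)
print(caps["gcp-cpu-e2-highmem-4-ondemand"])  # -> 12
```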
  
  
Posted one year ago

Hi @<1529271085315395584:profile|AmusedCat74> , thanks for reporting this, I'll ask the ClearML team to look into this

  
  
Posted one year ago

Let me know if you need additional information.
[screenshot attached]

  
  
Posted one year ago

@<1529271085315395584:profile|AmusedCat74> can you share the autoscaler configuration?

  
  
Posted one year ago

Cheers 👍

  
  
Posted one year ago

@<1529271085315395584:profile|AmusedCat74> ?

  
  
Posted one year ago

I see you have two resources defined there - can you simply click the triple-dot icon on the autoscaler instance and choose "Export Configuration", then share it here? (Please make sure to remove any credentials from the generated file.)

  
  
Posted one year ago