GCP Autoscaler Limits Not Working Correctly?

Answered

Hi there,

I have encountered some unexpected behaviour with the GCP Autoscaler.

The autoscaler does not appear to respect the limit I configured (a maximum of 12 instances running at one time). Please see the attached screenshot.

The instances were spun up after I added 21 tasks to the queue gcp-cpu-e2-highmem-4-ondemand at the same time. I have attached the relevant logs from this morning in a file.
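To make the expectation concrete, here is a minimal Python sketch of the capping arithmetic implied above. It is not the ClearML autoscaler's actual logic; the function name and its parameters are hypothetical and only illustrate what "a maximum of 12 instances" would mean for 21 queued tasks.

```python
def expected_instances_to_launch(pending_tasks: int,
                                 running_instances: int,
                                 max_instances: int) -> int:
    """Illustrative only: how many new instances a capped scaler would start.

    This is an assumption about the intended behaviour, not ClearML code.
    """
    # Slots still available under the configured limit (never negative).
    free_slots = max(max_instances - running_instances, 0)
    # Start one instance per pending task, but never exceed the free slots.
    return min(pending_tasks, free_slots)


# 21 tasks queued at once, nothing running yet, limit of 12 instances.
print(expected_instances_to_launch(pending_tasks=21,
                                   running_instances=0,
                                   max_instances=12))  # prints 12
```

Under this reading, any instance count above 12 for this resource would be unexpected.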

Has anyone experienced anything similar in the past? Is there any way I can prevent this on my side, or is this a bug in the ClearML Autoscaler?

Cheers,
James

  				
Posted one year ago by AmusedCat74

Answers 7

Hi AmusedCat74, thanks for reporting this. I'll ask the ClearML team to look into it.

  				
Posted one year ago by CostlyOstrich36

Cheers 👍

  				
Posted one year ago by AmusedCat74

AmusedCat74, can you share the autoscaler configuration?

  				
Posted one year ago by SuccessfulKoala55

Let me know if you need additional information.

  				
Posted one year ago by AmusedCat74

I see you have two resources defined there. Can you click on the triple-dot icon on the autoscaler instance, choose "Export Configuration", and then share the result here? (Please make sure to remove any credentials from the generated file.)
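A minimal Python sketch of that redaction step, for anyone doing it by script rather than by hand. The set of sensitive keys is an assumption based on the field names that appear in the export shared in this thread; the `redact` helper and the sample values are hypothetical, so review the output before posting it.

```python
import json

# Assumed list of top-level keys that may hold secrets in an exported
# autoscaler configuration (taken from the field names seen in this thread).
SENSITIVE_KEYS = {
    "gcp_credentials",
    "git_user",
    "git_pass",
    "custom_script",
    "extra_clearml_conf",
}


def redact(config: dict, sensitive_keys=SENSITIVE_KEYS, placeholder="XXX") -> dict:
    """Return a copy of config with sensitive top-level values replaced."""
    return {
        key: (placeholder if key in sensitive_keys else value)
        for key, value in config.items()
    }


# Sample input with made-up values, standing in for the exported file.
exported = json.loads('{"name": "CPU Autoscaler", "git_pass": "not-a-real-secret"}')
print(json.dumps(redact(exported), indent=2))
```

Replacing whole values is deliberately blunt: fields such as `extra_clearml_conf` can embed storage keys inside a larger string, so masking the entire value is safer than trying to pick out individual secrets.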

  				
Posted one year ago by SuccessfulKoala55

AmusedCat74 ?

  				
Posted one year ago by SuccessfulKoala55

Apologies for the delay.

I have obfuscated the private information with XXX. Let me know if you think any of it is relevant.

{
  "gcp_project_id": "XXX",
  "gcp_zone": "XXX",
  "subnetwork": "XXX",
  "gcp_credentials": "{\n  \"type\": \"service_account\",\n  \"project_id\": \"XXX\",\n  \"private_key_id\": \"XXX\",\n  \"private_key\": \"XXX\",\n  \"client_id\": \"XXX\",\n  \"auth_uri\": \"XXX\",\n  \"token_uri\": \"XXX\",\n  \"auth_provider_x509_cert_url\": \"XXX\",\n  \"client_x509_cert_url\": \"XXX\",\n  \"universe_domain\": \"XXX\"\n}",
  "git_user": "XXX",
  "git_pass": "XXX",
  "default_docker_image": "XXX",
  "instance_queue_list": [
    {
      "resource_name": "gcp-cpu-e2-highmem-4-ondemand",
      "machine_type": "e2-highmem-4",
      "cpu_only": true,
      "gpu_type": "nvidia-tesla-a100",
      "gpu_count": 0,
      "preemptible": false,
      "regular_instance_rollback": false,
      "regular_instance_rollback_timeout": 10,
      "spot_instance_blackout_period": 0,
      "num_instances": 12,
      "queue_name": "gcp-cpu-e2-highmem-4-ondemand",
      "source_image": "projects/deeplearning-platform-release/global/images/common-cpu-v20231105-ubuntu-2004-py310",
      "disk_size_gb": 100,
      "service_account_email": "default"
    },
    {
      "resource_name": "gcp-cpu-e2-medium-ondemand",
      "machine_type": "e2-medium",
      "cpu_only": true,
      "gpu_type": null,
      "gpu_count": 0,
      "preemptible": false,
      "regular_instance_rollback": false,
      "regular_instance_rollback_timeout": 10,
      "spot_instance_blackout_period": 0,
      "num_instances": 10,
      "queue_name": "gcp-cpu-e2-medium-ondemand",
      "source_image": "projects/deeplearning-platform-release/global/images/common-cpu-v20231105-ubuntu-2004-py310",
      "disk_size_gb": 50,
      "service_account_email": "default"
    }
  ],
  "name": "CPU Autoscaler",
  "max_idle_time_min": 60,
  "workers_prefix": "dynamic_gcp_cpu",
  "polling_interval_time_min": "1",
  "alert_on_multiple_workers_per_task": true,
  "exclude_bashrc": false,
  "custom_script": "XXX",
  "extra_clearml_conf": "agent.extra_docker_arguments: [\"--ipc=host\", ]\n\nsdk.development.log_os_environments: [\"AWS_\"]\n\nagent.apply_environment: true\n\nenvironment {\n    XXX\n    XXX\n}\n\n\nsdk {\n    aws {\n        s3 {\n            credentials: [\n                {\n                    bucket: \"XXX\"\n                    key: \"XXX\"\n                    secret: \"XXX\"\n                }\n            ]\n        }\n        boto3 {\n            pool_connections: 512\n            max_multipart_concurrency: 16\n        }\n    }\n \n    development {\n        worker {\n            report_event_flush_threshold: 1000\n        }\n    }\n}\n\nagent {\n    default_docker: {\n        arguments: [\"--shm-size\", \"12G\", \"-p\", \"5000:5000\"]\n    }\n}"
}
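When comparing the screenshot against this configuration, it may help to list the configured limits per queue. The Python sketch below does that for a trimmed copy of the export, keeping only the `queue_name` and `num_instances` fields. It assumes, based on the field name alone, that `num_instances` is a per-resource cap; the thread does not confirm how the autoscaler enforces it, and `summarize_limits` is a hypothetical helper.

```python
def summarize_limits(config: dict) -> dict:
    """Map each queue name to the num_instances value configured for it."""
    return {
        resource["queue_name"]: resource["num_instances"]
        for resource in config.get("instance_queue_list", [])
    }


# Trimmed copy of the exported configuration: only the fields needed here.
config = {
    "instance_queue_list": [
        {"queue_name": "gcp-cpu-e2-highmem-4-ondemand", "num_instances": 12},
        {"queue_name": "gcp-cpu-e2-medium-ondemand", "num_instances": 10},
    ]
}

limits = summarize_limits(config)
for queue, limit in limits.items():
    print(f"{queue}: up to {limit} instances")

# Combined total across both resources, for comparison with the observed count.
print("combined:", sum(limits.values()))  # prints combined: 22
```

If the observed count exceeded 12 for the e2-highmem-4 resource alone, the per-resource limit was not respected; if it only approached the combined total across both resources, the instances may belong to different resources.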

  				
Posted one year ago by AmusedCat74
