Answered
Hi there. Does the AWS Autoscaler (PRO account, Web app) work?

I'm a bit confused about how it should work. I'm able to launch an EC2 instance via the AWS Autoscaler, but it doesn't process any queue items. Sometimes it terminates and spins up instances in a loop. In the logs I see the following message:

2024-01-09 17:48:38,707 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'dynamic_aws:clearml-autoscaler:g4dn.4xlarge:i-03c82eb66e06bb7ff'

Full example of logs:

2024-01-09 20:47:39
2024-01-09 17:47:37,176 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'clearml_gpu_compute_queue'
2024-01-09 17:47:37,520 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-03c82eb66e06bb7ff (regular)
2024-01-09 17:47:37,706 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2024-01-09 20:48:39

2024-01-09 17:48:38,205 - clearml.Auto-Scaler - INFO - Spinning down stuck worker dynamic_aws:clearml-autoscaler:g4dn.4xlarge:i-03c82eb66e06bb7ff from stale_spun
2024-01-09 17:48:38,707 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'dynamic_aws:clearml-autoscaler:g4dn.4xlarge:i-03c82eb66e06bb7ff'
2024-01-09 17:48:39,075 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'clearml_gpu_compute_queue'
2024-01-09 17:48:39,080 - clearml.Auto-Scaler - INFO - Spinning new instance resource='clearml-autoscaler', prefix='dynamic_aws', queue='clearml_gpu_compute_queue'
2024-01-09 17:48:39,080 - clearml.Auto-Scaler - INFO - Spinning up new instance in sunbnet subnet-03083b363bada22f4
2024-01-09 17:48:39,151 - clearml.Auto-Scaler - INFO - Creating regular instance for resource clearml-autoscaler
2024-01-09 17:48:39,459 - clearml.Auto-Scaler - INFO - --- Cloud instances (0): 
2024-01-09 17:48:39,667 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2024-01-09 20:48:45

2024-01-09 17:48:40,541 - clearml.Auto-Scaler - INFO - New instance i-0e37b32eea9052cde listening to clearml_gpu_compute_queue queue
2024-01-09 20:49:45

2024-01-09 17:49:40,415 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'clearml_gpu_compute_queue'
2024-01-09 17:49:40,775 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-0e37b32eea9052cde (regular)
2024-01-09 17:49:41,083 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
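
To make the loop in the logs above easier to spot, here is a minimal sketch that scans autoscaler log lines for stuck-worker terminations and replacement launches. The regular expressions are derived only from the messages quoted in this post; other autoscaler versions may word them differently. The function name `summarize_autoscaler_log` and the "looping" heuristic are illustrative assumptions, not part of ClearML.

```python
import re

# Patterns copied from the log messages quoted in this thread (assumption:
# other autoscaler versions may phrase these lines differently).
SPUN_DOWN = re.compile(r"Stuck worker spun down: '(?P<worker>[^']+)'")
NEW_INSTANCE = re.compile(
    r"New instance (?P<instance>i-[0-9a-f]+) listening to (?P<queue>\S+) queue"
)


def summarize_autoscaler_log(lines):
    """Collect instance IDs that were spun down as stuck and IDs that were launched."""
    stuck = []
    launched = []
    for line in lines:
        match = SPUN_DOWN.search(line)
        if match:
            # Worker IDs in the log look like prefix:resource:instance_type:instance_id
            stuck.append(match.group("worker").rsplit(":", 1)[-1])
            continue
        match = NEW_INSTANCE.search(line)
        if match:
            launched.append(match.group("instance"))
    return {
        "stuck_spun_down": stuck,
        "launched": launched,
        # Rough heuristic: a stuck termination plus a new launch suggests a loop.
        "looping": bool(stuck) and bool(launched),
    }


sample = [
    "2024-01-09 17:48:38,707 - clearml.Auto-Scaler - INFO - Stuck worker spun down: "
    "'dynamic_aws:clearml-autoscaler:g4dn.4xlarge:i-03c82eb66e06bb7ff'",
    "2024-01-09 17:48:40,541 - clearml.Auto-Scaler - INFO - New instance "
    "i-0e37b32eea9052cde listening to clearml_gpu_compute_queue queue",
]
summary = summarize_autoscaler_log(sample)
```

If `summary["looping"]` is true, the instances are most likely starting but the agent on them never registers as a worker, so the instance's own startup log is the next place to look.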

My config is:

{
  "cloud_credentials_region": "us-west-2",
  "cloud_credentials_key": "***",
  "cloud_credentials_secret": "***",
  "git_user": "***",
  "git_pass": "***",
  "max_idle_time_min": 15,
  "workers_prefix": "dynamic_aws",
  "polling_interval_time_min": "1",
  "default_docker_image": "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
  "instance_queue_list": [
    {
      "resource_name": "clearml-autoscaler",
      "instance_type": "t2.large",
      "cpu_only": true,
      "is_spot": false,
      "regular_instance_rollback": false,
      "regular_instance_rollback_timeout": 10,
      "spot_instance_blackout_period": null,
      "availability_zone": null,
      "ami_id": "ami-0025f0db847eb6254",
      "num_instances": 1,
      "queue_name": "clearml_gpu_compute_queue",
      "tags": "Project=ds-ml-infrastructure,Environment=development,ClearML=true",
      "ebs_device_name": "/dev/sda1",
      "ebs_volume_size": 50,
      "ebs_volume_type": "gp3",
      "key_name": null,
      "security_group_ids": "sg-0269cba26b72f732d",
      "subnet_id": "subnet-03a9c255dba0ec2e3"
    }
  ],
  "use_iam_profile": false,
  "iam_arn": null,
  "iam_name": null,
  "name": "ClearML Autoscaler",
  "alert_on_multiple_workers_per_task": true,
  "exclude_bashrc": false,
  "custom_script": "",
  "extra_clearml_conf": "agent.extra_docker_arguments: [\"--ipc=host\", ]\nagent.package_manager.type = pip\nagent.package_manager.system_site_packages = true\n"
}

The subnet is public and the security group allows SSH connections on port 22, but the created instance is not assigned a public IP, so SSH doesn't work. Does the AWS Autoscaler need an SSH connection?

I don't see any errors in the instance's logs. Queues aren't processed, and instances are terminated and created in a loop. I've tried the "ClearML GPU Compute" autoscaler and it works, but my goal is to run my own instances on AWS. Does anybody have an idea what I'm doing wrong? Maybe @<1523701070390366208:profile|CostlyOstrich36> 🙂

  
  
Posted 9 months ago
Answers 4


Thank you @<1523701087100473344:profile|SuccessfulKoala55> for the ideas. The AMI is ami-0025f0db847eb6254; it may not have docker, yeah. I will check that.
My main assumption is that the AWS Autoscaler can't establish an SSH connection with the EC2 instance, because the instance is created without a public IP address (not sure if that's the cause).

  
  
Posted 9 months ago

The Autoscaler does not use SSH. However, the docker service is required on that instance, and I assume this is the reason why the instance is not starting up correctly

  
  
Posted 9 months ago

The issue was in my Terraform VPC configuration: I had missed enable_nat_gateway = true, so the EC2 instance could not even update its OS packages. The "Instance log files" on the AWS Autoscaler page pointed me to the issue.
SSH connection is not required, yep.
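
For anyone hitting the same symptom, a quick way to confirm missing outbound connectivity is to try opening a TCP connection from the instance to a host it needs. This is a minimal sketch using only the Python standard library; the helper name `can_reach` is hypothetical, and which hosts matter depends on your setup (the ClearML API server and a package index are assumptions here, not something stated above).

```python
import socket


def can_reach(host, port=443, timeout=5.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port succeeds within the timeout.

    `connect` is injectable so the check can be exercised without real network access.
    """
    try:
        conn = connect((host, port), timeout)
    except OSError:
        # Covers DNS failures, timeouts, and refused or unroutable connections.
        return False
    conn.close()
    return True


def check_hosts(hosts, connect=socket.create_connection):
    """Map each host name to whether an outbound HTTPS connection could be opened."""
    return {host: can_reach(host, connect=connect) for host in hosts}
```

Run on the instance itself, something like `check_hosts(["api.clear.ml", "pypi.org"])` returning `False` for every host would point to a routing problem such as a missing NAT gateway rather than an agent or docker issue.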

  
  
Posted 9 months ago

Hi @<1572032849320611840:profile|HurtRaccoon43> , I would bet that there's some issue when the instances start running. The easiest thing to do is grab the system log from the AWS console and share it here. Which AMI are you using? Is it possible this AMI does not have docker preinstalled?

  
  
Posted 9 months ago