Hi there. Does the AWS Autoscaler (PRO account, Web app) work?
I'm a bit confused about how it should work. I can launch an EC2 instance via the AWS Autoscaler, but it doesn't process any queue items. Sometimes the autoscaler terminates and spins up instances in a loop. In the logs I see the following message:
2024-01-09 17:48:38,707 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'dynamic_aws:clearml-autoscaler:g4dn.4xlarge:i-03c82eb66e06bb7ff'
Full example of logs:
2024-01-09 20:47:39
2024-01-09 17:47:37,176 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'clearml_gpu_compute_queue'
2024-01-09 17:47:37,520 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-03c82eb66e06bb7ff (regular)
2024-01-09 17:47:37,706 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2024-01-09 20:48:39
2024-01-09 17:48:38,205 - clearml.Auto-Scaler - INFO - Spinning down stuck worker dynamic_aws:clearml-autoscaler:g4dn.4xlarge:i-03c82eb66e06bb7ff from stale_spun
2024-01-09 17:48:38,707 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'dynamic_aws:clearml-autoscaler:g4dn.4xlarge:i-03c82eb66e06bb7ff'
2024-01-09 17:48:39,075 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'clearml_gpu_compute_queue'
2024-01-09 17:48:39,080 - clearml.Auto-Scaler - INFO - Spinning new instance resource='clearml-autoscaler', prefix='dynamic_aws', queue='clearml_gpu_compute_queue'
2024-01-09 17:48:39,080 - clearml.Auto-Scaler - INFO - Spinning up new instance in sunbnet subnet-03083b363bada22f4
2024-01-09 17:48:39,151 - clearml.Auto-Scaler - INFO - Creating regular instance for resource clearml-autoscaler
2024-01-09 17:48:39,459 - clearml.Auto-Scaler - INFO - --- Cloud instances (0):
2024-01-09 17:48:39,667 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2024-01-09 20:48:45
2024-01-09 17:48:40,541 - clearml.Auto-Scaler - INFO - New instance i-0e37b32eea9052cde listening to clearml_gpu_compute_queue queue
2024-01-09 20:49:45
2024-01-09 17:49:40,415 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'clearml_gpu_compute_queue'
2024-01-09 17:49:40,775 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-0e37b32eea9052cde (regular)
2024-01-09 17:49:41,083 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
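To make the loop in a longer log easier to see, the excerpt can be summarized with a small script. This is only a sketch: it assumes the log lines keep exactly the format quoted above, and the names `summarize`, `LOG_LINE` and `INSTANCE_ID` are made up for illustration, not part of ClearML.

```python
import re

# Matches autoscaler lines in the format quoted above, e.g.
# "2024-01-09 17:48:38,707 - clearml.Auto-Scaler - INFO - <message>"
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} - clearml\.Auto-Scaler - INFO - (?P<msg>.*)$"
)
# EC2 instance IDs look like "i-" followed by hex digits.
INSTANCE_ID = re.compile(r"\bi-[0-9a-f]{8,17}\b")


def summarize(log_text):
    """Collect instance IDs that were spun down as stuck and IDs that were started."""
    stuck, started = [], []
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip UI timestamps and anything in another format
        msg = match.group("msg")
        ids = INSTANCE_ID.findall(msg)
        if not ids:
            continue
        if msg.startswith("Stuck worker spun down"):
            stuck.append(ids[0])
        elif msg.startswith("New instance"):
            started.append(ids[0])
    return {"stuck_spun_down": stuck, "started": started}
```

If the same queue keeps producing pairs of "stuck spun down" and "started" entries, the instances are most likely never registering as workers before the autoscaler gives up on them.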
My config is:
{
  "cloud_credentials_region": "us-west-2",
  "cloud_credentials_key": "***",
  "cloud_credentials_secret": "***",
  "git_user": "***",
  "git_pass": "***",
  "max_idle_time_min": 15,
  "workers_prefix": "dynamic_aws",
  "polling_interval_time_min": "1",
  "default_docker_image": "nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04",
  "instance_queue_list": [
    {
      "resource_name": "clearml-autoscaler",
      "instance_type": "t2.large",
      "cpu_only": true,
      "is_spot": false,
      "regular_instance_rollback": false,
      "regular_instance_rollback_timeout": 10,
      "spot_instance_blackout_period": null,
      "availability_zone": null,
      "ami_id": "ami-0025f0db847eb6254",
      "num_instances": 1,
      "queue_name": "clearml_gpu_compute_queue",
      "tags": "Project=ds-ml-infrastructure,Environment=development,ClearML=true",
      "ebs_device_name": "/dev/sda1",
      "ebs_volume_size": 50,
      "ebs_volume_type": "gp3",
      "key_name": null,
      "security_group_ids": "sg-0269cba26b72f732d",
      "subnet_id": "subnet-03a9c255dba0ec2e3"
    }
  ],
  "use_iam_profile": false,
  "iam_arn": null,
  "iam_name": null,
  "name": "ClearML Autoscaler",
  "alert_on_multiple_workers_per_task": true,
  "exclude_bashrc": false,
  "custom_script": "",
  "extra_clearml_conf": "agent.extra_docker_arguments: [\"--ipc=host\", ]\nagent.package_manager.type = pip\nagent.package_manager.system_site_packages = true\n"
}
The subnet is public and the security group allows SSH connections on port 22, but the created instance isn't assigned a public IP, so SSH doesn't work. Does the AWS Autoscaler need an SSH connection?
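One way to check whether the subnet is set to auto-assign public IPs is to read its `MapPublicIpOnLaunch` attribute with boto3. This is a minimal sketch, assuming boto3 is installed and AWS credentials are available in the environment; the helper names `subnet_assigns_public_ip` and `check_subnet` are made up for illustration. It only reports the subnet setting and says nothing about whether the autoscaler itself needs SSH.

```python
def subnet_assigns_public_ip(describe_subnets_response):
    """Return True if the first subnet in an EC2 describe_subnets response
    has auto-assignment of public IPv4 addresses enabled."""
    subnets = describe_subnets_response.get("Subnets", [])
    if not subnets:
        raise ValueError("no subnet found in response")
    return bool(subnets[0].get("MapPublicIpOnLaunch", False))


def check_subnet(subnet_id, region):
    """Query AWS for the subnet and report its public IP auto-assign setting."""
    import boto3  # imported here so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_subnets(SubnetIds=[subnet_id])
    return subnet_assigns_public_ip(response)
```

For example, `check_subnet("subnet-03a9c255dba0ec2e3", "us-west-2")` would return `False` if auto-assign is disabled on the subnet from the config above, which would be consistent with the instance starting without a public IP.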
I don't see any errors in the instance's logs. Queues aren't processed, and instances are terminated and created in a loop. I've tried the "ClearML GPU Compute" autoscaler and it works, but my goal is to run my own instances on AWS. Does anybody have an idea what I'm doing wrong? Maybe @<1523701070390366208:profile|CostlyOstrich36> 🙂