Hello community. I'd like to try the AWS autoscaler (I'd actually prefer to try the GCP one, but I think it's broken or, at least, I've failed to make it work so far). I can't find documentation on what permissions would be required from an AWS sub-account.


Just updating here that I got the AWS autoscaler working with CostlyOstrich36's generous help 🎉

I thought I'd share some details here in case others experience similar difficulties.

With regard to permissions, this is the list of actions that the autoscaler uses, which your AWS account would need to permit:
- GetConsoleOutput
- RequestSpotInstances
- DescribeSpotInstanceRequests
- RunInstances
- DescribeInstances
- TerminateInstances

The instance image ami-04c0416d6bd8e4b1f is no longer available to new users. You will need different images that match:
- The machine architectures of your chosen machines
- The region you specified in your AWS credentials
- The region that you specified in your resource definitions (I think that the AWS credentials region and this one have to match)

Otherwise you'll get the "image ... does not exist" errors or an "image doesn't match the instance architecture" error (once the image is found).
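For anyone turning the action list above into an IAM policy, here is a minimal sketch. It assumes these are all EC2 actions (hence the `ec2:` prefix) and uses `"Resource": "*"` for simplicity; the function name `build_autoscaler_policy` is my own, and you may well want to scope the resources more tightly for your account.

```python
import json

# Actions from the list above. The "ec2:" prefix is an assumption on my part,
# based on these being EC2 API calls.
AUTOSCALER_ACTIONS = (
    "GetConsoleOutput",
    "RequestSpotInstances",
    "DescribeSpotInstanceRequests",
    "RunInstances",
    "DescribeInstances",
    "TerminateInstances",
)


def build_autoscaler_policy(actions=AUTOSCALER_ACTIONS):
    """Return an IAM policy document (as a dict) allowing the given EC2 actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"ec2:{name}" for name in actions],
                # Assumption: no resource scoping; tighten this if you can.
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    # Print JSON that can be pasted into the IAM policy editor.
    print(json.dumps(build_autoscaler_policy(), indent=2))
```

The printed JSON can be attached to the sub-account's user or role; it only covers the actions listed in this post, so other setups may need more.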

As I understand it, when using pipelines, you'd probably want CPU-only instances for the services queue and GPU-sporting instances for the default queue (or any queue that runs pipeline components). This means different machine architectures and different instance images as well!

The default docker image that currently appears in the definition popups, nvidia/cuda:10.2-runtime-ubuntu18.04, is somewhat outdated. Aside from possible problems with current packages that use GPUs, it uses Python 3.6, and this led to package install failures when the ClearML agent was brought up within the docker container and started installing Python packages.

Here is a configuration (as reported on the autoscaler task's configuration tab under resource_configurations) that worked for me with AWS credentials that specify us-east-1 as the region:

[
  {
    "resource_name": "aws_default",
    "instance_type": "g4dn.2xlarge",
    "cpu_only": false,
    "is_spot": false,
    "regular_instance_rollback": false,
    "regular_instance_rollback_timeout": null,
    "availability_zone": "us-east-1a",
    "ami_id": "ami-003f25e6e2d2db8f1",
    "num_instances": 3,
    "queue_name": "default",
    "tags": "owner=lavi",
    "ebs_device_name": "/dev/sda1",
    "ebs_volume_size": 500,
    "ebs_volume_type": "gp3",
    "key_name": null,
    "security_group_ids": null,
    "subnet_id": null
  },
  {
    "resource_name": "aws_services",
    "instance_type": "m5.large",
    "cpu_only": true,
    "is_spot": false,
    "regular_instance_rollback": false,
    "regular_instance_rollback_timeout": null,
    "availability_zone": "us-east-1a",
    "ami_id": "ami-040d909ea4e56f8f3",
    "num_instances": 2,
    "queue_name": "services",
    "tags": "owner=lavi",
    "ebs_device_name": "/dev/sda1",
    "ebs_volume_size": 500,
    "ebs_volume_type": "gp3",
    "key_name": null,
    "security_group_ids": null,
    "subnet_id": null
  }
]

This was using the base docker image nvidia/cuda:11.2.2-runtime-ubuntu20.04
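Since a region mismatch was one of the things that bit me, here is a rough sketch of a sanity check you could run on the resource definitions before pasting them in. The function name `check_resources` is hypothetical (it is not part of ClearML), and it assumes an availability zone name starts with its region name, as in us-east-1a for us-east-1. It does not talk to AWS, so it cannot tell you whether an AMI actually exists or matches the instance architecture.

```python
def check_resources(resources, credentials_region):
    """Return a list of human-readable problems found in the resource definitions.

    `resources` is a list of dicts shaped like the resource_configurations above.
    """
    problems = []
    for res in resources:
        name = res.get("resource_name", "<unnamed>")
        zone = res.get("availability_zone") or ""
        # Assumption: a zone name is its region name plus a suffix.
        if not zone.startswith(credentials_region):
            problems.append(
                f"{name}: availability zone {zone!r} is not in region {credentials_region!r}"
            )
        if not res.get("ami_id"):
            problems.append(f"{name}: no ami_id set")
    return problems


# Trimmed-down versions of the two resources from the configuration above.
example_resources = [
    {
        "resource_name": "aws_default",
        "availability_zone": "us-east-1a",
        "ami_id": "ami-003f25e6e2d2db8f1",
    },
    {
        "resource_name": "aws_services",
        "availability_zone": "us-east-1a",
        "ami_id": "ami-040d909ea4e56f8f3",
    },
]

if __name__ == "__main__":
    for problem in check_resources(example_resources, "us-east-1"):
        print(problem)
```

With the credentials region set to us-east-1 the check reports nothing for these two resources; with a different region it flags both of them.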

Perhaps the defaults https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py should be updated?

  				
Posted 
	2 years ago

					PanickyMoth78
				