Just updating here that I got the AWS autoscaler working with CostlyOstrich36's generous help 🎉
I thought I'd share some details here in case others run into similar difficulties.
With regard to permissions, this is the list of actions the autoscaler uses, which your AWS account needs to permit: GetConsoleOutput, RequestSpotInstances, DescribeSpotInstanceRequests, RunInstances, DescribeInstances, TerminateInstances.
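For anyone who wants to turn that list into an IAM policy, here is a minimal sketch. It assumes all of these are EC2 actions (so they get the `ec2:` prefix) and it uses an unscoped `"Resource": "*"`, which is my assumption and not something the autoscaler requires. The `build_policy` helper name is made up for illustration.

```python
import json

# Actions from the list above; IAM expects them with the "ec2:" service prefix.
AUTOSCALER_ACTIONS = [
    "GetConsoleOutput",
    "RequestSpotInstances",
    "DescribeSpotInstanceRequests",
    "RunInstances",
    "DescribeInstances",
    "TerminateInstances",
]


def build_policy(actions):
    """Return an IAM policy document (as a dict) allowing the given EC2 actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [f"ec2:{name}" for name in actions],
                # Assumption: no resource scoping; tighten this as needed.
                "Resource": "*",
            }
        ],
    }


policy = build_policy(AUTOSCALER_ACTIONS)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console's JSON policy editor and attached to the user or role whose credentials the autoscaler uses.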
The instance image ami-04c0416d6bd8e4b1f is no longer available to new users. You will need different images that match:
- The machine architectures of your chosen instance types
- The region you specified in your AWS credentials
- The region you specified in your resource definitions (I think the AWS credentials region and this one have to match)
Otherwise you'll get "image ... does not exist" errors, or an "image doesn't match the instance architecture" error once the image is found.
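To check an image against an instance type before handing it to the autoscaler, something along these lines might help. This is a sketch, not part of ClearML: `check_ami` is a hypothetical helper, and it expects an object that behaves like a boto3 EC2 client, using its `describe_images` and `describe_instance_types` calls.

```python
def check_ami(ec2, ami_id, instance_type):
    """Return a list of problems (empty if the AMI fits the instance type).

    `ec2` is expected to behave like boto3.client("ec2", region_name=...).
    """
    # Images are looked up in the client's region only. Note that a real
    # boto3 client may raise a ClientError (InvalidAMIID.NotFound) for an
    # unknown image instead of returning an empty list.
    images = ec2.describe_images(ImageIds=[ami_id]).get("Images", [])
    if not images:
        return [f"image {ami_id} was not found in this region"]

    image_arch = images[0]["Architecture"]
    types = ec2.describe_instance_types(InstanceTypes=[instance_type])
    supported = types["InstanceTypes"][0]["ProcessorInfo"]["SupportedArchitectures"]

    if image_arch not in supported:
        return [
            f"image {ami_id} is {image_arch}, but {instance_type} "
            f"supports {', '.join(supported)}"
        ]
    return []


# Usage (needs boto3 and valid credentials, so it is left commented out):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-east-1")
# print(check_ami(ec2, "ami-003f25e6e2d2db8f1", "g4dn.2xlarge"))
```

Creating the client with the same region as your autoscaler credentials covers the region check as well, since an image from another region will simply not be found.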
As I understand it, when using pipelines, you'd probably want CPU-only instances for the services queue and GPU-sporting instances for the default queue (or any queue that runs pipeline components). This means different machine architectures and different instance images as well!
The default docker image that currently appears in the definition popups, nvidia/cuda:10.2-runtime-ubuntu18.04, is somewhat outdated. Aside from possible problems with current packages that use GPUs, it uses Python 3.6, and this led to package install failures when the ClearML agent was brought up within the docker container and started installing Python packages.
Here is a configuration (as reported on the autoscaler task's configuration tab under resource_configurations) that worked for me with AWS credentials that specify us-east-1 as the region, using base docker image nvidia/cuda:11.2.2-runtime-ubuntu20.04:

    [
      {
        "resource_name": "aws_default",
        "instance_type": "g4dn.2xlarge",
        "cpu_only": false,
        "is_spot": false,
        "regular_instance_rollback": false,
        "regular_instance_rollback_timeout": null,
        "availability_zone": "us-east-1a",
        "ami_id": "ami-003f25e6e2d2db8f1",
        "num_instances": 3,
        "queue_name": "default",
        "tags": "owner=lavi",
        "ebs_device_name": "/dev/sda1",
        "ebs_volume_size": 500,
        "ebs_volume_type": "gp3",
        "key_name": null,
        "security_group_ids": null,
        "subnet_id": null
      },
      {
        "resource_name": "aws_services",
        "instance_type": "m5.large",
        "cpu_only": true,
        "is_spot": false,
        "regular_instance_rollback": false,
        "regular_instance_rollback_timeout": null,
        "availability_zone": "us-east-1a",
        "ami_id": "ami-040d909ea4e56f8f3",
        "num_instances": 2,
        "queue_name": "services",
        "tags": "owner=lavi",
        "ebs_device_name": "/dev/sda1",
        "ebs_volume_size": 500,
        "ebs_volume_type": "gp3",
        "key_name": null,
        "security_group_ids": null,
        "subnet_id": null
      }
    ]
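A rough way to catch a region mismatch in a configuration like this before launching the autoscaler is sketched below. The `region_of` and `check_resources` helpers are made up for illustration, and the check assumes standard availability zone names (region name plus one letter, e.g. us-east-1a); Local Zones use longer names and would need different handling.

```python
def region_of(availability_zone):
    """Assumption: a standard AZ name is the region name plus one letter."""
    return availability_zone[:-1]


def check_resources(resources, credentials_region):
    """Return one message per resource whose zone is outside the credentials region."""
    problems = []
    for res in resources:
        zone = res["availability_zone"]
        if region_of(zone) != credentials_region:
            problems.append(
                f"{res['resource_name']}: zone {zone} is not in {credentials_region}"
            )
    return problems


# Trimmed copy of the two resources from the configuration above.
resources = [
    {"resource_name": "aws_default", "availability_zone": "us-east-1a"},
    {"resource_name": "aws_services", "availability_zone": "us-east-1a"},
]

print(check_resources(resources, "us-east-1"))  # no problems expected: []
for message in check_resources(resources, "eu-west-1"):
    print(message)
```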
Perhaps the defaults https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py should be updated?