Hi all, I am trying to spin up some AWS autoscaler instances, but I seem to have some issues with the instance creation

Answered

Hi all,
I am trying to spin up some AWS autoscaler instances, but I seem to have some issues with the instance creation:


2023-02-23 21:04:29,122 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws-t3.medium', prefix='aws_cpu_0', queue='cpu-queue'
2023-02-23 21:04:29,122 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 21:04:29,123 - clearml.Auto-Scaler - INFO - monitor spots started
2023-02-23 21:04:29,162 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws-t3.medium
2023-02-23 21:04:29,164 - clearml.Auto-Scaler - WARNING - spinning up worker without specific subnet or availability zone FAILED
2023-02-23 21:04:29,164 - clearml.Auto-Scaler - ERROR - Failed to start new instance (resource 'aws-t3.medium'), Error: Parameter validation failed:
Invalid type for parameter ImageId, value: None, type: <class 'NoneType'>, valid types: <class 'str'>
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2023-02-23 21:04:29,515 - clearml.Auto-Scaler - INFO - --- Cloud instances (0) ---
2023-02-23 21:04:30,112 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds

I followed the documented steps to give myself the required policies; I went with the simple approach, using just:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:TerminateInstances",
                "ec2:RequestSpotInstances",
                "ec2:DeleteTags",
                "ec2:CreateTags",
                "ec2:RunInstances",
                "ec2:DescribeSpotInstanceRequests",
                "ec2:GetConsoleOutput"
            ],
            "Resource": "*"
        }
    ]
}

and attached it to myself.

In the wizard I only filled out the required fields marked with *.

I did not set an IAM instance profile, VPC subnet ID, availability zone, or AMI ID. Which part might be preventing the instance from being created on AWS?

  				
Posted 2 years ago by CheerfulKoala77

Answers 18

@CheerfulKoala77 make sure the AMI ID matches the region of the EC2 machine (AMI IDs are region-specific).
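An AMI ID is only valid in the AWS region where that image is registered, which is one common cause of `InvalidAMIID.NotFound`. The following is a minimal sketch of how one might check this before configuring the autoscaler. The helper name `ami_exists` is hypothetical, and it assumes the credentials in use also have the `ec2:DescribeImages` permission, which is not part of the policy quoted in the question.

```python
def ami_exists(ec2_client, ami_id):
    """Return True if `ami_id` is visible through `ec2_client` (i.e. in its region)."""
    try:
        response = ec2_client.describe_images(ImageIds=[ami_id])
    except Exception as exc:
        # botocore's ClientError carries the AWS error code in `exc.response`.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code", "")
        if code.startswith("InvalidAMIID"):
            return False
        # Anything else (e.g. missing permissions) is re-raised unchanged.
        raise
    return len(response.get("Images", [])) > 0


# Usage (requires boto3 and AWS credentials; the region is an assumption):
# import boto3
# ec2 = boto3.client("ec2", region_name="us-west-2")
# print(ami_exists(ec2, "ami-0c17f9e857dbd4c40"))
```

If this returns False for the region the autoscaler is configured with, look up the equivalent image ID for that region rather than reusing one copied from elsewhere.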

  				
Posted 2 years ago by AgitatedDove14

> Any recommendation or working combinations of AMI

I would take the Deep Learning AMIs from NVIDIA on AWS; I think they work on both CPU and GPU machines.
As for Docker images: a Python image for CPU and an NVIDIA runtime image for GPU.
https://hub.docker.com/layers/library/python/3.11.2-bullseye/images/sha256-6128ea86d[…]d2c01646d599352f6ddd9893420eb815a06c3b90619f8?context=explore

  				
Posted 2 years ago by AgitatedDove14

Sure, go to "All Projects" and filter by Task Type: application / service.
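The same filter can be applied from Python. This is a minimal sketch, assuming the script-created autoscaler was registered with task type `service` (as ClearML's example script does) and that `Task.get_tasks` accepts `type` and `status` lists in `task_filter`; the helper `list_service_tasks` is hypothetical and takes the query function as an argument so it can be tried without a server.

```python
def list_service_tasks(get_tasks, statuses=("in_progress",)):
    """Return (id, name, status) for service/application tasks in the given states."""
    task_filter = {
        "type": ["service", "application"],
        "status": list(statuses),
    }
    tasks = get_tasks(task_filter=task_filter)
    return [(t.id, t.name, t.status) for t in tasks]


# Usage (requires the clearml package and a configured server):
# from clearml import Task
# for task_id, name, status in list_service_tasks(Task.get_tasks):
#     print(task_id, name, status)
#
# A task found this way can be inspected or stopped by ID, for example:
# task = Task.get_task(task_id=task_id)
# print(task.get_reported_console_output(number_of_reports=5))
# task.mark_stopped()
```

Passing `statuses=("in_progress", "completed", "stopped")` would also list autoscaler tasks that are no longer running.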

  				
Posted 2 years ago by AgitatedDove14

Seems like you're missing an image definition (AMI or otherwise)
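The `Invalid type for parameter ImageId, value: None` message in the log is a client-side parameter validation error: no AMI ID was set, so the request was rejected before reaching AWS. Below is a minimal sketch of a pre-flight check on a resource configuration. The key names `instance_type` and `ami_id` are modeled on ClearML's AWS autoscaler example and should be treated as assumptions, and `find_config_problems` is a hypothetical helper.

```python
import re

# AMI IDs start with "ami-" followed by hexadecimal characters.
AMI_ID_PATTERN = re.compile(r"^ami-[0-9a-f]+$")


def find_config_problems(resource_conf):
    """Return a list of human-readable problems in one resource configuration."""
    problems = []
    if not resource_conf.get("instance_type"):
        problems.append("instance_type is missing")
    ami_id = resource_conf.get("ami_id")
    if not isinstance(ami_id, str) or not ami_id:
        # This is the case that leads to ImageId=None in the log above.
        problems.append("ami_id is missing")
    elif not AMI_ID_PATTERN.match(ami_id):
        problems.append("ami_id %r does not look like an AMI ID" % ami_id)
    return problems


# Example: the configuration from the question, with no AMI set.
print(find_config_problems({"instance_type": "t3.medium", "ami_id": None}))
```

A check like this only catches a missing or malformed ID; whether the image actually exists in the target region still has to be verified against AWS.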

  				
Posted 2 years ago by UnevenDolphin73

Not quite sure how to proceed. Any suggestion of what a working combination would look like would be appreciated. Also, those NVIDIA AMIs seem to be mainly for large GPU instances; I'm not sure whether they can also run on a CPU instance. @AgitatedDove14

  				
Posted 2 years ago by CheerfulKoala77

None

  				
Posted 2 years ago by AgitatedDove14

> That experiment says it's completed, does it mean that the autoscaler is running or not?

Not running. It will show as "running" only while it is actually being executed.

  				
Posted 2 years ago by AgitatedDove14

Okay, can I somehow query how many manually/script-created autoscalers I have, and how would I delete them again? Is there a way to query the status, and potentially some console output, of those manually/script-created autoscalers?

  				
Posted 2 years ago by CheerfulKoala77

> Unfortunately it does not support changing the configuration "live"

That's okay, that's not so important to me. I'm mainly interested in seeing how many autoscalers I currently have active and which ones they are. But in the Applications tab I only see the ones that I created online.
I don't seem to be able to track the ones that I created with that script. Am I misunderstanding something?

  				
Posted 2 years ago by CheerfulKoala77

> Also this message suggests that I can change the configuration, but as said I can't find it anywhere and wouldn't know how to change the configuration.

This means that you can launch a new one (i.e. abort, clone, edit, enqueue) directly from the web UI and edit the configuration there. Unfortunately, it does not support changing the configuration "live".

  				
Posted 2 years ago by AgitatedDove14

Okay, I see. Is the picture below what you are referring to? That experiment says it's completed; does that mean the autoscaler is running or not? To me it sounds like starting the service has completed, but I can't really tell whether the autoscaler is actually running. Also, I don't see any output in the autoscaler's console.

  				
Posted 2 years ago by CheerfulKoala77

Okay, great, I think I got it to work now with this AMI: ami-0c17f9e857dbd4c40 and with the python:3.10.10-bullseye Docker image.

  				
Posted 2 years ago by CheerfulKoala77

My next issue is creating the autoscaler via this script. The script runs through and I see a task that finishes successfully.
But I can't find the autoscaler anywhere in the web UI.

Also, this message suggests that I can change the configuration, but as I said, I can't find it anywhere and wouldn't know how to change the configuration.

        print("AWS Autoscaler setup wizard\n"
              "---------------------------\n"
              "Follow the wizard to configure your AWS auto-scaler service.\n"
              "Once completed, you will be able to view and change the configuration in the clearml-server web UI.\n"
              "It means there is no need to worry about typos or mistakes :)\n")

  				
Posted 2 years ago by CheerfulKoala77

> For me it sounds like the starting of the service is completed but I don't really see if the autoscaler is actually running. Also I don't see any output in the console of the autoscaler.

Do note that the autoscaler code itself needs to run somewhere: by default it runs on your machine, or it can run on a remote agent.

  				
Posted 2 years ago by AgitatedDove14

@CheerfulKoala77 you may also need to define a subnet or security groups.
Personally, I do not see the point of running Docker on top of EC2 for CPU instances (virtualization on top of virtualization).
Finally, just to make sure: you only ever need one autoscaler. A single autoscaler can monitor multiple queues with multiple instance types.
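To illustrate the single-autoscaler setup, here is a minimal sketch of one configuration that serves two queues. The layout (resource names mapped to launch settings, and queues mapped to lists of `(resource name, max instances)`) is modeled on ClearML's AWS autoscaler example and should be treated as an assumption; `gpu-queue`, `aws_gpu`, the instance limits, and the AMI placeholders are hypothetical, as are the two helper functions.

```python
# Resource names map to EC2 launch settings. All values are placeholders.
resource_configurations = {
    "aws_t3.medium": {"instance_type": "t3.medium", "ami_id": "ami-REPLACE_ME", "is_spot": False},
    "aws_gpu": {"instance_type": "g4dn.xlarge", "ami_id": "ami-REPLACE_ME", "is_spot": False},
}

# One autoscaler, several queues: queue name -> [(resource name, max instances)].
queues = {
    "cpu-queue": [("aws_t3.medium", 2)],
    "gpu-queue": [("aws_gpu", 1)],
}


def undefined_resources(queues, resource_configurations):
    """Return resource names that a queue references but that are not defined."""
    referenced = {name for budget in queues.values() for name, _ in budget}
    return sorted(referenced - set(resource_configurations))


def max_total_instances(queues):
    """Upper bound on instances the autoscaler could run across all queues."""
    return sum(limit for budget in queues.values() for _, limit in budget)


print(undefined_resources(queues, resource_configurations))
print(max_total_instances(queues))
```

Checking that every queue points at a defined resource catches naming mismatches such as `aws-t3.medium` versus `aws_t3.medium` before the autoscaler starts.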

  				
Posted 2 years ago by UnevenDolphin73

Thanks @UnevenDolphin73, I realized that with a specified AMI it works a bit better. I tried this one: ami-0735c191cf914754d, which seems to be one of the standard AMIs. But in that case too, the instance just freezes.

I also tried ami-0f1a5f5ada0e7da53, which is "Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type":

2023-02-23 22:10:16,862 - clearml.Auto-Scaler - INFO - Autoscaler started
2023-02-23 22:10:16,862 - clearml.Auto-Scaler - INFO - state change: State.READY -> State.RUNNING
2023-02-23 22:10:17,611 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws_t3.medium', prefix='aws-cpu-9', queue='cpu-queue'
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - monitor spots started
2023-02-23 22:10:17,655 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws_t3.medium
2023-02-23 22:10:18,006 - clearml.Auto-Scaler - INFO - --- Cloud instances (0) ---
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2023-02-23 22:10:18,613 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 22:10:19,242 - clearml.Auto-Scaler - INFO - New instance i-0a8239d165fbac686 listening to cpu-queue queue
2023-02-23 14:11:20
2023-02-23 22:11:16,606 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 22:11:19,117 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:11:19,465 - clearml.Auto-Scaler - INFO - --- Cloud instances (1) ---
2023-02-23 22:11:19,466 - clearml.Auto-Scaler - INFO - i-0a8239d165fbac686, regular
2023-02-23 22:11:19,890 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 14:12:21
2023-02-23 22:12:16,632 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 22:12:20,349 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:12:20,682 - clearml.Auto-Scaler - INFO - --- Cloud instances (1) ---
2023-02-23 22:12:20,683 - clearml.Auto-Scaler - INFO - i-0a8239d165fbac686, regular
2023-02-23 22:12:21,017 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 14:13:17
2023-02-23 22:13:16,660 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 14:13:22
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2023-02-23 22:13:21,498 - clearml.Auto-Scaler - INFO - Spinning down stuck worker: 'aws-cpu-9:aws_t3.medium:t3.medium:i-0a8239d165fbac686'
2023-02-23 22:13:21,934 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'aws-cpu-9:aws_t3.medium:t3.medium:i-0a8239d165fbac686'
2023-02-23 22:13:21,968 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:13:21,974 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws_t3.medium', prefix='aws-cpu-9', queue='cpu-queue'
2023-02-23 22:13:21,984 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 22:13:22,001 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws_t3.medium

Also, I don't quite know what a reasonable Docker image is for running CPU processing with Python. For the ClearML GPU Compute I use nvidia/cuda:11.4.3-runtime-ubuntu20.04, which seems to be working fine.

But I would like to try bigger GPU machines on AWS, and also some CPU loads on AWS. Any recommendations or working combinations of AMI and Docker image?

  				
Posted 2 years ago by CheerfulKoala77

I tried the following AMIs:

ami-0a4f5a73cdd47fd59 and ami-0dacd2425b81201fb

Error: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-0a4f5a73cdd47fd59]' does not exist

Error: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-0dacd2425b81201fb]' does not exist

  				
Posted 2 years ago by CheerfulKoala77

Yes, the one you create manually is not really of the same "type" as the one you create online; this is why you do not see it there 😞

  				
Posted 2 years ago by AgitatedDove14
