Hi All, I Am Trying To Spin Up Some Aws Autoscaler Instances, But I Seem To Have Some Issues With The Instance Creation:

Hi all,
I am trying to spin up some AWS autoscaler instances, but I seem to have some issues with the instance creation:


2023-02-23 21:04:29,122 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws-t3.medium', prefix='aws_cpu_0', queue='cpu-queue'
2023-02-23 21:04:29,122 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 21:04:29,123 - clearml.Auto-Scaler - INFO - monitor spots started
2023-02-23 21:04:29,162 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws-t3.medium
2023-02-23 21:04:29,164 - clearml.Auto-Scaler - WARNING - spinning up worker without specific subnet or availability zone FAILED
2023-02-23 21:04:29,164 - clearml.Auto-Scaler - ERROR - Failed to start new instance (resource 'aws-t3.medium'), Error: Parameter validation failed:
Invalid type for parameter ImageId, value: None, type: <class 'NoneType'>, valid types: <class 'str'>
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2023-02-23 21:04:29,515 - clearml.Auto-Scaler - INFO - --- Cloud instances (0) ---
2023-02-23 21:04:30,112 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds

I followed the steps here: None to grant myself the required policies; I went with the simple approach, using just:

{                  
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances",
                "ec2:TerminateInstances",
                "ec2:RequestSpotInstances",
                "ec2:DeleteTags",
                "ec2:CreateTags",
                "ec2:RunInstances",
                "ec2:DescribeSpotInstanceRequests",
                "ec2:GetConsoleOutput"
            ],
            "Resource": "*"
        }
    ]
}

and attached it to my own user.
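A policy like the one above can be checked against the list of actions it is supposed to grant. The following is only a minimal sketch: `REQUIRED_ACTIONS` is copied from the policy quoted in this post (whether that list is sufficient for the autoscaler is not confirmed here), and `missing_actions` is a hypothetical helper, not part of ClearML or boto3. It ignores `Deny` statements, `Resource` and `Condition` blocks, and partial wildcards such as `ec2:Describe*`.

```python
import json

# Actions copied from the policy quoted above (assumed, not verified,
# to be what the autoscaler needs).
REQUIRED_ACTIONS = {
    "ec2:DescribeInstances",
    "ec2:TerminateInstances",
    "ec2:RequestSpotInstances",
    "ec2:DeleteTags",
    "ec2:CreateTags",
    "ec2:RunInstances",
    "ec2:DescribeSpotInstanceRequests",
    "ec2:GetConsoleOutput",
}


def missing_actions(policy_document, required=REQUIRED_ACTIONS):
    """Return the required actions that no Allow statement grants by name."""
    if isinstance(policy_document, str):
        policy = json.loads(policy_document)
    else:
        policy = policy_document

    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]

    granted = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a single action string
            actions = [actions]
        granted.update(actions)

    # Only the two full wildcards are understood by this sketch.
    if "*" in granted or "ec2:*" in granted:
        return set()
    return set(required) - granted
```

Calling `missing_actions(policy_json_text)` on the policy text returns an empty set when every listed action is granted by name.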

In the wizard I only fill out the required fields marked with *.

  • I don't set an IAM instance profile, a VPC subnet ID, an availability zone, or an AMI ID. Which part might be preventing the instance from being created on AWS?
  
  
Posted one year ago

Answers 18


Okay, I see. Is the picture below what you are referring to? That experiment says it's completed; does that mean the autoscaler is running or not? To me it sounds like the startup of the service has completed, but I can't really tell whether the autoscaler is actually running. Also, I don't see any output in the autoscaler's console.
image

  
  
Posted one year ago

Seems like you're missing an image definition (AMI or otherwise)
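The `ImageId ... value: None` error in the log above is what one would expect when no AMI is configured at all. As a rough sketch, a pre-flight check on a resource definition could look like the following. The `ami_id` key and the `check_image_id` helper are hypothetical names chosen for illustration, not ClearML's actual configuration schema; the pattern assumes the usual `ami-` prefix followed by 8 or 17 hexadecimal characters.

```python
import re

# Assumed AMI ID shape: "ami-" plus 8 (legacy) or 17 hex characters.
AMI_ID_PATTERN = re.compile(r"^ami-(?:[0-9a-f]{8}|[0-9a-f]{17})$")


def check_image_id(resource_conf):
    """Return a list of problems found in the hypothetical 'ami_id' field."""
    ami_id = resource_conf.get("ami_id")
    if ami_id is None or ami_id == "":
        return ["ami_id is not set, so ImageId would be passed as None"]
    if not isinstance(ami_id, str):
        return ["ami_id must be a string, got %s" % type(ami_id).__name__]
    if not AMI_ID_PATTERN.match(ami_id.strip()):
        return ["ami_id %r does not look like an AMI ID" % ami_id]
    return []


# Example: a resource definition with no image fails the check.
for problem in check_image_id({"instance_type": "t3.medium"}):
    print(problem)
```

A well-formed ID can still be rejected by AWS if the image does not exist in the region being used, so this only catches the missing-value case early.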

  
  
Posted one year ago

To me it sounds like the startup of the service has completed, but I can't really tell whether the autoscaler is actually running. Also, I don't see any output in the autoscaler's console.

Note that the autoscaler code itself needs to run somewhere: by default it runs on your machine, or it can run on a remote agent.

  
  
Posted one year ago

That experiment says it's completed, does it mean that the autoscaler is running or not?

Not running. It will show as "running" only while it is actually being executed.

  
  
Posted one year ago

My next issue is creating the autoscaler via this script. The script runs through and I see a task that finishes successfully.
But I can't find the autoscaler anywhere in the web UI.

Also, this message suggests that I can change the configuration, but as I said, I can't find it anywhere and wouldn't know how to change the configuration.

        print("AWS Autoscaler setup wizard\n"
              "---------------------------\n"
              "Follow the wizard to configure your AWS auto-scaler service.\n"
              "Once completed, you will be able to view and change the configuration in the clearml-server web UI.\n"
              "It means there is no need to worry about typos or mistakes :)\n")


  
  
Posted one year ago

Okay, great, I think I got it to work now with this AMI: ami-0c17f9e857dbd4c40 and with the python:3.10.10-bullseye Docker image.

  
  
Posted one year ago

@<1539780258050347008:profile|CheerfulKoala77> you may also need to define a subnet or security groups.
Personally, I do not see the point of running Docker on top of EC2 for CPU instances (virtualization on top of virtualization).
Finally, just to make sure: you only ever need one autoscaler. A single autoscaler can monitor multiple queues with multiple instance types.

  
  
Posted one year ago

None

  
  
Posted one year ago

I'm not quite sure how to proceed. Any suggestion of what a working combination would look like would be appreciated. Also, those NVIDIA AMIs seem to be mainly for large GPU instances; I'm not sure whether it's possible to run them on a CPU instance as well. @<1523701205467926528:profile|AgitatedDove14>

  
  
Posted one year ago

Also, this message suggests that I can change the configuration, but as I said, I can't find it anywhere and wouldn't know how to change the configuration.

This means that you can launch a new one (i.e. abort, clone, edit, enqueue) directly from the web UI and edit the configuration there. Unfortunately, it does not support changing the configuration "live".

  
  
Posted one year ago

Okay, can I somehow query how many manually created or scripted autoscalers I have, and how would I delete them again? Is there a way to query the status, and potentially some console output, of those manually created or scripted autoscalers?

  
  
Posted one year ago

Any recommendation or working combinations of AMI

I would take the deep learning AMIs from NVIDIA on AWS; I think they work on both CPU and GPU machines.
In terms of Docker images: the Python images for CPU and the NVIDIA runtime images for GPU.
https://hub.docker.com/layers/library/python/3.11.2-bullseye/images/sha256-6128ea86d[…]d2c01646d599352f6ddd9893420eb815a06c3b90619f8?context=explore

  
  
Posted one year ago

Unfortunately it does not support changing the configuration "live"

That's okay, that's not so important to me. I'm mainly interested in seeing how many autoscalers I currently have active and which ones they are. But in the Applications tab I only see the ones that I created online:
I don't seem to be able to track the ones that I created with that script. Am I misunderstanding something?
image

  
  
Posted one year ago

I tried the following AMI:

ami-0a4f5a73cdd47fd59 and ami-0dacd2425b81201fb

Error: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-0a4f5a73cdd47fd59]' does not exist
 Error: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-0dacd2425b81201fb]' does not exist


  
  
Posted one year ago

@<1539780258050347008:profile|CheerfulKoala77> make sure the AMI ID matches the region of the EC2 machine (AMI IDs are region-specific).
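AMI IDs are specific to an AWS region, so one way to check an ID before handing it to the autoscaler is to ask EC2 in the target region whether it knows the image. This is a sketch under assumptions: `ami_exists` is a hypothetical helper, the client is expected to behave like a boto3 EC2 client (its `describe_images` call is used), and the region name in the usage comment is only an example.

```python
def ami_exists(ec2_client, ami_id):
    """Return True if the client's region knows this AMI ID, else False.

    ec2_client is expected to behave like boto3.client("ec2", region_name=...).
    """
    try:
        response = ec2_client.describe_images(ImageIds=[ami_id])
    except Exception as err:
        # botocore's ClientError carries the AWS error code in err.response.
        details = getattr(err, "response", None)
        code = ""
        if isinstance(details, dict):
            code = details.get("Error", {}).get("Code", "")
        if code.startswith("InvalidAMIID"):
            return False
        raise  # credentials, networking, permissions: let the caller see it
    return bool(response.get("Images"))


# Usage sketch (requires boto3 and AWS credentials; region is an example):
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   print(ami_exists(ec2, "ami-0a4f5a73cdd47fd59"))
```

The caller's identity also needs permission for `ec2:DescribeImages`, which is not part of the policy quoted in the question.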

  
  
Posted one year ago

Sure, go to "All Projects" and filter by Task Type: application / service.

  
  
Posted one year ago

Yes, the one you create manually is not really of the same "type" as the one you create online; this is why you do not see it there 😞

  
  
Posted one year ago

Thanks @<1523701083040387072:profile|UnevenDolphin73>, I realized that it works a bit better with a specified AMI. I tried this one: ami-0735c191cf914754d, which seems to be one of the standard AMIs. But in that case, too, the instance just freezes.

I also tried ami-0f1a5f5ada0e7da53, which is [ Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type ]:

2023-02-23 22:10:16,862 - clearml.Auto-Scaler - INFO - Autoscaler started
2023-02-23 22:10:16,862 - clearml.Auto-Scaler - INFO - state change: State.READY -> State.RUNNING
2023-02-23 22:10:17,611 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws_t3.medium', prefix='aws-cpu-9', queue='cpu-queue'
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - monitor spots started
2023-02-23 22:10:17,655 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws_t3.medium
2023-02-23 22:10:18,006 - clearml.Auto-Scaler - INFO - --- Cloud instances (0) ---
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2023-02-23 22:10:18,613 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 22:10:19,242 - clearml.Auto-Scaler - INFO - New instance i-0a8239d165fbac686 listening to cpu-queue queue
2023-02-23 14:11:20
2023-02-23 22:11:16,606 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 22:11:19,117 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:11:19,465 - clearml.Auto-Scaler - INFO - --- Cloud instances (1) ---
2023-02-23 22:11:19,466 - clearml.Auto-Scaler - INFO - i-0a8239d165fbac686, regular
2023-02-23 22:11:19,890 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 14:12:21
2023-02-23 22:12:16,632 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 22:12:20,349 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:12:20,682 - clearml.Auto-Scaler - INFO - --- Cloud instances (1) ---
2023-02-23 22:12:20,683 - clearml.Auto-Scaler - INFO - i-0a8239d165fbac686, regular
2023-02-23 22:12:21,017 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 14:13:17
2023-02-23 22:13:16,660 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 14:13:22
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2023-02-23 22:13:21,498 - clearml.Auto-Scaler - INFO - Spinning down stuck worker: 'aws-cpu-9:aws_t3.medium:t3.medium:i-0a8239d165fbac686'
2023-02-23 22:13:21,934 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'aws-cpu-9:aws_t3.medium:t3.medium:i-0a8239d165fbac686'
2023-02-23 22:13:21,968 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:13:21,974 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws_t3.medium', prefix='aws-cpu-9', queue='cpu-queue'
2023-02-23 22:13:21,984 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 22:13:22,001 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws_t3.medium

Also, I don't quite know what a reasonable Docker image is for running CPU processing with Python. For the ClearML GPU Compute I use nvidia/cuda:11.4.3-runtime-ubuntu20.04, which seems to work fine.

But I would like to try bigger GPU machines on AWS, and also some CPU workloads on AWS. Any recommendations or working combinations of AMI and Docker image?

  
  
Posted one year ago