Seems like you're missing an image definition (AMI or otherwise)
Thanks @<1523701083040387072:profile|UnevenDolphin73> , I realized that it works a bit better with a specified AMI. I tried ami-0735c191cf914754d, which seems to be one of the standard AMIs, but even then the instance just freezes (see the log below).
I also tried ami-0f1a5f5ada0e7da53, which is [ Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type ].
2023-02-23 22:10:16,862 - clearml.Auto-Scaler - INFO - Autoscaler started
2023-02-23 22:10:16,862 - clearml.Auto-Scaler - INFO - state change: State.READY -> State.RUNNING
2023-02-23 22:10:17,611 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws_t3.medium', prefix='aws-cpu-9', queue='cpu-queue'
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - monitor spots started
2023-02-23 22:10:17,655 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws_t3.medium
2023-02-23 22:10:18,006 - clearml.Auto-Scaler - INFO - --- Cloud instances (0) ---
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2023-02-23 22:10:18,613 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 22:10:19,242 - clearml.Auto-Scaler - INFO - New instance i-0a8239d165fbac686 listening to cpu-queue queue
2023-02-23 14:11:20
2023-02-23 22:11:16,606 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 22:11:19,117 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:11:19,465 - clearml.Auto-Scaler - INFO - --- Cloud instances (1) ---
2023-02-23 22:11:19,466 - clearml.Auto-Scaler - INFO - i-0a8239d165fbac686, regular
2023-02-23 22:11:19,890 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 14:12:21
2023-02-23 22:12:16,632 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 22:12:20,349 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:12:20,682 - clearml.Auto-Scaler - INFO - --- Cloud instances (1) ---
2023-02-23 22:12:20,683 - clearml.Auto-Scaler - INFO - i-0a8239d165fbac686, regular
2023-02-23 22:12:21,017 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 14:13:17
2023-02-23 22:13:16,660 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 14:13:22
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2023-02-23 22:13:21,498 - clearml.Auto-Scaler - INFO - Spinning down stuck worker: 'aws-cpu-9:aws_t3.medium:t3.medium:i-0a8239d165fbac686'
2023-02-23 22:13:21,934 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'aws-cpu-9:aws_t3.medium:t3.medium:i-0a8239d165fbac686'
2023-02-23 22:13:21,968 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:13:21,974 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws_t3.medium', prefix='aws-cpu-9', queue='cpu-queue'
2023-02-23 22:13:21,984 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 22:13:22,001 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws_t3.medium
Also, I don't quite know what a reasonable Docker image is for running some CPU processing with Python. For the ClearML GPU Compute I use nvidia/cuda:11.4.3-runtime-ubuntu20.04, which seems to be working fine.
But I would like to try bigger GPU machines on AWS and also some CPU loads on AWS. Any recommendation or working combinations of AMI and Docker image?
Any recommendation or working combinations of AMI
I would take the deep learning AMIs from Nvidia / AWS; I think they work on both CPU and GPU machines.
In terms of Docker images: Python images for CPU and the Nvidia runtime images for GPU.
https://hub.docker.com/layers/library/python/3.11.2-bullseye/images/sha256-6128ea86d[…]d2c01646d599352f6ddd9893420eb815a06c3b90619f8?context=explore
I tried the following AMIs: ami-0a4f5a73cdd47fd59 and ami-0dacd2425b81201fb. Both fail:
Error: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-0a4f5a73cdd47fd59]' does not exist
Error: An error occurred (InvalidAMIID.NotFound) when calling the RunInstances operation: The image id '[ami-0dacd2425b81201fb]' does not exist
Not quite sure how to proceed. Any suggestion of what a working combination would look like would be appreciated. Also, those NVIDIA AMIs seem to be mainly for large GPU instances; I'm not sure whether it's possible to run them on a CPU instance as well? @<1523701205467926528:profile|AgitatedDove14>
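@<1539780258050347008:profile|CheerfulKoala77> make sure the AMI ID matches the region of the EC2 machine. AMI IDs are region-specific, so an ID copied from another region produces exactly this InvalidAMIID.NotFound error.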
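A quick way to check this before launching the autoscaler is to ask EC2 whether the AMI resolves in the region you plan to use. The snippet below is only a sketch: `ami_available` and `check_in_region` are hypothetical helper names, and it assumes boto3 is installed and AWS credentials are already configured. It relies on boto3's `describe_images` call, which raises a `ClientError` with an `InvalidAMIID.*` code when the ID is unknown in that region.

```python
def ami_available(ec2_client, ami_id):
    """Return True if ami_id resolves in the region the client is bound to."""
    try:
        response = ec2_client.describe_images(ImageIds=[ami_id])
    except Exception as err:
        # botocore's ClientError exposes the AWS error code under err.response
        code = getattr(err, "response", {}).get("Error", {}).get("Code", "")
        if code.startswith("InvalidAMIID"):
            return False
        raise  # anything else (credentials, throttling, ...) is a real failure
    return len(response.get("Images", [])) > 0


def check_in_region(ami_id, region_name):
    """Convenience wrapper that builds a real EC2 client for one region."""
    import boto3  # imported lazily so ami_available has no hard dependency

    return ami_available(boto3.client("ec2", region_name=region_name), ami_id)
```

For example, `check_in_region("ami-0a4f5a73cdd47fd59", "us-west-2")` (the region name here is just a placeholder) should return False if that AMI is not published in the region the autoscaler is configured for.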
Okay great, I think I got it to work now with this AMI: ami-0c17f9e857dbd4c40 and with the python:3.10.10-bullseye Docker image.
My next issue is creating the autoscaler via this script. The script runs through and I see a task which finishes successfully.
But I can't find the autoscaler anywhere in the web UI.
Also this message suggests that I can change the configuration, but as said I can't find it anywhere and wouldn't know how to change the configuration.
print("AWS Autoscaler setup wizard\n"
"---------------------------\n"
"Follow the wizard to configure your AWS auto-scaler service.\n"
"Once completed, you will be able to view and change the configuration in the clearml-server web UI.\n"
"It means there is no need to worry about typos or mistakes :)\n")
Also this message suggests that I can change the configuration, but as said I can't find it anywhere and wouldn't know how to change the configuration.
This means that you can launch a new one directly from the web UI (i.e. abort the task, clone it, edit the configuration of the clone in the UI, and enqueue it). Unfortunately it does not support changing the configuration "live"
Unfortunately it does not support changing the configuration "live"
That's okay, that's not so important to me. I'm mainly interested in seeing how many autoscalers I currently have active and which ones they are. But in the application tab I only see the ones that I created online:
I don't seem to be able to track the ones that I created with that script. Am I misunderstanding something?
Yes, the one you create manually is not really of the same "type" as the one you create online, which is why you do not see it there 😞
Okay, can I somehow query how many manually/script-created autoscalers I have, and how would I delete them again? Is there a way to query the status and potentially some console output of those manually/script-created autoscalers?
Sure, go to "All Projects" and filter by Task Type: application / service
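The same filter can probably be applied from the SDK as well. This is a rough sketch, not something confirmed in this thread: it assumes `Task.get_tasks` accepts a `task_filter` dict with a `type` list (as described in the ClearML SDK reference), and `summarize_tasks` / `summarize_from_server` are hypothetical helper names.

```python
def summarize_tasks(get_tasks, task_types=("application", "service")):
    """Return (id, name, status) for every task of the given task types.

    get_tasks is any callable with the same keyword interface as
    clearml.Task.get_tasks (assumed: task_filter={"type": [...]}).
    """
    tasks = get_tasks(task_filter={"type": list(task_types)})
    return [(task.id, task.name, str(task.status)) for task in tasks]


def summarize_from_server():
    """Query the configured ClearML server (needs clearml and a clearml.conf)."""
    from clearml import Task  # imported lazily, only needed for the real query

    return summarize_tasks(Task.get_tasks)
```

Calling `summarize_from_server()` and printing the result would list the script-created autoscaler tasks together with their status, so a "completed" or "aborted" entry is one that is not currently running.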
Okay, I see the picture below; is that what you're referring to? That experiment says it's completed, does it mean that the autoscaler is running or not? For me it sounds like the starting of the service is completed but I don't really see if the autoscaler is actually running. Also I don't see any output in the console of the autoscaler.
That experiment says it's completed, does it mean that the autoscaler is running or not?
Not running; it will show as "running" only while it is actually being executed
For me it sounds like the starting of the service is completed but I don't really see if the autoscaler is actually running. Also I don't see any output in the console of the autoscaler.
Do notice that the autoscaler code itself needs to run somewhere: by default it runs on your machine, or it can run on a remote agent.
@<1539780258050347008:profile|CheerfulKoala77> you may also need to define a subnet or security groups.
Personally I do not see the point in Docker over EC2 instances for CPU instances (virtualization on top of virtualization).
Finally, just to make sure, you only ever need one autoscaler. You can monitor multiple queues with multiple instance types with one autoscaler.
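To illustrate the "one autoscaler, several queues" point, here is a rough sketch of how such a configuration could be laid out as a Python dict. The key names (`resource_configurations`, `queues`, `instance_type`, `ami_id`, `is_spot`) follow the layout I recall from the ClearML AWS autoscaler example config, so treat them as assumptions and check them against your own generated config. The GPU entry (`aws_gpu`, `g4dn.xlarge`, `gpu-queue`, the placeholder AMI) is purely hypothetical, and `undefined_resources` is a hypothetical sanity-check helper.

```python
autoscaler_config = {
    # One entry per instance type the autoscaler is allowed to spin up.
    "resource_configurations": {
        "aws_t3.medium": {
            "instance_type": "t3.medium",
            "ami_id": "ami-0c17f9e857dbd4c40",  # the AMI that worked above
            "is_spot": False,
        },
        "aws_gpu": {  # hypothetical GPU resource
            "instance_type": "g4dn.xlarge",
            "ami_id": "ami-REPLACE-ME",  # placeholder, must exist in your region
            "is_spot": False,
        },
    },
    # queue name -> list of (resource name, max number of instances)
    "queues": {
        "cpu-queue": [("aws_t3.medium", 2)],
        "gpu-queue": [("aws_gpu", 1)],
    },
}


def undefined_resources(config):
    """Return resource names used by a queue but missing from resource_configurations."""
    defined = set(config["resource_configurations"])
    referenced = {
        name for entries in config["queues"].values() for name, _limit in entries
    }
    return sorted(referenced - defined)
```

With this layout a single autoscaler task watches both queues, and `undefined_resources(autoscaler_config)` returning an empty list is a cheap check that every queue points at a defined instance type.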