Hi all,
I am trying to spin up some AWS autoscaler instances, but I seem to have some issues with the instance creation:
Thanks @UnevenDolphin73, I realized that with a specified AMI it works a bit better. I tried with this one: ami-0735c191cf914754d, which seems to be one of the standard AMIs. But even in that case the instance just freezes:
I also tried ami-0f1a5f5ada0e7da53, which is [Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type]:
2023-02-23 22:10:16,862 - clearml.Auto-Scaler - INFO - Autoscaler started
2023-02-23 22:10:16,862 - clearml.Auto-Scaler - INFO - state change: State.READY -> State.RUNNING
2023-02-23 22:10:17,611 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws_t3.medium', prefix='aws-cpu-9', queue='cpu-queue'
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 22:10:17,612 - clearml.Auto-Scaler - INFO - monitor spots started
2023-02-23 22:10:17,655 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws_t3.medium
2023-02-23 22:10:18,006 - clearml.Auto-Scaler - INFO - --- Cloud instances (0) ---
ClearML Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring
2023-02-23 22:10:18,613 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 22:10:19,242 - clearml.Auto-Scaler - INFO - New instance i-0a8239d165fbac686 listening to cpu-queue queue
2023-02-23 14:11:20
2023-02-23 22:11:16,606 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 22:11:19,117 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:11:19,465 - clearml.Auto-Scaler - INFO - --- Cloud instances (1) ---
2023-02-23 22:11:19,466 - clearml.Auto-Scaler - INFO - i-0a8239d165fbac686, regular
2023-02-23 22:11:19,890 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 14:12:21
2023-02-23 22:12:16,632 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 22:12:20,349 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:12:20,682 - clearml.Auto-Scaler - INFO - --- Cloud instances (1) ---
2023-02-23 22:12:20,683 - clearml.Auto-Scaler - INFO - i-0a8239d165fbac686, regular
2023-02-23 22:12:21,017 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2023-02-23 14:13:17
2023-02-23 22:13:16,660 - usage_reporter - INFO - Sending usage report for 60 usage seconds, 1 units
2023-02-23 14:13:22
ClearML Monitor: Could not detect iteration reporting, falling back to iterations as seconds-from-start
2023-02-23 22:13:21,498 - clearml.Auto-Scaler - INFO - Spinning down stuck worker: 'aws-cpu-9:aws_t3.medium:t3.medium:i-0a8239d165fbac686'
2023-02-23 22:13:21,934 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'aws-cpu-9:aws_t3.medium:t3.medium:i-0a8239d165fbac686'
2023-02-23 22:13:21,968 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'cpu-queue'
2023-02-23 22:13:21,974 - clearml.Auto-Scaler - INFO - Spinning new instance resource='aws_t3.medium', prefix='aws-cpu-9', queue='cpu-queue'
2023-02-23 22:13:21,984 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2023-02-23 22:13:22,001 - clearml.Auto-Scaler - INFO - Creating regular instance for resource aws_t3.medium
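To put a number on the "freeze", here is a minimal sketch, assuming the log format shown above, that measures how long an instance was up before the autoscaler declared it stuck. The function and variable names are hypothetical helpers for reading this log, not part of ClearML.

```python
import re
from datetime import datetime

# Two lines copied verbatim from the autoscaler log above.
LOG = """\
2023-02-23 22:10:19,242 - clearml.Auto-Scaler - INFO - New instance i-0a8239d165fbac686 listening to cpu-queue queue
2023-02-23 22:13:21,498 - clearml.Auto-Scaler - INFO - Spinning down stuck worker: 'aws-cpu-9:aws_t3.medium:t3.medium:i-0a8239d165fbac686'
"""

# Timestamp prefix: date and time, then milliseconds after the comma.
TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})"
PREFIX = TS + r" - clearml\.Auto-Scaler - INFO - "
NEW_RE = re.compile(PREFIX + r"New instance (i-[0-9a-f]+) listening")
STUCK_RE = re.compile(PREFIX + r"Spinning down stuck worker: '[^']*:(i-[0-9a-f]+)'")


def parse_ts(stamp, millis):
    """Turn the log's 'YYYY-MM-DD HH:MM:SS' plus milliseconds into a datetime."""
    base = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    return base.replace(microsecond=int(millis) * 1000)


def stuck_durations(log_text):
    """Map instance id -> seconds between launch and the 'stuck worker' spin-down."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        m = NEW_RE.match(line)
        if m:
            started[m.group(3)] = parse_ts(m.group(1), m.group(2))
            continue
        m = STUCK_RE.match(line)
        if m and m.group(3) in started:
            stopped = parse_ts(m.group(1), m.group(2))
            durations[m.group(3)] = (stopped - started[m.group(3)]).total_seconds()
    return durations


for instance_id, seconds in stuck_durations(LOG).items():
    print(f"{instance_id} was up for {seconds:.1f}s before being marked stuck")
```

In this log the instance is spun down roughly three minutes after launch, without the worker ever picking up the task from 'cpu-queue'.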
Also, I don't quite know what a reasonable Docker image is for running CPU processing with Python. For ClearML GPU Compute I use nvidia/cuda:11.4.3-runtime-ubuntu20.04, which seems to work fine.
But I would like to try bigger GPU machines on AWS, and also some CPU workloads on AWS. Are there any recommendations or known working combinations of AMI and Docker image?