Answered

Hi all,
I'm trying to set up the AWS autoscaler to spin up EC2 instances from a predefined AMI.
I was able to set up the autoscaler, but I'm experiencing some issues with spinning up the EC2 instance.
It seems like it keeps failing (spinning up an instance, then terminating, then spinning up a new one, and so on...).
I looked at the instance log, and it appears that for some reason the clearml_agent module can't be found.

Here are some issues I found in a sample log from one of the instances:

...
[  187.966370] cloud-init[1865]: Setting up libkeyutils1:amd64 (1.5.9-9.2ubuntu2.1) ...
[  187.982693] cloud-init[1865]: Setting up aws-neuron-runtime (1.6.24.0) ...
[  188.003002] cloud-init[1865]: No device found - you may need to install aws-neuron-dkms package
         Starting Load Kernel Modules...
[FAILED] Failed to start Load Kernel Modules.
See 'systemctl status systemd-modules-load.service' for details.
...
[  248.294021] cloud-init[1865]: -su: /usr/bin/bash: No such file or directory
[  248.296271] cloud-init[1865]: + python -m clearml_agent --config-file /root/clearml.conf daemon --queue default --docker nvidia/cuda:10.2-cudnn7-runtime-ubuntu18.04
         Stopping User Manager for UID 1000...
[  OK  ] Stopped User Manager for UID 1000.
[  OK  ] Removed slice User Slice of ubuntu.
[  248.310852] cloud-init[1865]: /clearml_agent_venv/bin/python: No module named clearml_agent
[  248.312187] cloud-init[1865]: + '[' 1 -ne 0 ']'
[  248.312869] cloud-init[1865]: + echo 'Shutting down clearml-agent. clearml-agent return code 0'
[  248.313011] cloud-init[1865]: Shutting down clearml-agent. clearml-agent return code 0
[  248.313300] cloud-init[1865]: + shutdown
[  248.318470] cloud-init[1865]: Shutdown scheduled for Wed 2025-01-22 08:40:18 UTC, use 'shutdown -c' to cancel.

Does anyone have an idea of how I should proceed from here?

  
  
Posted 2 months ago

Answers 14


Sure

  
  
Posted 2 months ago

this time it got stuck...

2025-01-22 12:54:32
2025-01-22 10:54:27,220 - clearml.Auto-Scaler - INFO - --- Cloud instances (0): 
2025-01-22 10:54:27,697 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:55:32
2025-01-22 10:55:28,230 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:55:28,235 - clearml.Auto-Scaler - INFO - Spinning new instance resource='pe-jobs', prefix='dynamic_aws', queue='default'
2025-01-22 10:55:28,236 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2025-01-22 10:55:28,237 - clearml.Auto-Scaler - INFO - monitor spots started
2025-01-22 10:55:28,248 - clearml.Auto-Scaler - INFO - Creating spot instance for resource pe-jobs
2025-01-22 10:55:28,605 - clearml.Auto-Scaler - INFO - --- Cloud instances (0): 
2025-01-22 10:55:28,834 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:55:48
2025-01-22 10:55:46,908 - clearml.Auto-Scaler - INFO - New instance i-00bf957a948e2a52f listening to default queue
2025-01-22 12:56:33
2025-01-22 10:56:29,571 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:56:29,997 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 10:56:30,423 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:57:34
2025-01-22 10:57:31,205 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:57:31,566 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 10:57:32,028 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:58:35
2025-01-22 10:58:32,726 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:58:33,118 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 10:58:33,331 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:59:35
2025-01-22 10:59:34,016 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:59:34,407 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 10:59:34,789 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:00:36
2025-01-22 11:00:35,800 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:00:36,189 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 13:00:41
2025-01-22 11:00:36,733 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:01:42
2025-01-22 11:01:37,533 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:01:37,918 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 11:01:38,141 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:02:32
2025-01-22 11:02:31,172 - clearml.Auto-Scaler - INFO - Worker 'dynamic_aws:pe-jobs:g4dn.xlarge:i-00bf957a948e2a52f' does not have an active task
2025-01-22 11:02:31,172 - clearml.Auto-Scaler - WARNING - The following instances have crashed:
* i-00bf957a948e2a52f
2025-01-22 13:02:42
2025-01-22 11:02:38,653 - clearml.Auto-Scaler - INFO - Spinning down stuck worker dynamic_aws:pe-jobs:g4dn.xlarge:i-00bf957a948e2a52f from stale_spun
2025-01-22 11:02:39,225 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'dynamic_aws:pe-jobs:g4dn.xlarge:i-00bf957a948e2a52f'
2025-01-22 11:02:39,421 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:02:39,427 - clearml.Auto-Scaler - INFO - Spinning new instance resource='pe-jobs', prefix='dynamic_aws', queue='default'
2025-01-22 11:02:39,427 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2025-01-22 11:02:39,537 - clearml.Auto-Scaler - INFO - Creating spot instance for resource pe-jobs
2025-01-22 11:02:39,902 - clearml.Auto-Scaler - INFO - --- Cloud instances (0): 
2025-01-22 11:02:40,129 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:03:02
2025-01-22 11:02:58,177 - clearml.Auto-Scaler - INFO - New instance i-0c2be2fda2dd959f9 listening to default queue
2025-01-22 13:03:43
2025-01-22 11:03:40,807 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:03:41,222 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-0c2be2fda2dd959f9 (spot)
2025-01-22 11:03:41,450 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:04:44
2025-01-22 11:04:42,247 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:04:42,631 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-0c2be2fda2dd959f9 (spot)
2025-01-22 11:04:42,842 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds

Looks like it's using Python 2...
I manually launched an instance from the same AMI, SSH'ed into it, and ran:
$ python --version
Python 3.7.6
$ which python
/home/ubuntu/anaconda3/bin/python

Could it be something related to the Docker image?

  
  
Posted 2 months ago

I added these commands to the init script and found that it's true: the environment uses Python 2.7, in a venv that the ClearML agent created:

[  276.668028] cloud-init[1895]: -su: /usr/bin/bash: No such file or directory
[  276.670523] cloud-init[1895]: + python --version
[  276.671649] cloud-init[1895]: Python 2.7.17
[  276.672276] cloud-init[1895]: + which python
[  276.673469] cloud-init[1895]: /clearml_agent_venv/bin/python
  
  
Posted 2 months ago

I manually launched an instance from the same AMI, SSH'ed into it, and ran:
$ python --version
Python 3.7.6
$ which python
/home/ubuntu/anaconda3/bin/python

This instance also has both Python 2 and Python 3:
$ /usr/bin/python3 --version
Python 3.6.9

$ /usr/bin/python --version
Python 2.7.17

  
  
Posted 2 months ago

Python 2 is no longer supported. I'd suggest finding an AMI that already has Python 3 built in (or installing it via the init script, though that's not recommended), and also one with CUDA enabled, so you can avoid that installation when supporting CUDA images.
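If you do go the init-script route despite that caveat, a minimal sketch for an Ubuntu-based AMI (an assumption; the ubuntu18.04 CUDA image in your log suggests Ubuntu) would be something like:

# Not recommended, per the above; prefer an AMI with Python 3 preinstalled.
apt-get update
apt-get install -y python3 python3-pip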

  
  
Posted 2 months ago

The full log of the EC2 instance, like the one you provided earlier, but from an instance launched after you've added the init script I mentioned to the autoscaler (stop the running one, clone it, and make the change).

  
  
Posted 2 months ago

and add the log of the machine please

what do you mean?

  
  
Posted 2 months ago

CloudyWalrus66 I have a feeling something is wrong with the instance script. Can you find the User Data for this instance and paste it here? (You'll find it in the AWS Instances management screen when you right-click the instance, under one of the options.)
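For reference, you can also pull the User Data with the AWS CLI (a sketch, assuming the CLI is configured for that account; replace <instance-id> with the instance from your log):

$ aws ec2 describe-instance-attribute --instance-id <instance-id> --attribute userData \
    --query 'UserData.Value' --output text | base64 --decode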

  
  
Posted 2 months ago

The default use case is that we run our code on this AMI in a predefined conda environment.

  
  
Posted 2 months ago

Also, can you provide the configuration of the autoscaler? You can export it through the web UI; just make sure to scrub off any credentials.
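For reference, the open-source aws_autoscaler example uses a configuration roughly like the sketch below (field names are from that example and may not match the web UI export exactly; the resource name and instance type are taken from your log, everything else is a placeholder, and the credential fields are the ones to scrub):

resource_configurations:
  pe-jobs:
    instance_type: g4dn.xlarge
    is_spot: true
    ami_id: ami-xxxxxxxxxxxxxxxxx
    ebs_device_name: /dev/sda1
    ebs_volume_size: 100
    ebs_volume_type: gp3
queues:
  default:
    - [pe-jobs, 1]
extra_vm_bash_script: |
  python3 -m pip install -U clearml-agent
extra_clearml_conf: ""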

  
  
Posted 2 months ago

Hi CloudyWalrus66, can you provide the full log of the EC2 instance?

  
  
Posted 2 months ago

Sure, here it is:

  
  
Posted 2 months ago

Also try adding the following to the bash init script:

python -m pip install -U clearml-agent

and add the log of the machine please
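For what it's worth, a slightly fuller version of that init-script addition (just a sketch; it assumes the AMI exposes Python 3 as python3 system-wide, not only inside the ubuntu user's conda env) could be:

# Log which interpreter cloud-init actually sees, then install the agent under Python 3.
which python3 && python3 --version
python3 -m pip install -U clearml-agent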

  
  
Posted 2 months ago

The AMI you used, does it have Python preinstalled?

  
  
Posted 2 months ago