this time it got stuck...
2025-01-22 12:54:32
2025-01-22 10:54:27,220 - clearml.Auto-Scaler - INFO - --- Cloud instances (0):
2025-01-22 10:54:27,697 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:55:32
2025-01-22 10:55:28,230 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:55:28,235 - clearml.Auto-Scaler - INFO - Spinning new instance resource='pe-jobs', prefix='dynamic_aws', queue='default'
2025-01-22 10:55:28,236 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2025-01-22 10:55:28,237 - clearml.Auto-Scaler - INFO - monitor spots started
2025-01-22 10:55:28,248 - clearml.Auto-Scaler - INFO - Creating spot instance for resource pe-jobs
2025-01-22 10:55:28,605 - clearml.Auto-Scaler - INFO - --- Cloud instances (0):
2025-01-22 10:55:28,834 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:55:48
2025-01-22 10:55:46,908 - clearml.Auto-Scaler - INFO - New instance i-00bf957a948e2a52f listening to default queue
2025-01-22 12:56:33
2025-01-22 10:56:29,571 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:56:29,997 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 10:56:30,423 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:57:34
2025-01-22 10:57:31,205 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:57:31,566 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 10:57:32,028 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:58:35
2025-01-22 10:58:32,726 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:58:33,118 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 10:58:33,331 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 12:59:35
2025-01-22 10:59:34,016 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 10:59:34,407 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 10:59:34,789 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:00:36
2025-01-22 11:00:35,800 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:00:36,189 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 13:00:41
2025-01-22 11:00:36,733 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:01:42
2025-01-22 11:01:37,533 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:01:37,918 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-00bf957a948e2a52f (spot)
2025-01-22 11:01:38,141 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:02:32
2025-01-22 11:02:31,172 - clearml.Auto-Scaler - INFO - Worker 'dynamic_aws:pe-jobs:g4dn.xlarge:i-00bf957a948e2a52f' does not have an active task
2025-01-22 11:02:31,172 - clearml.Auto-Scaler - WARNING - The following instances have crashed:
* i-00bf957a948e2a52f
2025-01-22 13:02:42
2025-01-22 11:02:38,653 - clearml.Auto-Scaler - INFO - Spinning down stuck worker dynamic_aws:pe-jobs:g4dn.xlarge:i-00bf957a948e2a52f from stale_spun
2025-01-22 11:02:39,225 - clearml.Auto-Scaler - INFO - Stuck worker spun down: 'dynamic_aws:pe-jobs:g4dn.xlarge:i-00bf957a948e2a52f'
2025-01-22 11:02:39,421 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:02:39,427 - clearml.Auto-Scaler - INFO - Spinning new instance resource='pe-jobs', prefix='dynamic_aws', queue='default'
2025-01-22 11:02:39,427 - clearml.Auto-Scaler - INFO - spinning up worker without specific subnet or availability zone
2025-01-22 11:02:39,537 - clearml.Auto-Scaler - INFO - Creating spot instance for resource pe-jobs
2025-01-22 11:02:39,902 - clearml.Auto-Scaler - INFO - --- Cloud instances (0):
2025-01-22 11:02:40,129 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:03:02
2025-01-22 11:02:58,177 - clearml.Auto-Scaler - INFO - New instance i-0c2be2fda2dd959f9 listening to default queue
2025-01-22 13:03:43
2025-01-22 11:03:40,807 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:03:41,222 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-0c2be2fda2dd959f9 (spot)
2025-01-22 11:03:41,450 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
2025-01-22 13:04:44
2025-01-22 11:04:42,247 - clearml.Auto-Scaler - INFO - Found 1 tasks in queue 'default'
2025-01-22 11:04:42,631 - clearml.Auto-Scaler - INFO - --- Cloud instances (1): i-0c2be2fda2dd959f9 (spot)
2025-01-22 11:04:42,842 - clearml.Auto-Scaler - INFO - Idle for 60.00 seconds
looks like it's using python2...
I manually launched an instance from the same AMI, ssh'ed into it and ran:
$ python --version
Python 3.7.6
$ which python
/home/ubuntu/anaconda3/bin/python
could it be something related to the docker image?
I added these commands to the init script and found that it's true: the environment uses Python 2.7, in a venv that the clearml agent created:
[ 276.668028] cloud-init[1895]: -su: /usr/bin/bash: No such file or directory
[ 276.670523] cloud-init[1895]: + python --version
[ 276.671649] cloud-init[1895]: Python 2.7.17
[ 276.672276] cloud-init[1895]: + which python
[ 276.673469] cloud-init[1895]: /clearml_agent_venv/bin/python
the manually launched instance (same AMI) also has both python 2 and python 3:
$ /usr/bin/python3 --version
Python 3.6.9
$ /usr/bin/python --version
Python 2.7.17
Python 2 is no longer supported. I'd suggest finding an AMI that already has python3 built in (or installing it using the init script, though that's not recommended) and that is also CUDA enabled, to avoid having to install CUDA in order to support cuda images
I mean the full log of the ec2 instance, like the one you provided earlier, but from an instance launched after you've added the init script I mentioned to the autoscaler (stop the running one, clone it and make the change)
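For reference, a minimal sketch of how the init script could check which interpreter it is about to use, instead of relying on whatever `python` resolves to (2.7.17 inside `/clearml_agent_venv` in the cloud-init output above). The helper name `find_python3` is hypothetical and not part of ClearML, and the sketch assumes a POSIX `sh` is available on the instance:

```shell
#!/bin/sh
# Hypothetical helper (not part of ClearML): print the path of the first
# interpreter on PATH that reports major version 3, or return 1 if none does.
find_python3() {
    for candidate in python3 python; do
        # Skip names that are not on PATH at all.
        interp=$(command -v "$candidate" 2>/dev/null) || continue
        # Ask the interpreter itself for its major version.
        major=$("$interp" -c 'import sys; print(sys.version_info[0])' 2>/dev/null) || continue
        if [ "$major" = "3" ]; then
            echo "$interp"
            return 0
        fi
    done
    return 1
}

if py=$(find_python3); then
    echo "Python 3 found at: $py"
    "$py" --version
else
    echo "No Python 3 interpreter on PATH" >&2
fi
```

If a Python 3 interpreter is found, the upgrade command suggested in this thread could then be run against it explicitly, e.g. `"$py" -m pip install -U clearml-agent`. Whether that is enough to fix the stuck worker here is not confirmed.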
and add the log of the machine please
what do you mean?
CloudyWalrus66 I have a feeling something is wrong with the instance script. Can you find the User Data for this instance and paste it here? (you'll find it in the AWS Instances management screen when you right-click the instance, under one of the options)
the default use case is that we run our code on this AMI in a predefined conda environment
Also, can you provide the configuration of the autoscaler? You can export it through the web UI, just make sure to scrub any credentials
Hi CloudyWalrus66, can you provide the full log of the ec2 instance?
Also try adding the following to the bash init script
python -m pip install -U clearml-agent
and add the log of the machine please
Does the AMI you used have python preinstalled?