So I updated the config with:
` resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
        key_name = "<my-key-name>"
        security_group_ids = ["<my-sg-id>"]
        subnet_id = "<my-subnet-id>"
    }
} `
but I get in the logs of the autoscaler: Warning! exception occurred: An error occurred (InvalidParameter) when calling the RunInstances operation: Security group <my-sg-id> and subnet <default-subnet-id> belong to different networks. Retry in 15 seconds
So it doesn't take the subnet_id into account, puts the instances in the default subnet, and fails since the sg is not available for the default subnet. Is it a bug?
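(Side note for anyone debugging the same thing: assuming the AWS CLI is configured, one way to confirm the mismatch is to compare the VPC of the security group with the VPC of the default subnet named in the error; IDs below are placeholders.)
` aws ec2 describe-security-groups --group-ids <my-sg-id> \
    --query 'SecurityGroups[0].VpcId' --output text
aws ec2 describe-subnets --subnet-ids <default-subnet-id> \
    --query 'Subnets[0].VpcId' --output text
# The error above means these two VPC IDs differ, i.e. the security group
# lives in a different VPC than the default subnet the instance was put in. `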
ok, so first, since you have many installations in your bash script, it does make sense that installation would take a long time (note that the agent will only start running after all installations are done)
So for the sake of debugging I'd suggest removing all the packages (other than the specific trains-agent that you're using) and trying again; add those packages to the task you are trying to run instead, and you should see the instance much faster.
Probably something's wrong with the instance. Which AMI did you use? The default one?
extra_configurations = {'SubnetId': "<subnet-id>"}
with brackets right?
(Btw the instance listed in the console has no name, is it normal?)
If it does appear in the UI faster, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you ssh to the instance and investigate a bit what's going on
Thanks for your answer! I am in the process of adding subnet_id/security_groups_id/key_name to the config to be able to ssh into the machine, will keep you informed 😄
If I remove security_group_ids
and just leave subnet_id
in the configuration, it is not taken into account (the instances are created in the default subnet)
For example, you can ssh to it and try running trains-agent manually to see if it's installed and whether it fails for some reason.
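Something along these lines (key and host are placeholders, the login user depends on the AMI, e.g. ubuntu vs ec2-user, and the cloud-init log path assumes an Ubuntu-based image):
` ssh -i <my-key>.pem ubuntu@<instance-public-ip>

# on the instance: did the init/bash script finish, and is the agent installed?
tail -n 50 /var/log/cloud-init-output.log
which trains-agent
trains-agent daemon --queue aws_a100 `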
BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far 🙂
I waited 20 mins, refreshing the logs every 2 mins.
Sounds like more than enough
Ha I see, it is not supported by the autoscaler > https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125
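For context, here's a hand-rolled illustration against the raw EC2 API (not the autoscaler's own code): at the RunInstances level the subnet is simply one more parameter next to the ones the autoscaler already forwards, which is why a pass-through like extra_configurations can carry it. All values are placeholders:
` aws ec2 run-instances \
    --image-id ami-04c0416d6bd8e4b1f \
    --instance-type p3.2xlarge \
    --count 1 \
    --key-name <my-key-name> \
    --security-group-ids <my-sg-id> \
    --subnet-id <my-subnet-id> `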
For some reason the configuration object gets updated at runtime to:
` resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = "" `
Can you send me your configurations? I want to make sure there's nothing we're missing there.
(without the actual keys and secrets of course)
ok that's odd.
Anyway try setting extra_configurations = {"SubnetId": "<subnet-id>"}
instead of: extra_configurations = {'SubnetId': "<subnet-id>"}
I'll try with that; https://github.com/allegroai/clearml/compare/master...H4dr1en:add-aws-params
Can you check which trains version appears under the package requirements for the autoscaler?
Still getting the same error, it is not taken into account 🤔
` resource_configurations {
A100 {
instance_type = "p3.2xlarge"
is_spot = false
availability_zone = "us-east-1b"
ami_id = "ami-04c0416d6bd8e4b1f"
ebs_device_name = "/dev/xvda"
ebs_volume_size = 100
ebs_volume_type = "gp3"
}
}
queues {
aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
sudo apt-get install -y libsm6 libxext6 libxrender-dev
python3 -m pip install pip==20.2.3
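# quote version specifiers below so the shell doesn't treat '>' as an output redirection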
python3 -m pip install "urllib3>=1.25.4"
python3 -m pip install "opencv-python>=4.1.1.1"
python3 -m pip install PyYAML==5.2
python3 -m pip install "scipy>=1.2.1"
python3 -m pip install albumentations==0.4.3
python3 -m pip install numpy
python3 -m pip install pillow
python3 -m pip install imageio
python3 -m pip install tqdm
python3 -m pip install pandas
python3 -m pip install click
python3 -m pip install tensorboard
python3 -m pip install pyjwt==1.7.1
python3 -m pip install git+
python3 -m pip install git+
# Workaround to cache git credentials so that the agent can clone private dependencies as well
mkdir -p ~/.git/credential
chmod 0700 ~/.git/credential
git config --global credential.helper 'cache --socket ~/.git/credential/socket'
sudo git config --system credential.helper 'cache --socket ~/.git/credential/socket'
export TRAINS_LOG_ENVIRONMENT=""
export TRAINS_AGENT_GIT_USER="XYZ"
export TRAINS_AGENT_GIT_PASS="XYZ"
export CUSTOM_VAR="CUSTOM_VAL"
""" `
security_group_ids = ["<sec_group_id>"]
(note that I had a typo: it's the id, not the name, don't want to misguide you!)
extra_configurations = {"SubnetId": "<subnet-id>"}
That fixed it 😄
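For anyone landing on this later, the working combination looks roughly like this (account-specific values are placeholders; subnet_id inside the resource block is dropped since it's ignored, and the subnet goes in via extra_configurations, with double quotes):
` resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
        key_name = "<my-key-name>"
        security_group_ids = ["<my-sg-id>"]
    }
}
extra_configurations = {"SubnetId": "<my-subnet-id>"} `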
Probably something's wrong with the instance. Which AMI did you use? The default one?
The default one doesn't exist / isn't accessible anymore, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title that is: ami-04c0416d6bd8e4b1f
Subnet isn't supported as-is in the autoscaler, but you can add it using extra_configurations
the following way: extra_configurations = {'SubnetId': "<subnet-id>"}