Answered
Hi, is it possible to pass environment variables to agents created by the AWS AutoScaler service?

  
  
Posted 3 years ago

Answers 30


So I updated the config with:
resource_configurations {
A100 {
instance_type = "p3.2xlarge"
is_spot = false
availability_zone = "us-east-1b"
ami_id = "ami-04c0416d6bd8e4b1f"
ebs_device_name = "/dev/xvda"
ebs_volume_size = 100
ebs_volume_type = "gp3"
key_name = "<my-key-name>"
security_group_ids = ["<my-sg-id>"]
subnet_id = "<my-subnet-id>"
}
}
but I get in the logs of the autoscaler:
Warning! exception occurred: An error occurred (InvalidParameter) when calling the RunInstances operation: Security group <my-sg-id> and subnet <default-subnet-id> belong to different networks. Retry in 15 seconds
So it doesn't take subnet_id into account: it puts the instances in the default subnet and fails, since the security group is not available in the default subnet. Is it a bug?
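
The RunInstances error above says the security group and the subnet that was actually used are in different networks (VPCs). A minimal sketch for checking this up front, assuming you already have dictionaries shaped like the entries that boto3's `describe_subnets` and `describe_security_groups` return (both carry a `VpcId` field). The helper name `same_vpc` and the sample IDs are hypothetical:

```python
def same_vpc(subnet: dict, security_groups: list) -> bool:
    """Return True if every security group belongs to the subnet's VPC.

    `subnet` is one entry of describe_subnets()["Subnets"];
    `security_groups` is describe_security_groups()["SecurityGroups"].
    """
    return all(sg["VpcId"] == subnet["VpcId"] for sg in security_groups)


# Hypothetical sample data shaped like the EC2 API responses.
subnet = {"SubnetId": "subnet-aaa", "VpcId": "vpc-111"}
groups = [{"GroupId": "sg-bbb", "VpcId": "vpc-222"}]

print(same_vpc(subnet, groups))  # False: the group lives in another VPC
```

If this returns False for the subnet the instance ends up in, RunInstances will reject the launch in the way shown in the log.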

  
  
Posted 3 years ago

OK, so first: since your bash script has many installations, it makes sense that setup takes a long time (note that the agent only starts running after all installations are done).
For the sake of debugging, I'd suggest removing all the packages (other than the specific trains-agent you're using) and trying again. Add those packages to the task you are trying to run instead, and you should see the instance much faster.

  
  
Posted 3 years ago

Probably something's wrong with the instance. Which AMI did you use? The default one?

  
  
Posted 3 years ago

extra_configurations = {'SubnetId': "<subnet-id>"}
With brackets, right?

  
  
Posted 3 years ago

(Btw, the instance listed in the console has no name. Is that normal?)

  
  
Posted 3 years ago

But we can easily extend, right?

  
  
Posted 3 years ago

I get the following error:

  
  
Posted 3 years ago

yes

  
  
Posted 3 years ago

If it does appear in the UI faster, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you ssh to the instance and investigate a bit what's going on.

  
  
Posted 3 years ago

Thanks for your answer! I am in the process of adding subnet_id/security_group_ids/key_name to the config to be able to ssh into the machine, will keep you informed 😄

  
  
Posted 3 years ago

If I remove security_group_ids and leave just subnet_id in the configuration, it is not taken into account (the instances are created in the default subnet)

  
  
Posted 3 years ago

As an example you can ssh to it and try running trains-agent manually to see if it's installed and if it fails for some reason.

  
  
Posted 3 years ago

BTW, is there any specific reason for not upgrading to clearml?

I just didn't have time so far 🙂

  
  
Posted 3 years ago

I waited 20 mins, refreshing the logs every 2 mins.

Sounds like more than enough

  
  
Posted 3 years ago

Also, can you send the entire log?

  
  
Posted 3 years ago

Great!

  
  
Posted 3 years ago

For some reason the configuration object gets updated at runtime to
resource_configurations = null
queues = null
extra_trains_conf = ""
extra_vm_bash_script = ""

  
  
Posted 3 years ago

Can you send me your configurations? I want to make sure there's nothing we're missing there.
(without the actual keys and secrets of course)

  
  
Posted 3 years ago

ok that's odd.
Anyway try setting
extra_configurations = {"SubnetId": "<subnet-id>"}
instead of:
extra_configurations = {'SubnetId': "<subnet-id>"}
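
One possible explanation for why the quoting matters, assuming the value is parsed with JSON-style rules (an assumption; the thread does not say which parser the autoscaler uses): JSON only accepts double-quoted strings. A quick check with Python's standard `json` module:

```python
import json

double_quoted = '{"SubnetId": "<subnet-id>"}'
single_quoted = "{'SubnetId': \"<subnet-id>\"}"

# Double-quoted keys parse into a regular dict.
print(json.loads(double_quoted))

try:
    json.loads(single_quoted)
except json.JSONDecodeError as err:
    # Single-quoted keys are not valid JSON, so parsing fails here.
    print("rejected:", err.msg)
```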

  
  
Posted 3 years ago

Can you check which trains version appears under the package requirements for the autoscaler?

  
  
Posted 3 years ago

BTW, is there any specific reason for not upgrading to clearml? 🙂

  
  
Posted 3 years ago

Still getting the same error, it is not taken into account 🤔

  
  
Posted 3 years ago

` resource_configurations {
A100 {
instance_type = "p3.2xlarge"
is_spot = false
availability_zone = "us-east-1b"
ami_id = "ami-04c0416d6bd8e4b1f"
ebs_device_name = "/dev/xvda"
ebs_volume_size = 100
ebs_volume_type = "gp3"
}
}

queues {
aws_a100 = [["A100", 15]]
}

extra_trains_conf = """
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
"""

extra_vm_bash_script = """

sudo apt-get install -y libsm6 libxext6 libxrender-dev

python3 -m pip install pip==20.2.3
python3 -m pip install "urllib3>=1.25.4"
python3 -m pip install "opencv-python>=4.1.1.1"
python3 -m pip install PyYAML==5.2
python3 -m pip install "scipy>=1.2.1"
python3 -m pip install albumentations==0.4.3
python3 -m pip install numpy
python3 -m pip install pillow
python3 -m pip install imageio
python3 -m pip install tqdm
python3 -m pip install pandas
python3 -m pip install click
python3 -m pip install tensorboard
python3 -m pip install pyjwt==1.7.1
python3 -m pip install git+
python3 -m pip install git+

# Workaround to cache git credentials so that the agent can clone private dependencies as well

mkdir -p ~/.git/credential
chmod 0700 ~/.git/credential
git config --global credential.helper 'cache --socket ~/.git/credential/socket'
sudo git config --system credential.helper 'cache --socket ~/.git/credential/socket'

export TRAINS_LOG_ENVIRONMENT=""
export TRAINS_AGENT_GIT_USER="XYZ"
export TRAINS_AGENT_GIT_PASS="XYZ"
export CUSTOM_VAR="CUSTOM_VAL"
""" `

  
  
Posted 3 years ago
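
The `export` lines at the end of `extra_vm_bash_script` in the configuration above appear to be how this setup hands environment variables to the instance before the agent starts. If the variables come from a Python dictionary, a small helper can generate those lines with safe shell quoting. This is a sketch only; `render_exports` is a hypothetical name, not part of trains or ClearML:

```python
import shlex


def render_exports(env: dict) -> str:
    """Render one `export NAME=value` line per variable, shell-quoting values."""
    return "\n".join(
        f"export {name}={shlex.quote(str(value))}" for name, value in env.items()
    )


# Variable names taken from the configuration above.
print(render_exports({"CUSTOM_VAR": "CUSTOM_VAL", "TRAINS_LOG_ENVIRONMENT": ""}))
```

The resulting text can be pasted into (or appended to) the `extra_vm_bash_script` value. Quoting with `shlex.quote` keeps values containing spaces or shell metacharacters from being split or interpreted by bash.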

security_group_ids = ["<sec_group_id>"] (note that I had a typo: it's the ID, not the name. I don't want to mislead you!)

  
  
Posted 3 years ago

extra_configurations = {"SubnetId": "<subnet-id>"}

That fixed it 😄

  
  
Posted 3 years ago

trains==0.16.4

  
  
Posted 3 years ago

Probably something's wrong with the instance. Which AMI did you use? The default one?

The default one no longer exists or is no longer accessible, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title that is: ami-04c0416d6bd8e4b1f

  
  
Posted 3 years ago

Subnet isn't supported as-is in the autoscaler, but you can add it using extra_configurations in the following way:
extra_configurations = {'SubnetId': <subnet-id>}
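
To illustrate what `extra_configurations` is for, here is a sketch of how extra keys could be merged into the arguments of an EC2 `RunInstances` call. The function `build_launch_kwargs` and the merge logic are assumptions for illustration, not the autoscaler's actual code; only the parameter names (`ImageId`, `InstanceType`, `MinCount`, `MaxCount`, `SubnetId`) come from the EC2 API:

```python
from typing import Optional


def build_launch_kwargs(
    resource_conf: dict, extra_configurations: Optional[dict] = None
) -> dict:
    """Build keyword arguments for an EC2 RunInstances call (hypothetical helper)."""
    kwargs = {
        "ImageId": resource_conf["ami_id"],
        "InstanceType": resource_conf["instance_type"],
        "MinCount": 1,
        "MaxCount": 1,
    }
    # Extra keys are passed through untouched, so they must use the
    # EC2 API spelling (e.g. "SubnetId", not "subnet_id").
    kwargs.update(extra_configurations or {})
    return kwargs


conf = {"ami_id": "ami-04c0416d6bd8e4b1f", "instance_type": "p3.2xlarge"}
print(build_launch_kwargs(conf, {"SubnetId": "<subnet-id>"}))
```

Under this reading, a key such as `subnet_id` placed directly in the resource block would simply never reach the API call, which would match the behaviour reported in this thread.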

  
  
Posted 3 years ago