subnet isn't supported as-is in the autoscaler, but you can add it using extra_configurations
the following way: extra_configurations = {'SubnetId': <subnet-id>}
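To give a rough idea of why that works (a minimal sketch, not the actual aws_auto_scaler.py code, and the variable names here are made up): as far as I remember, the autoscaler builds a launch specification for boto3's run_instances and merges whatever you put in extra_configurations on top of it, so any argument run_instances accepts, like SubnetId, gets passed through:
import boto3

# all values below are placeholders for illustration only
resource_conf = {
    "instance_type": "p3.2xlarge",
    "ami_id": "ami-04c0416d6bd8e4b1f",
}
extra_configurations = {"SubnetId": "<subnet-id>"}

launch_specification = {
    "ImageId": resource_conf["ami_id"],
    "InstanceType": resource_conf["instance_type"],
    "MinCount": 1,
    "MaxCount": 1,
}
# whatever is in extra_configurations is merged into the request, so SubnetId
# (or any other run_instances argument) ends up in the boto3 call
launch_specification.update(extra_configurations)

ec2 = boto3.client("ec2")
ec2.run_instances(**launch_specification)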
` resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
    }
}
queues {
    aws_a100 = [["A100", 15]]
}
extra_trains_conf = """
agent.package_manager.system_site_packages = true
agent.package_manager.pip_version = "==20.2.3"
"""
extra_vm_bash_script = """
sudo apt-get install -y libsm6 libxext6 libxrender-dev
python3 -m pip install pip==20.2.3
python3 -m pip install "urllib3>=1.25.4"
python3 -m pip install "opencv-python>=4.1.1.1"
python3 -m pip install PyYAML==5.2
python3 -m pip install "scipy>=1.2.1"
python3 -m pip install albumentations==0.4.3
python3 -m pip install numpy
python3 -m pip install pillow
python3 -m pip install imageio
python3 -m pip install tqdm
python3 -m pip install pandas
python3 -m pip install click
python3 -m pip install tensorboard
python3 -m pip install pyjwt==1.7.1
python3 -m pip install git+
python3 -m pip install git+
# Workaround to cache git credentials so that the agent can clone private dependencies as well
mkdir -p ~/.git/credential
chmod 0700 ~/.git/credential
git config --global credential.helper 'cache --socket ~/.git/credential/socket'
sudo git config --system credential.helper 'cache --socket ~/.git/credential/socket'
export TRAINS_LOG_ENVIRONMENT=""
export TRAINS_AGENT_GIT_USER="XYZ"
export TRAINS_AGENT_GIT_PASS="XYZ"
export CUSTOM_VAR="CUSTOM_VAL"
""" `
If it does appear in the UI eventually, then it's only a matter of waiting. If you still don't see the instance, I'd suggest you ssh into it and investigate a bit what's going on
Thanks for your answer! I am in the process of adding subnet_id/security_group_ids/key_name to the config to be able to ssh into the machine, will keep you informed 😄
If I remove security_group_ids
and just leave subnet_id
in the configuration, it is not taken into account (the instances are created in the default subnet)
Can you send me your configurations? I want to make sure there's nothing we're missing there.
(without the actual keys and secrets of course)
BTW, is there any specific reason for not upgrading to clearml? 🙂
Still getting the same error, it is not taken into account 🤔
For some reason the configuration object gets updated at runtime to: resource_configurations = null, queues = null, extra_trains_conf = "", extra_vm_bash_script = ""
extra_configurations = {'SubnetId': "<subnet-id>"}
with brackets, right?
So I updated the config with:
resource_configurations {
    A100 {
        instance_type = "p3.2xlarge"
        is_spot = false
        availability_zone = "us-east-1b"
        ami_id = "ami-04c0416d6bd8e4b1f"
        ebs_device_name = "/dev/xvda"
        ebs_volume_size = 100
        ebs_volume_type = "gp3"
        key_name = "<my-key-name>"
        security_group_ids = ["<my-sg-id>"]
        subnet_id = "<my-subnet-id>"
    }
}
but I get in the logs of the autoscaler: Warning! exception occurred: An error occurred (InvalidParameter) when calling the RunInstances operation: Security group <my-sg-id> and subnet <default-subnet-id> belong to different networks. Retry in 15 seconds
So it doesn't take into account the subnet-id, puts the instances in the default subnet and fails since the sg doesn't belong to the default subnet's network. Is it a bug?
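Before assuming it's a bug, one quick sanity check (just a sketch with placeholder ids) is to confirm with boto3 that the security group and the subnet really live in the same VPC:
import boto3

ec2 = boto3.client("ec2")

# placeholder ids - replace with the actual security group / subnet ids
sg = ec2.describe_security_groups(GroupIds=["<my-sg-id>"])["SecurityGroups"][0]
subnet = ec2.describe_subnets(SubnetIds=["<my-subnet-id>"])["Subnets"][0]

print("security group VPC:", sg["VpcId"])
print("subnet VPC:        ", subnet["VpcId"])
# if the two VpcIds differ, RunInstances fails exactly like the log above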
Can you check which trains version appears under the package requirements for the autoscaler?
As an example you can ssh to it and try running trains-agent manually to see if it's installed and if it fails for some reason.
Probably something's wrong with the instance, which AMI did you use? The default one?
The default one doesn't exist / isn't accessible anymore, so I replaced it with the one shown on the NVIDIA Deep Learning AMI marketplace page https://aws.amazon.com/marketplace/pp/B076K31M1S?qid=1610377938050&sr=0-1&ref_=srh_res_product_title that is: ami-04c0416d6bd8e4b1f
Ha I see, it is not supported by the autoscaler > https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125
extra_configurations = {"SubnetId": "<subnet-id>"}
That fixed it 😄
ok, so first, since you have many installations in your bash script, it does make sense that installation would take a long time (note that the agent will only start running after all installations are done)
So for the sake of debugging I'd suggest removing all the packages (other than the specific trains-agent version you're using) and trying again; add those packages to the task you are trying to run and you should see the instance much faster.
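For example (a sketch, assuming a reasonably recent trains version where Task.add_requirements is available; the project/task names below are made up), you can declare the packages on the task itself instead of in the VM bash script:
from trains import Task

# declare the packages on the task; the agent installs them when it picks the task up
Task.add_requirements("PyYAML", "5.2")
Task.add_requirements("albumentations", "0.4.3")
Task.add_requirements("opencv-python", ">=4.1.1.1")

task = Task.init(project_name="examples", task_name="a100-training")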
(Btw the instance listed in the console has no name, is it normal?)
I waited 20 mins, refreshing the logs every 2 mins.
Sounds like more than enough
ok that's odd.
Anyway try setting extra_configurations = {"SubnetId": "<subnet-id>"}
instead of: extra_configurations = {'SubnetId': "<subnet-id>"}
security_group_ids = ["<sec_group_id>"]
(note that I had a typo: it's the id, not the name, don't want to misguide you!)
I'll try with that; https://github.com/allegroai/clearml/compare/master...H4dr1en:add-aws-params
BTW, is there any specific reason for not upgrading to clearml?
I just didn't have time so far 🙂