We're trying to use the AWS autoscaler and have managed to get it up and running with spinning up instances. However, it does not seem to pull any of the tasks for the remote instances. We see it gets...

Answered

We're trying to use the AWS autoscaler and have managed to get it up and running to the point where it spins up instances. However, the remote instances do not seem to pull any tasks. We see that it gets task_id=None, for example:

2022-01-25 10:00:28,362 - clearml.auto_scaler - INFO - Spinning new instance resource='aws4cpu', prefix='dynamic_aws', queue='aws_poc_1', task_id=None

  				
Posted 3 years ago by UnevenDolphin73


Answers 29

We're still working these quirks out. But one issue after we changed the AMI is that the VPC (SubnetId?) was missing from the instance so it could not reach the ClearML API server.

I think maybe the autoscaler service is missing some additional settings...

  				
Posted 3 years ago by UnevenDolphin73

We just redeployed to use the 1.1.4 version as Jake suggested, so the logs are gone 😞

  				
Posted 3 years ago by UnevenDolphin73

Maybe it's the missing .bashrc file actually. I'll look into it.

  				
Posted 3 years ago by UnevenDolphin73

SuccessfulKoala55 TimelyPenguin76
After looking into it, I think it's because our AMI does not have Docker installed, and because the default instance suggested by the ClearML autoscaler example is outdated.
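
A quick way to confirm the "AMI has no Docker" theory is to check, on the instance itself, whether the tools the agent needs are on PATH. The following is only a minimal sketch: the function name `missing_tools` and the list of tools are assumptions made for illustration, not something taken from the ClearML autoscaler.

```python
import shutil
from typing import Callable, Iterable, List, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names in `tools` that cannot be found on PATH.

    `which` is injectable so the check can be exercised without
    depending on what is installed on the current machine.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # Hypothetical prerequisite list; adjust to what your setup needs.
    missing = missing_tools(["docker", "python3"])
    print("missing tools:", ", ".join(missing) if missing else "none")
```

Running this on a freshly launched instance (for example over SSH) would show whether Docker is actually absent from the image.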

  				
Posted 3 years ago by UnevenDolphin73

Hi UnevenDolphin73 ,

Try to rerun it so that a new instance is created. Under that instance's Actions menu you have "Monitor and troubleshoot", where you can select "Get system log".
I want to verify that your scaler doesn't have any failures in this log.

  				
Posted 3 years ago by TimelyPenguin76 (Administrator)

Just to figure out if that's related to some changes introduced in 1.1.5

  				
Posted 3 years ago by SuccessfulKoala55

Ah, probably https://github.com/allegroai/clearml/pull/534

Will try this out.

  				
Posted 3 years ago by UnevenDolphin73

Can you perhaps try an earlier version? Say ClearML SDK 1.1.4?

  				
Posted 3 years ago by SuccessfulKoala55

Well, the assumption is that your network is configured correctly. In any case, the correct way is probably to make sure the autoscaler tags the instances, and that your AWS configuration sets the correct rules according to predefined tags.
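
To illustrate what "tagging the instances" could look like at the EC2 API level, here is a rough sketch that builds a `TagSpecifications` value in the shape accepted by the boto3 EC2 client's `run_instances` call. The helper name and the tag keys and values are hypothetical, and this is not how the ClearML autoscaler itself is known to tag instances.

```python
from typing import Any, Dict, List


def build_tag_specifications(tags: Dict[str, str]) -> List[Dict[str, Any]]:
    """Build an EC2 `TagSpecifications` list that tags the launched instance.

    Tags are sorted by key so the output is deterministic.
    """
    return [
        {
            "ResourceType": "instance",
            "Tags": [
                {"Key": key, "Value": value}
                for key, value in sorted(tags.items())
            ],
        }
    ]


if __name__ == "__main__":
    # Hypothetical tag names, for illustration only.
    print(build_tag_specifications({"team": "ml", "managed-by": "clearml-autoscaler"}))
```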

  				
Posted 3 years ago by SuccessfulKoala55

From our IT dept:

Not really, when you launch the instance, the launch has to already be in the right VPC/Subnet. Configuration tools are irrelevant here.
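
To make the IT department's point concrete: with the EC2 API, the subnet (and therefore the VPC) is chosen in the launch request itself. The sketch below only builds the keyword arguments in the shape used by the boto3 EC2 client's `run_instances` call; it does not call AWS. The helper name and all IDs are placeholders, and whether the autoscaler version discussed here exposes such a setting is not established in this thread.

```python
from typing import Any, Dict, Sequence


def build_launch_kwargs(
    ami_id: str,
    instance_type: str,
    subnet_id: str,
    security_group_ids: Sequence[str],
    user_data: str = "",
) -> Dict[str, Any]:
    """Build `run_instances` keyword arguments that pin the subnet at launch."""
    if not subnet_id:
        # Without an explicit subnet, EC2 falls back to a default subnet
        # (if one exists), which may not be able to reach the API server.
        raise ValueError("subnet_id must be set so the instance lands in the right VPC")
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "SubnetId": subnet_id,
        "SecurityGroupIds": list(security_group_ids),
        "UserData": user_data,
    }


if __name__ == "__main__":
    # Placeholder IDs, for illustration only.
    kwargs = build_launch_kwargs(
        "ami-0123456789abcdef0", "m5.xlarge", "subnet-0abc1234", ["sg-0abc1234"]
    )
    print(kwargs)
```

With boto3 installed and credentials configured, these arguments could be passed as `boto3.client("ec2").run_instances(**kwargs)`.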

  				
Posted 3 years ago by UnevenDolphin73

Define how?

  				
Posted 3 years ago by SuccessfulKoala55

UnevenDolphin73

"Maybe it's the missing .bashrc file actually. I'll look into it."

That's actually something we intend to merge from the PR you've mentioned 🙂

  				
Posted 3 years ago by SuccessfulKoala55

Any thoughts SuccessfulKoala55 ?

  				
Posted 3 years ago by UnevenDolphin73

UnevenDolphin73 https://github.com/allegroai/clearml/pull/534 is still under investigation and I do not assume it will reveal a bug - the task_id should be inconsequential as long as the new instance is spinning with a provided queue - do you see the agent on that instance reporting in the "Queues and Workers" screen?
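
As a small aid for checking this against the autoscaler output, the sketch below parses the key=value pairs from a "Spinning new instance" log line like the one in the question, so you can confirm that a queue is set even when task_id is None. The parser and its name are an illustration written for this thread, not part of ClearML.

```python
import re
from typing import Dict, Optional

# Matches key='quoted value' or key=bare_word pairs.
_PAIR = re.compile(r"(\w+)=(?:'([^']*)'|(\w+))")


def parse_spin_up_line(line: str) -> Dict[str, Optional[str]]:
    """Extract key=value fields from an autoscaler log line.

    A bare `None` is converted to Python's None; quoted values stay strings.
    """
    fields: Dict[str, Optional[str]] = {}
    for key, quoted, bare in _PAIR.findall(line):
        if bare == "None":
            fields[key] = None
        else:
            fields[key] = bare if bare else quoted
    return fields


if __name__ == "__main__":
    line = (
        "2022-01-25 10:00:28,362 - clearml.auto_scaler - INFO - Spinning new instance "
        "resource='aws4cpu', prefix='dynamic_aws', queue='aws_poc_1', task_id=None"
    )
    fields = parse_spin_up_line(line)
    print("queue:", fields.get("queue"), "| task_id:", fields.get("task_id"))
```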

  				
Posted 3 years ago by SuccessfulKoala55

Well, the PR does not solve the issue; it basically tries to make the autoscaler behave differently (i.e. not pull from the queue, but always take a specific task ID).

  				
Posted 3 years ago by SuccessfulKoala55

We do not, CostlyFox64, but this is useful for the future 🙂 Thanks!
TimelyPenguin76 I'll have a look, one moment.

  				
Posted 3 years ago by UnevenDolphin73

Are you using the latest version of ClearML SDK?

  				
Posted 3 years ago by SuccessfulKoala55

If you have GPU autoscaling nodes in your k8s cluster already, you could also give the k8s glue agent a go https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L300 ?

With the correct tolerations/nodeselectors you can have k8s take care of the autoscaling for you by just spinning up a new pod

  				
Posted 3 years ago by CostlyFox64

The network is configured correctly 🙂 But the newly spun up instances need to be set to the same VPC/Subnet somehow

  				
Posted 3 years ago by UnevenDolphin73

Yup, latest version of ClearML SDK, and we're deployed on AWS using K8s helm

  				
Posted 3 years ago by UnevenDolphin73

Anything specific we should look into TimelyPenguin76 ?

  				
Posted 3 years ago by UnevenDolphin73

Hi UnevenDolphin73 ,

If the ec2 instance is up and running but no clearml-agent is running, something in the user data script failed.

Can you share the logs from the instance (you can send in DM if you like)?
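
Once you have the system log text (for example, copied from the EC2 console), a quick scan for common failure markers can point to the step of the user data script that broke. This is a rough sketch: the marker list and function name are assumptions chosen for illustration, and a clean scan does not prove the script succeeded.

```python
from typing import List, Sequence, Tuple

# Hypothetical markers; extend them to match what your user data script runs.
FAILURE_MARKERS = (
    "error",
    "failed",
    "command not found",
    "traceback",
    "permission denied",
)


def find_failures(
    log_text: str, markers: Sequence[str] = FAILURE_MARKERS
) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for log lines containing a failure marker.

    Matching is case-insensitive; line numbers start at 1.
    """
    hits: List[Tuple[int, str]] = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        lowered = line.lower()
        if any(marker in lowered for marker in markers):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Made-up log excerpt, for illustration only.
    sample = "cloud-init: running user data\nbash: docker: command not found\ndone"
    for number, line in find_failures(sample):
        print(f"{number}: {line}")
```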

  				
Posted 3 years ago by TimelyPenguin76 (Administrator)

In any case, we'll try and reproduce

  				
Posted 3 years ago by SuccessfulKoala55

... and any way to define the VPC is missing too 🤔

  				
Posted 3 years ago by UnevenDolphin73

and don't give up 🙂 - we'll make it work

  				
Posted 3 years ago by SuccessfulKoala55

No it does not show up. The instance spins up and then does nothing.

  				
Posted 3 years ago by UnevenDolphin73

Hey UnevenDolphin73 I'm also interested in this feature. Currently trying it out

  				
Posted 3 years ago by RattyPanda61

I'll see if we can do that still (as the queue name suggests, this was a POC, so I'm trying to fix things before they give up 😛 ).
Any other thoughts? The original thread https://clearml.slack.com/archives/CTK20V944/p1641490355015400 suggests this PR solved the issue

  				
Posted 3 years ago by UnevenDolphin73

I'll have some reports tomorrow I hope TimelyPenguin76 SuccessfulKoala55 !

  				
Posted 3 years ago by UnevenDolphin73
