
We're trying to use the AWS autoscaler and have managed to get it up and running, spinning up instances. However, it does not seem to pull any of the tasks for the remote instances. We see it gets task_id=None, for example:

2022-01-25 10:00:28,362 - clearml.auto_scaler - INFO - Spinning new instance resource='aws4cpu', prefix='dynamic_aws', queue='aws_poc_1', task_id=None

  
  
Posted 2 years ago

Answers 29


Any thoughts SuccessfulKoala55 ?

  
  
Posted 2 years ago

Yup, latest version of ClearML SDK, and we're deployed on AWS using K8s helm

  
  
Posted 2 years ago

UnevenDolphin73

"Maybe it's the missing .bashrc file actually. I'll look into it."

That's actually something we intend to merge from the PR you've mentioned 🙂

  
  
Posted 2 years ago

and don't give up 🙂 - we'll make it work

  
  
Posted 2 years ago

Anything specific we should look into TimelyPenguin76 ?

  
  
Posted 2 years ago

... and any way to define the VPC is missing too 🤔

  
  
Posted 2 years ago

UnevenDolphin73 https://github.com/allegroai/clearml/pull/534 is still under investigation and I do not assume it will reveal a bug - the task_id should be inconsequential as long as the new instance is spinning with a provided queue - do you see the agent on that instance reporting in the "Queues and Workers" screen?

  
  
Posted 2 years ago

Hi UnevenDolphin73 ,

If the ec2 instance is up and running but no clearml-agent is running, something in the user data script failed.

Can you share the logs from the instance (you can send in DM if you like)?

  
  
Posted 2 years ago

Hi UnevenDolphin73 ,

Try to rerun it; a new instance will be created. Under this specific instance's Actions you have Monitoring and troubleshoot, and there you can select Get system logs.
I want to verify your scaler doesn't have any failures in this log.
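
For anyone who prefers to pull the same system log from a script rather than the AWS console, here is a minimal sketch. It assumes a boto3 EC2 client is passed in (boto3's `get_console_output` normally returns the log as decoded text); the helper names, the failure markers, and the instance ID in the usage comment are hypothetical and only meant as a starting point for spotting a failed user data script.

```python
# Sketch only: helper names and FAILURE_MARKERS are illustrative assumptions,
# not part of ClearML or boto3.

FAILURE_MARKERS = ("error", "failed", "traceback", "command not found")


def fetch_console_log(ec2_client, instance_id):
    """Return the instance's console output as text ('' if none is available yet)."""
    response = ec2_client.get_console_output(InstanceId=instance_id)
    return response.get("Output", "") or ""


def find_failure_lines(log_text, markers=FAILURE_MARKERS):
    """Return the log lines that contain any failure marker (case-insensitive)."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line.lower() for marker in markers)
    ]


# Usage (requires AWS credentials; the instance ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   for line in find_failure_lines(fetch_console_log(ec2, "i-0123456789abcdef0")):
#       print(line)
```

Console output can lag a few minutes behind the instance boot, so an empty result right after launch does not necessarily mean the script succeeded.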

  
  
Posted 2 years ago

Ah, probably https://github.com/allegroai/clearml/pull/534

Will try this out.

  
  
Posted 2 years ago

Just to figure out if that's related to some changes introduced in 1.1.5

  
  
Posted 2 years ago

Can you perhaps try an earlier version? Say ClearML SDK 1.1.4?

  
  
Posted 2 years ago

Well, the PR does not solve the issue; it basically tried to make the autoscaler behave differently (i.e. not pull from the queue, but always take a specific task ID)

  
  
Posted 2 years ago

Maybe it's the missing .bashrc file actually. I'll look into it.

  
  
Posted 2 years ago

Well, the assumption is that your network is configured correctly. In any case, the correct way is probably to make sure the autoscaler tags the instances and your AWS configuration sets the correct rules according to predefined tags

  
  
Posted 2 years ago

SuccessfulKoala55 TimelyPenguin76
After looking into it, I think it's because our AMI does not have docker, and that the default instance suggested by ClearML auto scaler example is outdated

  
  
Posted 2 years ago

From our IT dept:

Not really, when you launch the instance, the launch has to already be in the right VPC/Subnet. Configuration tools are irrelevant here.

  
  
Posted 2 years ago

Are you using the latest version of ClearML SDK?

  
  
Posted 2 years ago

No it does not show up. The instance spins up and then does nothing.
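
A quick way to double-check this outside the UI is to list the registered workers and filter by the autoscaler prefix. This is a rough sketch: it assumes that autoscaler-launched agents register with a worker ID starting with the configured prefix (`dynamic_aws` in the log above), which is not confirmed anywhere in this thread, and the helper name is hypothetical.

```python
# Sketch only: the prefix-based worker naming is an assumption.

def workers_with_prefix(worker_ids, prefix):
    """Return the worker IDs that start with the given autoscaler prefix."""
    return [worker_id for worker_id in worker_ids if worker_id.startswith(prefix)]


# Usage against a live server (requires a configured clearml.conf):
#   from clearml.backend_api.session.client import APIClient
#   client = APIClient()
#   ids = [worker.id for worker in client.workers.get_all()]
#   print(workers_with_prefix(ids, "dynamic_aws"))
```

An empty list while the EC2 instance is running would match the symptom described here: the instance is up, but no agent ever registered.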

  
  
Posted 2 years ago

If you have GPU autoscaling nodes in your k8s cluster already, you could also give the k8s glue agent a go https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L300 ?

With the correct tolerations/nodeselectors you can have k8s take care of the autoscaling for you by just spinning up a new pod

  
  
Posted 2 years ago

I'll have some reports tomorrow I hope TimelyPenguin76 SuccessfulKoala55 !

  
  
Posted 2 years ago

We do not CostlyFox64 , but this is useful for the future 🙂 Thanks!
TimelyPenguin76 I'll have a look, one moment.

  
  
Posted 2 years ago

The network is configured correctly 🙂 But the newly spun up instances need to be set to the same VPC/Subnet somehow

  
  
Posted 2 years ago

Define how?

  
  
Posted 2 years ago

Hey UnevenDolphin73 I'm also interested in this feature. Currently trying it out

  
  
Posted 2 years ago

In any case, we'll try and reproduce

  
  
Posted 2 years ago

I'll see if we can do that still (as the queue name suggests, this was a POC, so I'm trying to fix things before they give up 😛 ).
Any other thoughts? The original thread https://clearml.slack.com/archives/CTK20V944/p1641490355015400 suggests this PR solved the issue

  
  
Posted 2 years ago

We just redeployed to use the 1.1.4 version as Jake suggested, so the logs are gone 😞

  
  
Posted 2 years ago

We're still working these quirks out. But one issue after we changed the AMI is that the VPC (SubnetId?) was missing from the instance so it could not reach the ClearML API server.

I think maybe the autoscaler service is missing some additional settings...
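
To illustrate what "set at launch" means at the AWS level, here is a minimal sketch of the launch parameters for the raw EC2 `run_instances` call with the subnet and security groups included. This is not the ClearML autoscaler's own configuration or code; the helper name and all IDs in the usage comment are hypothetical placeholders.

```python
# Sketch only: shows the EC2 API side, not how the ClearML autoscaler is configured.

def build_launch_params(ami_id, instance_type, subnet_id, security_group_ids, user_data=""):
    """Build keyword arguments for boto3's ec2.run_instances with explicit networking."""
    params = {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # The subnet determines the VPC; it has to be given at launch time.
        "SubnetId": subnet_id,
        "SecurityGroupIds": list(security_group_ids),
    }
    if user_data:
        params["UserData"] = user_data
    return params


# Usage (requires AWS credentials; all IDs are placeholders):
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.run_instances(**build_launch_params(
#       "ami-0123456789abcdef0", "t3.xlarge",
#       "subnet-0123456789abcdef0", ["sg-0123456789abcdef0"]))
```

Without `SubnetId`, EC2 falls back to a default subnet in the default VPC when one exists, which would explain an instance that boots but cannot reach an API server living in another VPC.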

  
  
Posted 2 years ago