Yup, latest version of ClearML SDK, and we're deployed on AWS using K8s helm
UnevenDolphin73
That's actually something we intend to merge from the PR you've mentioned
and don't give up - we'll make it work
Anything specific we should look into TimelyPenguin76 ?
... and any way to define the VPC is missing too
UnevenDolphin73 https://github.com/allegroai/clearml/pull/534 is still under investigation and I do not assume it will reveal a bug - the task_id should be inconsequential as long as the new instance is spinning with a provided queue - do you see the agent on that instance reporting in the "Queues and Workers" screen?
Hi UnevenDolphin73 ,
If the ec2 instance is up and running but no clearml-agent is running, something in the user data script failed.
Can you share the logs from the instance (you can send in DM if you like)?
Hi UnevenDolphin73 ,
Try to re-run it; a new instance will be created. Under that instance's Actions menu you have "Monitor and troubleshoot", where you can select "Get system log".
I want to verify your autoscaler doesn't have any failures in this log.
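For anyone who prefers to pull the same system log without the console, here is a minimal Python sketch. It assumes boto3 is installed and AWS credentials are configured; `FAILURE_MARKERS`, `find_failures`, and `fetch_console_log` are hypothetical names made up for illustration, and the marker strings are only guesses at what a failing user data script might print.

```python
from typing import List, Sequence

# Hypothetical failure markers; adjust them to whatever your user data script prints.
FAILURE_MARKERS = (
    "command not found",
    "No such file or directory",
    "Traceback (most recent call last)",
)


def find_failures(console_log: str, markers: Sequence[str] = FAILURE_MARKERS) -> List[str]:
    """Return the log lines that contain any of the given markers (case-insensitive)."""
    lowered = [m.lower() for m in markers]
    return [
        line.strip()
        for line in console_log.splitlines()
        if any(m in line.lower() for m in lowered)
    ]


def fetch_console_log(instance_id: str, region: str) -> str:
    """Fetch the EC2 system log for one instance (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so find_failures works without boto3 installed

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.get_console_output(InstanceId=instance_id)
    # "Output" may be missing if the instance has not produced any console output yet.
    return response.get("Output", "")
```

Usage would be something like `find_failures(fetch_console_log(instance_id, region))` with the ID of the instance the autoscaler spun up; an empty list only means none of the assumed markers appeared, not that the script succeeded.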
Ah, probably https://github.com/allegroai/clearml/pull/534
Will try this out.
Just to figure out if that's related to some changes introduced in 1.1.5
Can you perhaps try an earlier version? Say ClearML SDK 1.1.4?
Well, the PR does not solve the issue, but basically tried to make the autoscaler behave differently (i.e. not pull from the queue, but always take a specific task ID)
Maybe it's the missing .bashrc file actually. I'll look into it.
Well, the assumption is that your network is configured correctly. In any case, the correct way is probably to make sure the autoscaler tags the instances, and that your AWS configuration sets the correct rules according to predefined tags
SuccessfulKoala55 TimelyPenguin76
After looking into it, I think it's because our AMI does not have docker, and that the default instance suggested by ClearML auto scaler example is outdated
From our IT dept:
Not really, when you launch the instance, the launch has to already be in the right VPC/Subnet. Configuration tools are irrelevant here.
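To make the point above concrete: at the EC2 API level the subnet (and therefore the VPC) is a launch-time parameter. The sketch below shows this with boto3's `run_instances`; it is not how the ClearML autoscaler does it, only an illustration of the underlying call. `build_launch_params` and `launch` are hypothetical helper names, and all IDs in the usage note are placeholders.

```python
from typing import Dict, Optional, Sequence


def build_launch_params(
    image_id: str,
    instance_type: str,
    subnet_id: str,
    security_group_ids: Sequence[str],
    user_data: str,
    tags: Optional[Dict[str, str]] = None,
) -> dict:
    """Build keyword arguments for EC2 run_instances, with the subnet fixed at launch."""
    params = {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # The subnet determines the VPC; it has to be given when the instance is launched.
        "SubnetId": subnet_id,
        "SecurityGroupIds": list(security_group_ids),
        "UserData": user_data,
    }
    if tags:
        # Tagging at launch lets tag-based AWS rules apply to the new instance.
        params["TagSpecifications"] = [
            {
                "ResourceType": "instance",
                "Tags": [{"Key": k, "Value": v} for k, v in tags.items()],
            }
        ]
    return params


def launch(params: dict, region: str) -> str:
    """Launch one instance and return its ID (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_launch_params works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.run_instances(**params)
    return response["Instances"][0]["InstanceId"]
```

For example, `build_launch_params("ami-PLACEHOLDER", "g4dn.xlarge", "subnet-PLACEHOLDER", ["sg-PLACEHOLDER"], "#!/bin/bash\n", tags={"role": "clearml-agent"})` yields a dict that can be passed to `launch`. Whether the autoscaler exposes equivalent settings depends on its version and is not confirmed in this thread.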
Are you using the latest version of ClearML SDK?
No it does not show up. The instance spins up and then does nothing.
If you have GPU autoscaling nodes in your k8s cluster already, you could also give the k8s glue agent a go https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L300 ?
With the correct tolerations/nodeselectors you can have k8s take care of the autoscaling for you by just spinning up a new pod
I'll have some reports tomorrow I hope TimelyPenguin76 SuccessfulKoala55 !
We do not CostlyFox64 , but this is useful for the future. Thanks!
TimelyPenguin76 I'll have a look, one moment.
The network is configured correctly. But the newly spun up instances need to be set to the same VPC/Subnet somehow
Hey UnevenDolphin73 I'm also interested in this feature. Currently trying it out
I'll see if we can do that still (as the queue name suggests, this was a POC, so I'm trying to fix things before they give up).
Any other thoughts? The original thread https://clearml.slack.com/archives/CTK20V944/p1641490355015400 suggests this PR solved the issue
We just redeployed to use the 1.1.4 version as Jake suggested, so the logs are gone
We're still working these quirks out. But one issue after we changed the AMI is that the VPC (SubnetId?) was missing from the instance so it could not reach the ClearML API server.
I think maybe the autoscaler service is missing some additional settings...