We just redeployed to use the 1.1.4 version as Jake suggested, so the logs are gone.
No it does not show up. The instance spins up and then does nothing.
Maybe it's the missing .bashrc file actually. I'll look into it.
Can you perhaps try an earlier version? Say ClearML SDK 1.1.4?
Yup, latest version of ClearML SDK, and we're deployed on AWS using K8s helm
We do not, CostlyFox64, but this is useful for the future. Thanks!
TimelyPenguin76 I'll have a look, one moment.
UnevenDolphin73
Maybe it's the missing .bashrc file actually. I'll look into it.
That's actually something we intend to merge from the PR you've mentioned.
Hey UnevenDolphin73 I'm also interested in this feature. Currently trying it out
Just to figure out if that's related to some changes introduced in 1.1.5
Well, the assumption is that your network is configured correctly. In any case, the correct way is probably to make sure the autoscaler tags the instances, and that your AWS configuration applies the correct rules according to predefined tags.
UnevenDolphin73 https://github.com/allegroai/clearml/pull/534 is still under investigation and I do not assume it will reveal a bug - the task_id should be inconsequential as long as the new instance is spinning with a provided queue - do you see the agent on that instance reporting in the "Queues and Workers" screen?
Hi UnevenDolphin73,
If the ec2 instance is up and running but no clearml-agent is running, something in the user data script failed.
Can you share the logs from the instance (you can send in DM if you like)?
Hi UnevenDolphin73,
Try to rerun it; a new instance will be created. Under that instance's Actions menu you have "Monitor and troubleshoot", where you can select "Get system log".
I want to verify your scaler doesn't have any failures in this log.
The network is configured correctly. But the newly spun up instances need to be set to the same VPC/Subnet somehow.
Well, the PR does not solve the issue; it basically tries to make the autoscaler behave differently (i.e. not pull from a queue, but always take a specific task ID).
From our IT dept:
Not really, when you launch the instance, the launch has to already be in the right VPC/Subnet. Configuration tools are irrelevant here.
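To make the IT dept's point concrete, here is a minimal sketch of launch parameters that pin the subnet (and therefore the VPC) at launch time and also tag the instance. It only builds the keyword arguments for boto3's EC2 `run_instances` call. The helper name `build_launch_kwargs` and all IDs are hypothetical, and this is not how the ClearML autoscaler itself is implemented.

```python
from typing import Any, Dict, List


def build_launch_kwargs(
    ami_id: str,
    instance_type: str,
    subnet_id: str,
    security_group_ids: List[str],
    tags: Dict[str, str],
) -> Dict[str, Any]:
    """Build keyword arguments for boto3's EC2 client `run_instances`.

    Hypothetical helper for illustration only.
    """
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        # The subnet implies the VPC and has to be chosen at launch.
        "SubnetId": subnet_id,
        "SecurityGroupIds": security_group_ids,
        # Tags let AWS-side rules pick up autoscaled instances.
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [
                    {"Key": key, "Value": value}
                    for key, value in sorted(tags.items())
                ],
            }
        ],
    }


# Usage (placeholder IDs; requires AWS credentials and a region):
#   import boto3
#   kwargs = build_launch_kwargs(
#       "ami-0123456789abcdef0", "g4dn.xlarge",
#       "subnet-0123456789abcdef0", ["sg-0123456789abcdef0"],
#       {"launched-by": "autoscaler"},
#   )
#   boto3.client("ec2").run_instances(**kwargs)
```

Keeping the dictionary construction separate from the AWS call makes it easy to check what would be sent before any instance is actually launched.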
I'll see if we can do that still (as the queue name suggests, this was a POC, so I'm trying to fix things before they give up).
Any other thoughts? The original thread https://clearml.slack.com/archives/CTK20V944/p1641490355015400 suggests this PR solved the issue
Are you using the latest version of ClearML SDK?
Anything specific we should look into TimelyPenguin76 ?
SuccessfulKoala55 TimelyPenguin76
After looking into it, I think it's because our AMI does not have docker, and that the default instance suggested by ClearML auto scaler example is outdated
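A quick way to check a candidate AMI for the two suspects raised in this thread (no docker, no .bashrc) is a small preflight script run on the instance. This is only a sketch: the function name `missing_prerequisites` is hypothetical, and the list of checks is an assumption based on this conversation, not an official list of ClearML agent requirements.

```python
import shutil
from pathlib import Path
from typing import Callable, List, Optional


def missing_prerequisites(
    home: Path,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the names of the checked prerequisites that are absent."""
    missing = []
    # Is a docker executable on the PATH?
    if which("docker") is None:
        missing.append("docker")
    # Does the user's home directory contain a .bashrc file?
    if not (home / ".bashrc").is_file():
        missing.append(".bashrc")
    return missing


if __name__ == "__main__":
    # Prints an empty list when both checks pass.
    print(missing_prerequisites(Path.home()))
```

The `which` parameter is injectable only so the check can be exercised without docker installed.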
I'll have some reports tomorrow I hope TimelyPenguin76 SuccessfulKoala55 !
We're still working these quirks out. But one issue after we changed the AMI is that the VPC (SubnetId?) was missing from the instance so it could not reach the ClearML API server.
I think maybe the autoscaler service is missing some additional settings...
If you have GPU autoscaling nodes in your k8s cluster already, you could also give the k8s glue agent a go https://github.com/allegroai/clearml-helm-charts/blob/9c15a8a348898aed5504420778d0e815b41642e5/charts/clearml/values.yaml#L300 ?
With the correct tolerations/nodeselectors you can have k8s take care of the autoscaling for you by just spinning up a new pod
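For reference, the scheduling part of such a pod spec could look roughly like the sketch below, written as a Python dict for illustration. `nodeSelector` and `tolerations` are standard Kubernetes pod spec fields, but the helper name, the node label, and the taint key are assumptions and must match however your GPU node group is actually labelled and tainted.

```python
from typing import Any, Dict


def gpu_pod_scheduling(node_label: Dict[str, str], taint_key: str) -> Dict[str, Any]:
    """Return the scheduling fields of a pod spec for tainted GPU nodes.

    Hypothetical helper; label and taint names come from the caller.
    """
    return {
        # Only schedule onto nodes that carry this label.
        "nodeSelector": dict(node_label),
        # Allow scheduling onto nodes tainted with this key.
        "tolerations": [
            {"key": taint_key, "operator": "Exists", "effect": "NoSchedule"}
        ],
    }


# Example values (assumed, adjust to your cluster):
scheduling = gpu_pod_scheduling({"node-group": "gpu"}, "nvidia.com/gpu")
```

With a cluster autoscaler in place, a pending pod carrying these fields is what triggers a new GPU node to be added.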
and don't give up - we'll make it work
... and any way to define the VPC is missing too
Ah, probably https://github.com/allegroai/clearml/pull/534
Will try this out.