It’s a clone of a previous one, since I’ve failed -> cloned -> changed params -> failed -> clone -> …
also, when the AZ spec is left empty (it’s optional), machines fail to start with `Invalid type for parameter LaunchSpecification.Placement.AvailabilityZone, value: None, type: <class 'NoneType'>, valid types: <class 'str'>`
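For context, a minimal sketch of the kind of guard I’d expect around that parameter – function and variable names here are hypothetical, only the `LaunchSpecification.Placement.AvailabilityZone` key comes from the error above:
```python
from typing import Optional


def build_launch_specification(ami_id: str, instance_type: str,
                               availability_zone: Optional[str] = None) -> dict:
    """Build a LaunchSpecification dict, omitting Placement when no AZ is configured."""
    spec = {"ImageId": ami_id, "InstanceType": instance_type}
    if availability_zone:
        # Passing None here is what triggers the boto3 validation error above,
        # so only add the key when an AZ is actually provided.
        spec["Placement"] = {"AvailabilityZone": availability_zone}
    return spec
```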
Thank you for the reply, Alon, and for looking into the issue. I’m not speaking of the app/launcher app itself inheriting IAM permissions by assuming the role – that is absolutely understandable, as it’s in your cloud. On the contrary: when launching an AWS auto-scaler app, what stops you from adding ‘IAM Role’ to the list of parameters of the machine group? I can’t wrap my head around this.
…or would running the auto-scaler on our own (with our own node pools) solve this issue?
Sorry, maybe I’m not getting the whole picture yet.
We don't, we use the SaaS.
Exactly – that is the call I implemented and wrapped in some serverless code to export data to CloudWatch.
Nope, this is the ‘autoscaler app’ from the web interface of the SaaS. Nothing self-hosted at the moment.
Hi CostlyOstrich36, thank you for reaching back, and sorry for my embarrassingly late answer here.
We launch the job on a remote executor this way –
From the repository where the code lives, we launch a job with clearml-task
in this form: `clearml-task --name projx-1-debug-$(date +%s) --project kbc --folder . --script projx/kbcrun.py --requirements custom-requirements.txt --docker python:3.9.13 --task-type training --queue aws-cpu-docker --args graph_dir=not/needed config_path=examples/tr...`
Why does the autoscaler app then ask for AWS credentials? 🙂
Hey AgitatedDove14, basically https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-instance-protection.html would allow blocking the machine from being scaled in when there is a scale-in event in the ASG.
The ASG is responsible for spinning up on demand based on the ClearML queue, but spinning down is less trivial – we cannot just spin down if the queue is empty (some machine could still be running something important!)
ok, I misread. The launch code runs in the SaaS, but it uses credentials to launch machines in our cloud. What stops it then from specifying an IAM role existing in our cloud? Isn’t this just an API call?
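For illustration only – a minimal sketch of what I mean, assuming the launcher uses something like boto3’s run_instances under the hood (the AMI, instance type, and profile name here are made up):
```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# The same RunInstances call that launches the worker could attach a role
# that already exists in our account, via an instance profile.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # hypothetical AMI
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={"Name": "clearml-worker-profile"},  # hypothetical profile name
)
```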
Hello Jake, sorry for the delay. I found the original auto scaler script and now understand more about how stuff works.
Can you please help me understand how ClearML assigns jobs to queues – will there be more than one job per machine at a single point in time? I do understand that if there are 2 agents running, one for each GPU, then a vacant agent will probably take a job from the queue. But what’s the general case here?
Thanks so much and sorry if this was already explained – feel free to point me to...
JitteryCoyote63 so you don’t need to use creds anymore?
Hi folks, thanks for the replies. We’ll go ahead and upgrade to try.
Thank you for explaining. As we are implementing a custom auto-scaler for the machines, it looks like the queue length – or the rate of messages falling into the queue – is the best indicator for scaling up or down.
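Roughly what I have in mind, as a sketch: publish the queue length as a custom CloudWatch metric that the ASG’s target tracking policy can follow. `get_queue_length` is a hypothetical helper around the ClearML API, and the namespace/metric names are made up:
```python
import boto3


def get_queue_length(queue_name: str) -> int:
    """Hypothetical helper: number of pending tasks in a ClearML queue."""
    raise NotImplementedError  # would wrap the ClearML API call we already use


def publish_queue_length(queue_name: str = "aws-cpu-docker") -> None:
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="ClearML/Autoscaling",        # made-up namespace
        MetricData=[{
            "MetricName": "PendingTasks",
            "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
            "Value": float(get_queue_length(queue_name)),
            "Unit": "Count",
        }],
    )
```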
It’s like a completion hook when the job terminates (whether success or failure).
What I’m thinking of: instance scale-in in an ASG doesn’t happen if instance protection is enabled:
Agent fetches a job and starts the container; instance protection gets enabled with an API call run in extra_docker_shell_script, and the job is launched.
The job finishes; instance protection gets disabled in this post-run hook, and the instance may be terminated.
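A minimal sketch of the protection toggle I have in mind, assuming the call runs on the worker instance itself (instance id taken from IMDSv2, ASG name discovered via describe_auto_scaling_instances):
```python
import boto3
import requests


def set_scale_in_protection(protected: bool) -> None:
    # IMDSv2: fetch a token, then the instance id of the machine we run on
    token = requests.put(
        "http://169.254.169.254/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
        timeout=2,
    ).text
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
        timeout=2,
    ).text

    autoscaling = boto3.client("autoscaling")
    # Look up which ASG this instance belongs to
    info = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    asg_name = info["AutoScalingInstances"][0]["AutoScalingGroupName"]

    # Toggle scale-in protection for this single instance
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=protected,
    )

# Called from extra_docker_shell_script before the task starts:
#   set_scale_in_protection(True)
# Called from the post-run hook after the task ends:
#   set_scale_in_protection(False)
```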
Sounds okay... Will I need state to calculate the idle time over time, or is there some `idle` param in the API answer? Because ideally I’d run this in a stateless Lambda.
Thanks. I guess there are too many moving parts in the official implementation that would need adaptation and wrapping up – such as the use of credentials instead of IAM, since it’s designed to work cross-cloud (or cloud-agnostic) – hence for us it’s easier to reimplement the wheel. 🙃
Hey AgitatedDove14 , thanks for having this discussion 🙂
We are collecting machine/task data from the ClearML API using a Lambda and pushing it to CloudWatch as 1-or-0 datapoints per machine, depending on whether the machine is doing work or not. Another Lambda, run on an ASG termination event, compares the incoming machine list with the list of machines from CW which have not been running anything for x minutes and returns the intersection. The ASG then terminates only machines doing nothing during the last per...
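Roughly the shape of the first Lambda, as a sketch – it assumes the ClearML APIClient’s workers.get_all() exposes a task attribute on busy workers, which is the part I’d double-check:
```python
import boto3
from clearml.backend_api.session.client import APIClient


def lambda_handler(event, context):
    """Push a 1/0 'busy' datapoint per worker to CloudWatch (sketch)."""
    client = APIClient()              # credentials come from the usual clearml.conf / env vars
    cloudwatch = boto3.client("cloudwatch")

    for worker in client.workers.get_all():
        # Assumption: busy workers carry a task reference, idle ones do not
        busy = 1.0 if getattr(worker, "task", None) else 0.0
        cloudwatch.put_metric_data(
            Namespace="ClearML/Workers",                     # made-up namespace
            MetricData=[{
                "MetricName": "Busy",
                "Dimensions": [{"Name": "WorkerId", "Value": worker.id}],
                "Value": busy,
                "Unit": "Count",
            }],
        )
```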
And maybe adding idle time spent without a job to the API is not that bad an idea 😉
Lambdas are designed to be short-lived; I don’t think it’s a good idea to run one in a loop, TBH.
Tricky question!
I see this ASG with a TargetTrackingPolicy for both scale-up (if queue size > 0) and scale-down, but scale-down goes (additionally or only) through a custom policy – check whether a specific machine can be shut down. For this we need to make sure there’s no job running there. Two ways to do it –
- Instance protection set on/off, which is simple.
- Compare the machines that the ASG wants to shut down with the machines having `tasks {}` retrieved from the API. If a task is running, avoid ...
So this is not for end-user convenience like sending Slack messages, but rather system-related hooks useful for auto-scaling, internal APIs and such. If this functionality is not available out of the box, we’d need to resort to looking into scaling-in in a different way. We are thinking of:
Scaling in on very low group-average CPU/GPU usage – not reliable, because a machine could be running data uploading or other low-load work. Using an https://docs.aws.amazon.com/autoscaling/ec2/userguide/lambda-c...
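The second option (the truncated link is to the Lambda-based custom termination policy) would look roughly like this – a sketch only; is_busy is a hypothetical helper over the ClearML worker data, and the event/response shape is the custom-termination-policy contract as far as I understand it:
```python
def is_busy(instance_id: str) -> bool:
    """Hypothetical helper: True if the ClearML worker on this instance is running a task."""
    raise NotImplementedError


def lambda_handler(event, context):
    # The event comes from the ASG's custom termination policy and lists the
    # candidate instances it is allowed to terminate.
    candidates = [inst["InstanceId"] for inst in event.get("Instances", [])]

    # Only hand back the instances that are not running any ClearML task.
    idle = [instance_id for instance_id in candidates if not is_busy(instance_id)]

    # The ASG terminates only the instances we return here; returning an empty
    # list means nothing gets terminated in this round.
    return {"InstanceIDs": idle}
```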
This is a re-implementation, I’d say.
Every instance is running an agent in docker mode. One agent = one task for autoscaling purposes.
Hi SuccessfulKoala55, just wondering – you mentioned the open-source version of the autoscaler code, but where is it hosted? I’ve only found https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py, but it looks like code to launch a managed auto-scaler instead 🙂
Thank you, Martin. Probably then a simple Lambda that constantly monitors the workers and sets/unsets the protection flag should work. Though I’d avoid writing a timestamp to any kind of state. What if I write the last-active time in an instance tag? This could be a solution… `w = get_clearml_workers(); for instance in w: if instance['processing_job'] is True: instance_tag['last_job_seen'] = current_time() else: compare_times_and_allow_shutdown_if_idle()` ...
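Fleshing that pseudocode out a bit, as a sketch – get_clearml_workers and the busy check are stand-ins for whatever the ClearML API actually returns, and the tag name and idle threshold are made up:
```python
import datetime

import boto3

IDLE_MINUTES = 15                     # arbitrary idle threshold
TAG_KEY = "clearml:last_job_seen"     # made-up tag name

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")


def get_clearml_workers():
    """Hypothetical helper: list of dicts with 'instance_id' and 'processing_job'."""
    raise NotImplementedError


def handle_worker(worker: dict) -> None:
    now = datetime.datetime.now(datetime.timezone.utc)
    instance_id = worker["instance_id"]

    if worker["processing_job"]:
        # Busy: remember on the instance itself when we last saw a job.
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[{"Key": TAG_KEY, "Value": now.isoformat()}],
        )
        return

    # Idle: read the tag back and drop scale-in protection once idle long enough.
    tags = ec2.describe_tags(
        Filters=[
            {"Name": "resource-id", "Values": [instance_id]},
            {"Name": "key", "Values": [TAG_KEY]},
        ]
    )["Tags"]
    if not tags:
        return
    last_seen = datetime.datetime.fromisoformat(tags[0]["Value"])
    if (now - last_seen) > datetime.timedelta(minutes=IDLE_MINUTES):
        asg = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
        asg_name = asg["AutoScalingInstances"][0]["AutoScalingGroupName"]
        autoscaling.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=False,
        )


def lambda_handler(event, context):
    for worker in get_clearml_workers():
        handle_worker(worker)
```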
Yes, why not. I think it's also an option.
I'd assume not, trust John here.