We don't, we use the SaaS.
Exactly, that's the call I implemented; I wrapped it in some serverless code that exports the data to CloudWatch.
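For readers who want to do the same, here is a minimal sketch of such an export step, assuming a helper that returns the pending-job count per queue (for example one wrapping queues.get_num_entries) and a boto3 CloudWatch client. The namespace, metric name, and function names are hypothetical; only `put_metric_data` with `Namespace`/`MetricData` is the real CloudWatch API. The clients are passed in so the logic can run inside a Lambda handler or be tested with fakes.

```python
from typing import Callable, Dict, List

# Hypothetical namespace and metric name; pick your own.
NAMESPACE = "ClearML/Queues"
METRIC_NAME = "PendingJobs"


def build_metric_data(queue_counts: Dict[str, int]) -> List[dict]:
    """Turn {queue_name: pending_count} into CloudWatch MetricData entries."""
    return [
        {
            "MetricName": METRIC_NAME,
            "Dimensions": [{"Name": "Queue", "Value": name}],
            "Value": float(count),
            "Unit": "Count",
        }
        for name, count in sorted(queue_counts.items())
    ]


def export_queue_lengths(
    queue_names: List[str],
    get_num_entries: Callable[[str], int],
    cloudwatch,
) -> List[dict]:
    """Fetch pending counts and publish them in one put_metric_data call.

    get_num_entries: hypothetical helper wrapping queues.get_num_entries.
    cloudwatch: e.g. boto3.client("cloudwatch"), injected so it can be faked.
    """
    counts = {name: get_num_entries(name) for name in queue_names}
    metric_data = build_metric_data(counts)
    cloudwatch.put_metric_data(Namespace=NAMESPACE, MetricData=metric_data)
    return metric_data
```

Publishing one metric per queue (as a dimension) lets a CloudWatch alarm or a custom scaler react to each queue independently.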
…or would running the auto-scaler on our own (with our own node pools) solve this issue?
Sorry, maybe I’m not getting the whole picture yet.
ok, I misread. The launch code runs in the SaaS, but it uses credentials to launch machines in our cloud. What stops it then from specifying an IAM role existing in our cloud? Isn’t this just an API call?
Oh, well, the SaaS supports that as well 🙂
Thank you for explaining. As we are implementing a custom auto-scaler for the machines, it looks like the queue length (or the rate at which messages arrive in the queue) is the best indicator for scaling up or down.
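One way to turn queue length into a scaling decision is sketched below. It assumes each instance offers a fixed number of job slots (e.g. one agent per GPU, as described later in the thread) and that you can count both pending and running jobs. The function name and parameters are hypothetical, not part of any ClearML API.

```python
import math


def desired_instance_count(
    pending_jobs: int,
    running_jobs: int,
    slots_per_instance: int = 1,
    min_instances: int = 0,
    max_instances: int = 10,
) -> int:
    """Return how many instances should exist for the current load (hypothetical helper).

    pending_jobs: queue length, e.g. from queues.get_num_entries.
    running_jobs: jobs already executing on existing instances.
    slots_per_instance: agents per machine, e.g. one agent per GPU.
    """
    if slots_per_instance < 1:
        raise ValueError("slots_per_instance must be >= 1")
    # Every running or waiting job needs a slot; round up to whole machines.
    needed_slots = max(pending_jobs, 0) + max(running_jobs, 0)
    needed = math.ceil(needed_slots / slots_per_instance)
    # Clamp to the configured bounds.
    return max(min_instances, min(needed, max_instances))
```

In practice you would add a cooldown or idle timeout before scaling down, since a queue can be empty for a moment between jobs; a rate-based signal would instead compare queue lengths sampled over time.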
Hi SparklingHedgehong28, ClearML enqueues jobs/tasks to queues simply according to your request, whether you enqueue from the UI or from code. Any agent monitoring a queue (or more than one queue) can pull work from that queue (agents monitoring more than one queue use a round-robin scheme). In general, an agent pulls and runs one job at a time. Multiple agents running on a single machine (each with a different GPU assignment) will each pull and run one job at a time.
An exception to this is the agent's services mode, which is designed to spin up several jobs in parallel. This mode is used by agents running CPU-only tasks with low resource requirements (the default services agent that runs as part of the server deployment is one example).
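The pull model described above can be illustrated with a small simulation: one agent polls its queues round-robin and runs a single job to completion before pulling the next one. This is a toy model under that assumption, not clearml-agent code, and the real agent's polling details may differ; the function names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Tuple


def next_job(
    queues: Dict[str, Deque[str]], order: List[str], start: int
) -> Tuple[Optional[str], int]:
    """Poll the queues round-robin, beginning at index `start`.

    Returns (job, next_start); job is None when every queue is empty.
    """
    n = len(order)
    for i in range(n):
        idx = (start + i) % n
        queue = queues[order[idx]]
        if queue:
            # The next poll starts at the queue after the one just served.
            return queue.popleft(), (idx + 1) % n
    return None, start


def run_agent(queues: Dict[str, Deque[str]], order: List[str]) -> List[str]:
    """Simulate one agent: pull a job, run it to completion, then pull again."""
    executed: List[str] = []
    start = 0
    while True:
        job, start = next_job(queues, order, start)
        if job is None:
            break  # a real agent would sleep and poll again instead of exiting
        executed.append(job)
    return executed


if __name__ == "__main__":
    qs = {"high": deque(["h1", "h2"]), "low": deque(["l1"])}
    print(run_agent(qs, ["high", "low"]))  # ['h1', 'l1', 'h2']
```

With two agents on one machine (one per GPU), you would run two such loops side by side, so the machine executes at most two jobs at once.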
How else would it spin new instances in your account? 🙂
Hi SuccessfulKoala55, just wondering: you mentioned the open-source version of the autoscaler code, but where is it hosted? I’ve only found https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py , but it looks like code to launch a managed auto-scaler instead 🙂
SparklingHedgehong28, since the application code launching the machines is running in the ClearML SaaS, we can't really use your IAM role 🙂
Hello Jake, sorry for the delay. I found the original autoscaler script and now have a better understanding of how things work.
Can you please help me understand how ClearML assigns jobs to queues? Will there be more than 1 job per machine at a single point in time? I do understand that if there are 2 agents running, one for each GPU, a vacant agent will probably take a job from the queue. But what’s the general case here?
Thanks so much and sorry if this was already explained – feel free to point me to a doc.
Running the application code on your own is supported in the Scale and higher tiers, but you can always write your own autoscaler code based on the open-source version 🙂
Why does the autoscaler app then ask for AWS credentials? 🙂
Hi SparklingHedgehong28,
also, when the AZ spec is left empty (it’s optional), machines fail to start with
Invalid type for parameter LaunchSpecification.Placement.AvailabilityZone, value: None, type: <class 'NoneType'>, valid types: <class 'str'>
I’m checking this issue. When no AZ is specified, it should use an available one; I’ll keep you posted.
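The error above comes from boto3's client-side parameter validation: `Placement.AvailabilityZone` must be a string, so passing `None` fails before any request reaches AWS. A minimal sketch of the usual workaround, assuming the launch goes through something like `request_spot_instances(LaunchSpecification=...)`, is to omit the key entirely when no AZ is configured so EC2 can pick one. The helper name is hypothetical.

```python
from typing import Optional


def build_launch_spec(
    image_id: str,
    instance_type: str,
    availability_zone: Optional[str] = None,
) -> dict:
    """Build an EC2 LaunchSpecification, omitting AvailabilityZone when unset (hypothetical helper).

    Passing Placement={"AvailabilityZone": None} fails boto3's type check;
    leaving the key out lets EC2 choose an AZ instead.
    """
    spec = {"ImageId": image_id, "InstanceType": instance_type}
    if availability_zone:  # skips both None and the empty string
        spec["Placement"] = {"AvailabilityZone": availability_zone}
    return spec
```

If a subnet is also specified, the subnet's AZ takes effect, so this only matters when neither is given.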
ok, I misread. The launch code runs in the SaaS, but it uses credentials to launch machines in our cloud. What stops it then from specifying an IAM role existing in our cloud? Isn’t this just an API call?
Regarding this one, as SuccessfulKoala55 mentioned, you can use a full IAM role (without any credentials) in the higher tiers. In the regular tier, you’ll need the credentials to authenticate the boto3 calls used for spin-up, spin-down, tagging and similar API operations. The app is currently hosted by us, so your IAM role won’t really be available to it.
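To make concrete what those credentials are used for, here is a sketch of the spin-down side: find instances by tag and terminate them. It assumes the autoscaler tags its instances with a `Name` tag whose value is the hypothetical `clearml-autoscaler`; `describe_instances` and `terminate_instances` are real boto3 EC2 client methods. The `ec2` client is injected, e.g. `boto3.client("ec2")` built from supplied keys or, when self-hosted, from an IAM role.

```python
from typing import List

# Hypothetical tag value marking instances launched by the autoscaler.
AUTOSCALER_TAG = "clearml-autoscaler"


def find_tagged_instances(ec2, tag_value: str = AUTOSCALER_TAG) -> List[str]:
    """Return IDs of pending/running instances carrying the autoscaler tag.

    Ignores pagination for brevity; use a paginator for large fleets.
    """
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Name", "Values": [tag_value]},
            {"Name": "instance-state-name", "Values": ["pending", "running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp.get("Reservations", [])
        for inst in reservation.get("Instances", [])
    ]


def spin_down(ec2, instance_ids: List[str]) -> None:
    """Terminate the given instances; no API call when the list is empty."""
    if instance_ids:
        ec2.terminate_instances(InstanceIds=instance_ids)
```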
If you're using a recent version, you can use the queues.get_num_entries endpoint to obtain the number of pending entries (jobs) in a queue instead of pulling the entire queue contents.
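A minimal sketch of calling that endpoint over plain HTTP with the standard library follows. It assumes the usual ClearML REST conventions: a POST to `<api_server>/queues.get_num_entries` with a JSON body `{"queue": <queue id>}`, a bearer token (typically obtained from `auth.login` with your access/secret keys), and a response whose count sits under `data.num`. The `num` field name and the exact path are assumptions; please verify them against your server's API reference.

```python
import json
import urllib.request


def build_num_entries_request(
    api_server: str, token: str, queue_id: str
) -> urllib.request.Request:
    """Build the POST request for queues.get_num_entries (assumed path and body)."""
    url = api_server.rstrip("/") + "/queues.get_num_entries"
    body = json.dumps({"queue": queue_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def parse_num_entries(raw: bytes) -> int:
    """Extract the count; assumes the response shape {"data": {"num": N}}."""
    return int(json.loads(raw)["data"]["num"])


def get_num_entries(api_server: str, token: str, queue_id: str, timeout: float = 10.0) -> int:
    """Send the request and return the number of pending entries."""
    req = build_num_entries_request(api_server, token, queue_id)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_num_entries(resp.read())
```

Polling a single integer per queue is much cheaper than fetching the full queue contents, which matters when this runs every minute from a serverless function.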
what stops adding ‘IAM Role’ to the list of parameters of the machine group? I can’t wrap my head around this.
Actually, this option is also available in the IAM role version mentioned above (you can specify either the ARN or the name of the role).
Do you mean assigning an IAM role to an EC2 instance it launches?
Also, I think the IAM role feature is supported in the Scale and higher tiers.
With it, newly created instances will also have the IAM role associated with them.
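At the EC2 API level, a role is attached to a launched instance through an instance profile, passed to `run_instances` as `IamInstanceProfile={"Arn": ...}` or `{"Name": ...}`. The sketch below adds that parameter to an existing set of launch kwargs; the helper name is hypothetical, and requiring exactly one of ARN or name is a design choice here. Note that the credentials making the call also need `iam:PassRole` permission on the underlying role.

```python
from typing import Optional


def with_instance_profile(
    launch_params: dict, arn: Optional[str] = None, name: Optional[str] = None
) -> dict:
    """Return a copy of run_instances kwargs with IamInstanceProfile set (hypothetical helper).

    Specify the instance profile by ARN or by name, but not both.
    """
    if bool(arn) == bool(name):
        raise ValueError("specify exactly one of arn or name")
    params = dict(launch_params)  # leave the caller's dict untouched
    params["IamInstanceProfile"] = {"Arn": arn} if arn else {"Name": name}
    return params
```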
Thank you for the reply, Alon, and for looking into the issue. I’m not talking about the app/launcher itself inheriting IAM permissions by assuming the role; that’s absolutely understandable, as it runs in your cloud. Rather, when launching an AWS auto-scaler app, what stops you from adding ‘IAM Role’ to the list of parameters of the machine group? I can’t wrap my head around this.
Yup, seems like it. Which Server version are you using?