It’s a clone of a previous one, since I’ve failed -> cloned -> changed params -> failed -> clone -> …
also, when the AZ spec is left empty (it’s optional), machines fail to start with `Invalid type for parameter LaunchSpecification.Placement.AvailabilityZone, value: None, type: <class 'NoneType'>, valid types: <class 'str'>`
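For context, a minimal sketch of the kind of guard I’d expect around that parameter – function and variable names here are hypothetical, only the `LaunchSpecification.Placement.AvailabilityZone` key comes from the error above:
```python
from typing import Optional


def build_launch_specification(ami_id: str, instance_type: str,
                               availability_zone: Optional[str] = None) -> dict:
    """Build a LaunchSpecification dict, omitting Placement when no AZ is configured."""
    spec = {"ImageId": ami_id, "InstanceType": instance_type}
    if availability_zone:
        # Passing None here is what triggers the boto3 validation error above,
        # so only add the key when an AZ is actually provided.
        spec["Placement"] = {"AvailabilityZone": availability_zone}
    return spec
```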
Thank you for the reply, Alon, and for looking into the issue. I’m not speaking of the app/launcher app itself inheriting IAM permissions by assuming the role – that is absolutely understandable, as it’s in your cloud. On the contrary: when launching an AWS auto-scaler app, what stops you from adding ‘IAM Role’ to the list of parameters of the machine group? I can’t wrap my head around this.
…or would running the auto-scaler on our own (with our own node pools) solve this issue?
Sorry, maybe I’m not getting the whole picture yet.
We don't, we use the SaaS.
Exactly – that is the call I implemented and wrapped in some serverless code to export data to CloudWatch.
Nope, this is the ‘autoscaler app’ from the web interface of the SaaS. Nothing self-hosted at the moment.
Hi CostlyOstrich36, thank you for reaching back, and sorry for my embarrassingly late answer here.
We launch the job on a remote executor this way –
From the repository where the code lives, we launch a job with clearml-task
in this form: `clearml-task --name projx-1-debug-$(date +%s) --project kbc --folder . --script projx/kbcrun.py --requirements custom-requirements.txt --docker python:3.9.13 --task-type training --queue aws-cpu-docker --args graph_dir=not/needed config_path=examples/tr...`
Why does the autoscaler app then ask for AWS credentials? 🙂
Hey AgitatedDove14, basically https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-instance-protection.html would allow blocking the machine from being scaled in when there is a scale-in event in the ASG.
The ASG is responsible for spinning up on demand based on the ClearML queue, but spinning down is less trivial – we cannot just spin down if the queue is empty (some machine could still be running something important!)
ok, I misread. The launch code runs in the SaaS, but it uses credentials to launch machines in our cloud. What stops it then from specifying an IAM role existing in our cloud? Isn’t this just an API call?
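For illustration only – a minimal sketch of what I mean, assuming the launcher uses something like boto3’s run_instances under the hood (the AMI, instance type, and profile name here are made up):
```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# The same RunInstances call that launches the worker could attach a role
# that already exists in our account, via an instance profile.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",      # hypothetical AMI
    InstanceType="m5.xlarge",
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={"Name": "clearml-worker-profile"},  # hypothetical profile name
)
```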
Hello Jake, sorry for the delay. I found the original auto scaler script and now understand more about how stuff works.
Can you please help me understand how ClearML assigns jobs to queues – will there be more than one job per machine at a single point in time? I do understand that if there are 2 agents running, one for each GPU, then a vacant agent will probably take a job from the queue. But what’s the general case here?
Thanks so much and sorry if this was already explained – feel free to point me to...
JitteryCoyote63 so you don’t need to use creds anymore?
Hi folks, thanks for the replies. We’ll go ahead and upgrade to try.
Thank you for explaining. As we are implementing a custom auto-scaler for the machines, it looks like the queue length – or the rate of messages falling into the queue – is the best indicator for scaling up or down.
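Roughly what I have in mind, as a sketch: publish the queue length as a custom CloudWatch metric that the ASG’s target tracking policy can follow. `get_queue_length` is a hypothetical helper around the ClearML API, and the namespace/metric names are made up:
```python
import boto3


def get_queue_length(queue_name: str) -> int:
    """Hypothetical helper: number of pending tasks in a ClearML queue."""
    raise NotImplementedError  # would wrap the ClearML API call we already use


def publish_queue_length(queue_name: str = "aws-cpu-docker") -> None:
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_data(
        Namespace="ClearML/Autoscaling",        # made-up namespace
        MetricData=[{
            "MetricName": "PendingTasks",
            "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
            "Value": float(get_queue_length(queue_name)),
            "Unit": "Count",
        }],
    )
```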
It’s like a completion hook when the job terminates (whether success or failure).
What I’m thinking of: instance scale-in in an ASG doesn’t happen if instance protection is enabled:
Agent fetches a job and starts the container; instance protection gets enabled with an API call run in extra_docker_shell_script, and the job is launched.
The job finishes; instance protection gets disabled in this post-run hook, and the instance may be terminated.
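A minimal sketch of the protection toggle I have in mind, assuming the call runs on the worker instance itself (instance id taken from IMDSv2, ASG name discovered via describe_auto_scaling_instances):
```python
import boto3
import requests


def set_scale_in_protection(protected: bool) -> None:
    # IMDSv2: fetch a token, then the instance id of the machine we run on
    token = requests.put(
        "http://169.254.169.254/latest/api/token",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "300"},
        timeout=2,
    ).text
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
        timeout=2,
    ).text

    autoscaling = boto3.client("autoscaling")
    # Look up which ASG this instance belongs to
    info = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
    asg_name = info["AutoScalingInstances"][0]["AutoScalingGroupName"]

    # Toggle scale-in protection for this single instance
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=protected,
    )

# Called from extra_docker_shell_script before the task starts:
#   set_scale_in_protection(True)
# Called from the post-run hook after the task ends:
#   set_scale_in_protection(False)
```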
Sounds okay... Will I need state to calculate the idle time over time, or is there some `idle` param in the API answer? Because ideally I’d run this in a stateless Lambda.
Thanks. I guess there are too many moving parts in the official implementation that would need adaptation and wrapping up – such as the use of credentials instead of IAM, since it’s designed to work cross-cloud (or cloud-agnostic) – hence for us it’s easier to reimplement the wheel. 🙃
Hey AgitatedDove14 , thanks for having this discussion 🙂
We are collecting machine/task data from the ClearML API using a Lambda and pushing it to CloudWatch as 1-or-0 datapoints per machine, depending on whether the machine is doing work or not. Another Lambda, run on an ASG termination event, compares the incoming machine list with the list of machines from CW which have not been running anything for x minutes and returns the intersection. The ASG then terminates only machines doing nothing during the last per...
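Roughly the shape of the first Lambda, as a sketch – it assumes the ClearML APIClient’s workers.get_all() exposes a task attribute on busy workers, which is the part I’d double-check:
```python
import boto3
from clearml.backend_api.session.client import APIClient


def lambda_handler(event, context):
    """Push a 1/0 'busy' datapoint per worker to CloudWatch (sketch)."""
    client = APIClient()              # credentials come from the usual clearml.conf / env vars
    cloudwatch = boto3.client("cloudwatch")

    for worker in client.workers.get_all():
        # Assumption: busy workers carry a task reference, idle ones do not
        busy = 1.0 if getattr(worker, "task", None) else 0.0
        cloudwatch.put_metric_data(
            Namespace="ClearML/Workers",                     # made-up namespace
            MetricData=[{
                "MetricName": "Busy",
                "Dimensions": [{"Name": "WorkerId", "Value": worker.id}],
                "Value": busy,
                "Unit": "Count",
            }],
        )
```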
And maybe adding idle time spent without a job to the API is not that bad an idea 😉
Lambdas are designed to be short-lived; I don’t think it’s a good idea to run one in a loop, TBH.
Tricky question!
I see this ASG with a TargetTrackingPolicy for both scale-up (if queue size > 0) and scale-down, but scale-down goes (additionally or only) through a custom policy – check whether a specific machine can be shut down. For this we need to make sure there’s no job running there. Two ways to do it –
- Instance protection set on/off, which is simple.
- Compare the machines that the ASG wants to shut down with the machines having `tasks {}` retrieved from the API. If a task is running, avoid ...
So this is not for end-user convenience like sending Slack messages, but rather system-related hooks useful for auto-scaling, internal APIs and such. If this functionality is not available out of the box, we’d need to resort to looking into scaling-in in a different way. We are thinking of:
Scaling in on very low group-average CPU/GPU usage – not reliable, because a machine could be running data uploading or other low-load work. Using an https://docs.aws.amazon.com/autoscaling/ec2/userguide/lambda-c...
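The second option (the truncated link is to the Lambda-based custom termination policy) would look roughly like this – a sketch only; is_busy is a hypothetical helper over the ClearML worker data, and the event/response shape is the custom-termination-policy contract as far as I understand it:
```python
def is_busy(instance_id: str) -> bool:
    """Hypothetical helper: True if the ClearML worker on this instance is running a task."""
    raise NotImplementedError


def lambda_handler(event, context):
    # The event comes from the ASG's custom termination policy and lists the
    # candidate instances it is allowed to terminate.
    candidates = [inst["InstanceId"] for inst in event.get("Instances", [])]

    # Only hand back the instances that are not running any ClearML task.
    idle = [instance_id for instance_id in candidates if not is_busy(instance_id)]

    # The ASG terminates only the instances we return here; returning an empty
    # list means nothing gets terminated in this round.
    return {"InstanceIDs": idle}
```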
This is a re-implementation, I’d say.
Every instance is running an agent in docker mode. One agent = one task for autoscaling purposes.
Hi SuccessfulKoala55, just wondering – you mentioned the open-source version of the autoscaler code, but where is it hosted? I’ve only found https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py, but it looks like code to launch a managed auto-scaler instead 🙂
Thank you, Martin. Probably then a simple Lambda that constantly monitors the workers and sets/unsets the protection flag should work. Though I’d avoid writing a timestamp to any kind of state. What if I write the last-active time in an instance tag? This could be a solution… `w = get_clearml_workers(); for instance in w: if instance['processing_job'] is True: instance_tag['last_job_seen'] = current_time() else: compare_times_and_allow_shutdown_if_idle()` ...
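Fleshing that pseudocode out a bit, as a sketch – get_clearml_workers and the busy check are stand-ins for whatever the ClearML API actually returns, and the tag name and idle threshold are made up:
```python
import datetime

import boto3

IDLE_MINUTES = 15                     # arbitrary idle threshold
TAG_KEY = "clearml:last_job_seen"     # made-up tag name

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")


def get_clearml_workers():
    """Hypothetical helper: list of dicts with 'instance_id' and 'processing_job'."""
    raise NotImplementedError


def handle_worker(worker: dict) -> None:
    now = datetime.datetime.now(datetime.timezone.utc)
    instance_id = worker["instance_id"]

    if worker["processing_job"]:
        # Busy: remember on the instance itself when we last saw a job.
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[{"Key": TAG_KEY, "Value": now.isoformat()}],
        )
        return

    # Idle: read the tag back and drop scale-in protection once idle long enough.
    tags = ec2.describe_tags(
        Filters=[
            {"Name": "resource-id", "Values": [instance_id]},
            {"Name": "key", "Values": [TAG_KEY]},
        ]
    )["Tags"]
    if not tags:
        return
    last_seen = datetime.datetime.fromisoformat(tags[0]["Value"])
    if (now - last_seen) > datetime.timedelta(minutes=IDLE_MINUTES):
        asg = autoscaling.describe_auto_scaling_instances(InstanceIds=[instance_id])
        asg_name = asg["AutoScalingInstances"][0]["AutoScalingGroupName"]
        autoscaling.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=False,
        )


def lambda_handler(event, context):
    for worker in get_clearml_workers():
        handle_worker(worker)
```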
Yes, why not. I think it's also an option.
I'd assume not, trust John here.