Answered
Hello! I’m Wondering If There Is An Option To Run A Termination Hook Script

Hello! I’m wondering if there is an option to run a termination hook script at the end of the docker job execution (such as https://clear.ml/docs/latest/docs/guides/docker/extra_docker_shell_script/ )? This would be super-useful to call the instance protection switch in a self-setup ASG in AWS. Any hacky ideas also welcome 🙂 Thank you.
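For context, what I'd want to call from extra_docker_shell_script (and again from a post-run hook, if one existed) is roughly the following – just a sketch, assuming the ASG name comes from an environment variable and the instance role is allowed to call autoscaling:SetInstanceProtection; all names are illustrative:

# Hypothetical helper: toggle ASG scale-in protection for the instance we run on.
# Assumes ASG_NAME is set in the environment and the instance role allows
# autoscaling:SetInstanceProtection. Uses the IMDSv1 metadata endpoint;
# IMDSv2 would need a session token first.
import os
import sys

import boto3
import requests


def set_protection(protected: bool) -> None:
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
    ).text
    boto3.client("autoscaling").set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=os.environ["ASG_NAME"],
        ProtectedFromScaleIn=protected,
    )


if __name__ == "__main__":
    # e.g. `python protection.py on` before the job, `python protection.py off` after it
    set_protection(sys.argv[1] == "on")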

  
  
Posted 2 years ago

Answers 24


Hey AgitatedDove14 , basically https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-instance-protection.html would allow blocking the machine from being scaled in when there is a scale-in event in the ASG.
The ASG is responsible for spinning up based on demand in the ClearML queue, but spinning down is less trivial – we cannot just spin down when the queue is empty (some machine could still be running something important!)

  
  
Posted 2 years ago

It’s like a completion hook for when the job terminates (whether success or failure).
What I’m thinking of: instance scale-in in an ASG doesn’t happen if instance protection is enabled:
Agent fetches job and starts container; instance protection is enabled with an API call run in extra_docker_shell_script, job launched. Job finishes; instance protection is disabled in this post-run hook, instance may be terminated.

  
  
Posted 2 years ago

Okay, that kind of makes sense. Now my follow-up question is: how are you using the ASG? I mean the ClearML autoscaler does not use it, so I just wonder what the big picture is before we solve this little annoyance 🙂

  
  
Posted 2 years ago

Sounds okay... Will I need state to calculate the idle time over time, or is there some idle param in the API answer? Because ideally I'd run this in a stateless Lambda.

  
  
Posted 2 years ago

Yes, exactly this.

  
  
Posted 2 years ago

SparklingHedgehong28 this is actually quite cool! Still not sure why not just use the built-in autoscaler https://github.com/allegroai/clearml/tree/master/examples/services/aws-autoscaler , but it is a really cool usage of ASG 🤩

  
  
Posted 2 years ago

Thanks. I guess there are too many moving parts in the official implementation that need adaptation and wrapping up – such as the use of credentials instead of IAM, since it's designed to work cross-cloud (or cloud-agnostic) – hence for us it's easier to reimplement the wheel. 🙃

  
  
Posted 2 years ago

Hi SparklingHedgehong28
What would be the use for the "end of docker hook"? Is this like an abort callback? Completion?

instance protection

Do you mean like when the instance just died (like spot in AWS)?

  
  
Posted 2 years ago

Lambdas are designed to be short-lived, I don’t think it’s a fine idea to run it in a loop TBH.

Yeah, you are right, but maybe it would be fine to launch, have the Lambda run for 30-60 sec (i.e. checking idle time for 1 min, stateless, only keeping track inside the execution context), then take it down.
What I'm trying to solve here is (1) a quick way to understand if the agent is actually idling or just between Tasks, and (2) keeping the "idle watchdog" short-lived, so that it can be launched with a Lambda function once every X min.
wdyt?

  
  
Posted 2 years ago

Hey AgitatedDove14 , thanks for having this discussion 🙂
We are collecting machine/task data from the ClearML API using a Lambda and pushing it to CloudWatch as 1 or 0 datapoints per machine (1 for a machine doing work, 0 otherwise). Another Lambda, run on an ASG termination event, compares the incoming machine list with the list of machines from CW which have not been running anything for X minutes, and returns the intersection. The ASG then terminates only the machines that were doing nothing during the last period. This doesn’t account for a race condition, so perhaps a task could still be assigned at the last moment, just before termination. But this could be better tuned with denser datapoints in CloudWatch.
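The collector Lambda is roughly along these lines – a sketch only; the CloudWatch namespace, metric and dimension names are illustrative, and I'm assuming the ClearML APIClient's workers.get_all() exposes a task field while a worker is busy:

# Sketch of the "collector" Lambda: one 1/0 datapoint per ClearML worker into CloudWatch.
import boto3
from clearml.backend_api.session.client import APIClient

cloudwatch = boto3.client("cloudwatch")


def handler(event, context):
    workers = APIClient().workers.get_all()
    metric_data = [
        {
            "MetricName": "WorkerBusy",
            "Dimensions": [{"Name": "WorkerId", "Value": w.id}],
            # Assumption: `task` is only set while the worker executes something.
            "Value": 1.0 if getattr(w, "task", None) else 0.0,
        }
        for w in workers
    ]
    if metric_data:
        # Note: put_metric_data accepts a limited number of items per call; batch for large fleets.
        cloudwatch.put_metric_data(Namespace="ClearML/Agents", MetricData=metric_data)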

  
  
Posted 2 years ago

And maybe adding idle time spent without a job to API is not that a bad idea 😉
yes, adding that to the feature list 🙂

What if I write the last active state in an instance tag? This could be a solution…

I love this hack, yes this should just work.
BTW: if your Lambda is a for loop that is constantly checking, there is no need to actually store the "last idle timestamp check" as a tag, no?

  
  
Posted 2 years ago

A single query will return whether the agent is running anything, and for how long, but I do not think you can get the idle time ...

  
  
Posted 2 years ago

And maybe adding idle time spent without a job to API is not that a bad idea 😉

  
  
Posted 2 years ago

Lambdas are designed to be short-lived, I don’t think it’s a fine idea to run it in a loop TBH.

  
  
Posted 2 years ago

Tricky question!

I see this ASG with a TargetTrackingPolicy for both scale-up (if queue size > 0) and scale-down, but scale-down goes (additionally or only) through a custom policy – check whether a specific machine can be shut down. For this we need to make sure there's no job running there. Two ways to do it –

  1. Instance protection set on/off, which is simple.
  2. Compare the machines the ASG wants to shut down with the machines that have tasks, retrieved from the API. If a task is running, avoid shutting down that machine. This can happen in a Lambda function.
  
  
Posted 2 years ago

So this is not for end-user convenience like sending Slack messages, but rather system-related hooks useful for auto-scaling, internal APIs and such. If this functionality is not available out of the box, we’d need to resort to scaling in a different way. We are thinking of:
  1. Scaling in on very low group average CPU/GPU usage. Not reliable, because a machine could be running data uploads or other low-load work.
  2. Using a https://docs.aws.amazon.com/autoscaling/ec2/userguide/lambda-custom-termination-policy.html to compare the Lambda input list with the ClearML API call results, to eliminate machines that still have jobs assigned to them (a rough sketch follows below). More work needed, and questionable reliability.
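A rough sketch of what that custom termination policy Lambda could look like – the event/response shape follows the AWS docs for custom termination policies, while matching ClearML workers to instance ids by worker name is only an assumption about how we'd name the agents:

# Custom termination policy Lambda: only approve instances no ClearML worker is busy on.
from clearml.backend_api.session.client import APIClient


def handler(event, context):
    # Workers currently executing a Task (assumes `task` is set only while busy).
    busy_workers = {
        w.id for w in APIClient().workers.get_all() if getattr(w, "task", None)
    }
    # Instances the ASG proposes to terminate.
    candidates = [i["InstanceId"] for i in event.get("Instances", [])]
    # Approve only instances whose id does not appear in any busy worker id
    # (assumption: the agent's worker name embeds the EC2 instance id).
    safe = [
        instance_id
        for instance_id in candidates
        if not any(instance_id in worker_id for worker_id in busy_workers)
    ]
    return {"InstanceIDs": safe}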

  
  
Posted 2 years ago

basically

would allow blocking the machine from being scaled-in when

Oh this is what I was missing 🙂 That makes sense to me!
So what you are saying is that when the agent launches a Task, inside the container you will set the "protection flag", and when the Task ends, you will unset the "protection flag".
Is that correct?

  
  
Posted 2 years ago

This is a re-implementation, I'd say.
Every instance is running an agent in docker mode. One agent = one task for autoscaling purposes.

  
  
Posted 2 years ago

Thanks SparklingHedgehong28
So I think I'm missing information on what you call "instance protection"?
You mean like respinning spot instances? Or is it a way to review the performance of the AWS ASG (i.e. like a watchdog of sorts)?

  
  
Posted 2 years ago

Ohh I see, so basically the ASG should check if the agent is idle, rather than whether a Task is running?

  
  
Posted 2 years ago

So this should be easier to implement, and would probably be safer.
You can basically query all the workers (i.e. agents) and check if they are running a Task; then if they are not (for a while), remove the "protection flag".
wdyt?

  
  
Posted 2 years ago

Thank you, Martin. Probably then a simple Lambda that constantly monitors the workers and sets/unsets the protection flag should work. Though I’d avoid writing a timestamp to any kind of state. What if I write the last active state in an instance tag? This could be a solution…
w = get_clearml_workers()
for instance in w:
    if instance['processing_job'] is True:
        instance_tag['last_job_seen'] = current_time()
    else:
        compare_times_and_allow_shutdown_if_idle()
...
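A more concrete (still hypothetical) version of the above using boto3 – it assumes the agent's worker id embeds the EC2 instance id, that the role can tag instances and toggle scale-in protection, and that the ASG name and idle threshold shown are illustrative:

import time

import boto3
from clearml.backend_api.session.client import APIClient

ASG_NAME = "clearml-workers"   # hypothetical ASG name
IDLE_SECONDS = 15 * 60         # allow shutdown after 15 minutes without a task

ec2 = boto3.client("ec2")
asg = boto3.client("autoscaling")


def instance_id_of(worker) -> str:
    # Assumption: the worker id embeds the instance id after the last ':'.
    return worker.id.rsplit(":", 1)[-1]


def check_workers():
    now = int(time.time())
    for worker in APIClient().workers.get_all():
        instance_id = instance_id_of(worker)
        if getattr(worker, "task", None):
            # Busy: refresh the "last seen with a job" tag.
            ec2.create_tags(
                Resources=[instance_id],
                Tags=[{"Key": "last_job_seen", "Value": str(now)}],
            )
        else:
            tags = ec2.describe_tags(
                Filters=[
                    {"Name": "resource-id", "Values": [instance_id]},
                    {"Name": "key", "Values": ["last_job_seen"]},
                ]
            )["Tags"]
            last_seen = int(tags[0]["Value"]) if tags else 0
            if now - last_seen > IDLE_SECONDS:
                # Idle long enough: allow the ASG to scale this instance in.
                asg.set_instance_protection(
                    InstanceIds=[instance_id],
                    AutoScalingGroupName=ASG_NAME,
                    ProtectedFromScaleIn=False,
                )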

  
  
Posted 2 years ago

Hmm, so this is kind of a hack for ClearML AWS autoscaling?
And every instance is running an agent? Or a single Task?

  
  
Posted 2 years ago

Yes, why not. I think it's also an option.

  
  
Posted 2 years ago