Hello! I’m wondering if there is an option to run a termination hook script

Hello! I’m wondering if there is an option to run a termination hook script at the end of the docker job execution (something like https://clear.ml/docs/latest/docs/guides/docker/extra_docker_shell_script/ )? This would be super useful for calling the instance protection switch in a self-managed ASG in AWS. Any hacky ideas are also welcome 🙂 Thank you.

  
  
Posted 2 years ago

Answers 24


Hi SparklingHedgehong28
What would be the use of an "end of docker hook"? Is this like an abort callback? A completion callback?

instance protection

Do you mean like when the instance just dies (like a spot instance in AWS)?

  
  
Posted 2 years ago

It’s like a completion hook that runs when the job terminates (whether success or failure).
What I’m thinking of: instance scale-in in an ASG doesn’t happen if instance protection is enabled:
  1. The agent fetches a job and starts the container; instance protection is enabled via an API call run in extra_docker_shell_script, and the job is launched.
  2. The job finishes; instance protection is disabled in this post-run hook, and the instance may then be terminated. (See the sketch below.)
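For reference, a minimal sketch of what such a pre/post hook could call, assuming boto3, an instance launched inside an ASG, and IMDSv1 metadata access; the ASG name is a made-up placeholder:

    import boto3
    import urllib.request

    ASG_NAME = "clearml-workers-asg"  # hypothetical ASG name

    def _instance_id() -> str:
        # EC2 instance metadata returns the current instance id
        # (assumes IMDSv1 is reachable; IMDSv2 would need a session token)
        url = "http://169.254.169.254/latest/meta-data/instance-id"
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.read().decode()

    def set_protection(enabled: bool) -> None:
        # Toggle scale-in protection for this instance in the ASG
        boto3.client("autoscaling").set_instance_protection(
            InstanceIds=[_instance_id()],
            AutoScalingGroupName=ASG_NAME,
            ProtectedFromScaleIn=enabled,
        )

    # set_protection(True)  would run in the pre-run hook (e.g. extra_docker_shell_script)
    # set_protection(False) would run in the post-run / termination hook discussed here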

  
  
Posted 2 years ago

So this is not for end-user convenience like sending Slack messages, but rather for system-related hooks useful for auto-scaling, internal APIs and such. If this functionality is not available out of the box, we’d need to handle scale-in in a different way. We are thinking of:
  1. Scaling in on very low group-average CPU/GPU usage. Not reliable, because a machine could be running data uploads or other low-load work.
  2. Using a custom termination policy Lambda ( https://docs.aws.amazon.com/autoscaling/ec2/userguide/lambda-custom-termination-policy.html ) to compare the Lambda’s input list with the ClearML API call results, to eliminate machines that still have jobs assigned to them. More work needed, and it’s questionable whether it is reliable. (See the sketch below.)
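A rough sketch of what that custom termination policy Lambda could look like, assuming the event/response contract described in the AWS docs linked above, and a hypothetical helper get_busy_instance_ids() that maps ClearML workers currently processing a task to their EC2 instance ids:

    def get_busy_instance_ids() -> set:
        # Hypothetical helper: query the ClearML API (e.g. workers.get_all)
        # and resolve workers that are running a task to EC2 instance ids.
        raise NotImplementedError

    def lambda_handler(event, context):
        # The ASG passes the candidate instances it would like to terminate
        candidates = [i["InstanceId"] for i in event.get("Instances", [])]
        busy = get_busy_instance_ids()
        # Only allow termination of instances not running any ClearML task
        return {"InstanceIDs": [i for i in candidates if i not in busy]}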

  
  
Posted 2 years ago

Thanks SparklingHedgehong28
So I think I'm missing information on what you call "instance protection".
Do you mean respinning spot instances? Or is it a way to review the performance of the AWS ASG (i.e. like a watchdog of sorts)?

  
  
Posted 2 years ago

Hey AgitatedDove14 , basically https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-instance-protection.html would allow blocking the machine from being scaled in when there is a scale-in event in the ASG.
The ASG is responsible for spinning up based on demand in the ClearML queue, but spinning down is less trivial – we cannot just spin down when the queue is empty (some machine could still be running something important!)

  
  
Posted 2 years ago

basically ... would allow blocking the machine from being scaled in when ...

Oh this is what I was missing 🙂 That makes sense to me!
So what you are saying is: when the AWS autoscaler agent launches a Task, inside the container you set the "protection flag", and when the Task ends you unset the "protection flag".
Is that correct?

  
  
Posted 2 years ago

Yes, exactly this.

  
  
Posted 2 years ago

Okay, that kind of makes sense. Now my follow-up question is: how are you using the ASG? I mean the ClearML autoscaler does not use it, so I just wonder what the big picture is, before we solve this little annoyance 🙂

  
  
Posted 2 years ago

Tricky question!

I see this ASG with a TargetTrackingPolicy for both scale-up (if queue size > 0) and scale-down, but scale-down goes (additionally or only) through a custom policy – check whether a specific machine can be shut down. For this we need to make sure there's no job running there. Two ways to do it:

  1. Instance protection set on/off, which is simple.
  2. Compare the machines that the ASG wants to shut down with the machines that have tasks (retrieved from the API). If a task is running, avoid shutting down that machine. This can happen in a Lambda function.
  
  
Posted 2 years ago

Hmm, so this is kind of a hack for ClearML AWS autoscaling?
And is every instance running an agent? Or a single Task?

  
  
Posted 2 years ago

This is a re-implementation, I'd say.
Every instance is running an agent in docker mode. One agent = one task, for autoscaling purposes.

  
  
Posted 2 years ago

Ohh I see, so basically the ASG should check whether the agent is idle, rather than whether the Task is running?

  
  
Posted 2 years ago

Yes, why not. I think it's also an option.

  
  
Posted 2 years ago

So this should be easier to implement, and would probably be safer.
You can basically query all the workers (i.e. agents) and check if they are running a Task; if they are not (for a while), remove the "protection flag".
wdyt?
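As a rough sketch of that worker query, assuming the clearml Python package's APIClient; the exact field that carries the running task may differ between server versions, so treat the attribute name as an assumption to verify:

    from clearml.backend_api.session.client import APIClient

    client = APIClient()
    for worker in client.workers.get_all():
        # Assumption: a busy worker exposes its current task on a "task" field
        running_task = getattr(worker, "task", None)
        print(worker.id, "busy" if running_task else "idle")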

  
  
Posted 2 years ago

Sounds okay... Will I need state to calculate the idle time over time, or is there some idle param in the API answer? Because ideally I'd run this in a stateless Lambda.

  
  
Posted 2 years ago

A single query will return whether the agent is running anything, and for how long, but I do not think you can get the idle time ...

  
  
Posted 2 years ago

Thank you, Martin. Probably then a simple Lambda that constantly monitors the workers and sets/unsets the protection flag should work. Though I’d avoid writing a timestamp to any kind of state. What if I write the last active state in an instance tag? This could be a solution…
    w = get_clearml_workers()
    for instance in w:
        if instance['processing_job'] is True:
            instance_tag['last_job_seen'] = current_time()
        else:
            compare_times_and_allow_shutdown_if_idle()
    ...
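A slightly more concrete version of the instance-tag idea, assuming boto3 and that each ClearML worker can be mapped to its EC2 instance id; the tag key, grace period, and helper names are illustrative only:

    import time
    import boto3

    ec2 = boto3.client("ec2")
    IDLE_GRACE_SEC = 15 * 60  # assumed grace period before allowing shutdown

    def mark_last_job_seen(instance_id: str) -> None:
        # Record "last seen busy" as an EC2 instance tag instead of external state
        ec2.create_tags(
            Resources=[instance_id],
            Tags=[{"Key": "last_job_seen", "Value": str(int(time.time()))}],
        )

    def idle_long_enough(instance_id: str) -> bool:
        # Read the tag back and decide whether the instance may be scaled in
        desc = ec2.describe_tags(
            Filters=[
                {"Name": "resource-id", "Values": [instance_id]},
                {"Name": "key", "Values": ["last_job_seen"]},
            ]
        )
        tags = desc.get("Tags", [])
        if not tags:
            return True  # never seen busy, safe to consider idle
        return time.time() - int(tags[0]["Value"]) > IDLE_GRACE_SEC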

  
  
Posted 2 years ago

And maybe adding idle time spent without a job to the API is not that bad an idea 😉

  
  
Posted 2 years ago

And maybe adding idle time spent without a job to the API is not that bad an idea 😉
yes, adding that to the feature list 🙂

What if I write the last active state in an instance tag? This could be a solution…

I love this hack, yes this should just work.
BTW: if your Lambda is a loop that is constantly checking, there is no need to actually store the "last idle timestamp" as a tag, no?

  
  
Posted 2 years ago

Lambdas are designed to be short-lived; I don’t think it’s a good idea to run one in a loop, TBH.

  
  
Posted 2 years ago

Lambdas are designed to be short-lived; I don’t think it’s a good idea to run one in a loop, TBH.

Yeah, you are right, but maybe it would be fine to launch the Lambda, have it run for 30-60 sec (i.e. checking idle time for 1 min, stateless, only keeping track inside the execution context), then take it down.
What I'm trying to solve here is (1) a quick way to understand whether the agent is actually idling or just between Tasks, and (2) still keep the "idle watchdog" short-lived, so that it can be launched with a Lambda function once every X min.
wdyt?

  
  
Posted 2 years ago

Hey AgitatedDove14 , thanks for having this discussion 🙂
We are collecting machine/task data from the ClearML API using a Lambda and pushing it to CloudWatch as 1 or 0 datapoints per machine, depending on whether the machine is doing work or not. Another Lambda, run on an ASG termination event, compares the incoming machine list with the list of machines from CloudWatch that have not been running anything for x minutes, and returns the intersection. The ASG then terminates only the machines that were doing nothing during the last period. This doesn’t account for a race condition, so a task could still be assigned at the last moment, just before termination. But this could be better tuned with denser data points in CloudWatch.
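A minimal sketch of the metric-publishing side of that setup, assuming boto3 and the clearml APIClient; the namespace, metric name, and the "task" attribute are illustrative assumptions:

    import boto3
    from clearml.backend_api.session.client import APIClient

    cloudwatch = boto3.client("cloudwatch")

    def publish_busy_metrics():
        # One datapoint per machine: 1 if its worker is processing a task, else 0
        for worker in APIClient().workers.get_all():
            busy = 1 if getattr(worker, "task", None) else 0
            cloudwatch.put_metric_data(
                Namespace="ClearML/Workers",        # hypothetical namespace
                MetricData=[{
                    "MetricName": "BusyWorker",     # hypothetical metric name
                    "Dimensions": [{"Name": "WorkerId", "Value": worker.id}],
                    "Value": busy,
                }],
            )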

  
  
Posted 2 years ago

SparklingHedgehong28 this is actually quite cool! Still not sure why not just use the built in autoscaler https://github.com/allegroai/clearml/tree/master/examples/services/aws-autoscaler , but it is a really cool usage of ASG 🤩

  
  
Posted 2 years ago

Thanks. I guess there are too many moving parts in the official implementation that would need adaptation and wrapping up – such as the use of credentials instead of IAM, since it's designed to work cross-cloud (or cloud-agnostic) – hence for us it's easier to reinvent the wheel. 🙃

  
  
Posted 2 years ago