Answered

Hello everyone, I’m currently facing an issue while using Cloud ClearML with aws_autoscaler.py. Occasionally, some workers become unusable when an EC2 instance is terminated, either manually or by aws_autoscaler.py. These workers are displayed in the UI with the message “Update Time N minutes ago”. The main problem is that these workers block the entire queue, preventing the start of new tasks. When I enqueue a new task, it remains pending because the autoscaler recognizes the existing worker and doesn’t attempt to start a new EC2 instance. As a result, the only solution is to wait for the timeout of 10 minutes until the worker is removed by app.clear.ml.
Solutions I’ve considered:

  1. I’ve tried removing the worker programmatically using the “workers.unregister” method; however, it only worked within the same session that called workers.register (see the sketch after this list). Note that I last checked this functionality a year ago, so it might have changed since then.
  2. The 10-minute timeout is not configurable and cannot be changed in app.clear.ml.
  3. While I appreciate the convenience of the cloud service, I’m hesitant to deploy an on-premise version of app.clear.ml.
If anyone knows of a workaround for this issue, please let me know. Your assistance would be greatly appreciated. Thank you.
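
For reference, here is a minimal sketch of the programmatic cleanup attempt from point 1, using the ClearML Python APIClient. It assumes a clearml.conf with credentials for app.clear.ml; the worker name below is a hypothetical placeholder, and whether unregister takes effect outside the registering session is exactly the open question above.

    from clearml.backend_api.session.client import APIClient

    client = APIClient()

    # List the currently known workers to find the stale one by its id.
    for worker in client.workers.get_all():
        print(worker.id, getattr(worker, "last_activity_time", None))

    # Attempt to remove the stale worker. As noted in point 1, this only
    # seemed to take effect from the same session that called workers.register.
    client.workers.unregister(worker="aws-autoscaler:dyn-worker-1")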
  
  
Posted 10 months ago

Answers 3


Yes. I’ve done some debugging and discovered that a process started from the user-data script doesn’t receive SIGTERM on instance termination, so the worker is unable to shut down gracefully and unregister.

  
  
Posted 10 months ago

Hi @<1571308079511769088:profile|GentleParrot65> , ideally you shouldn’t be terminating instances manually. However, do you mean that the autoscaler spins down a machine, still recognizes it as running, and refuses to spin up a new one?

  
  
Posted 10 months ago

More investigation showed that there is a problem with cloud-init. When I connect via SSH and start the process with “nohup python … &”, everything works: the process receives SIGTERM on instance termination. A process started by cloud-init (the user-data script) receives no signals on instance termination (although it does receive signals sent with kill <pid>). I’ve tried the following:

  • start with nohup python3 -m clearml-agent … &
  • start the agent with the --detached flag.
Neither works, so it looks like a bug. A minimal signal probe is sketched below.
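
To make the check above concrete, here is a minimal sketch of such a signal probe (not the actual script from this thread; the log path is an arbitrary assumption). Run it once from an SSH shell with nohup … & and once from the user-data script, terminate the instance, and compare the logs.

    # signal_probe.py: hedged sketch of the SIGTERM check described above.
    import signal
    import sys
    import time

    LOG = "/tmp/sigterm_probe.log"  # arbitrary location for the probe log

    def log(message):
        # Append a timestamped line so both runs can be compared afterwards.
        with open(LOG, "a") as f:
            f.write(f"{time.ctime()}: {message}\n")

    def on_sigterm(signum, frame):
        log(f"received signal {signum}, exiting")
        sys.exit(0)

    signal.signal(signal.SIGTERM, on_sigterm)
    log("probe started, waiting for SIGTERM")

    while True:
        time.sleep(5)

If the copy launched by cloud-init never logs the signal while the copy started over SSH does, that points at how the user-data process is launched rather than at the agent itself.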
  
  
Posted 10 months ago