Thank you for your answer.
aws_autoscaler.py works as follows (based on my experiments):
- let’s assume that the instance and the worker have started
- no tasks run on the worker for max_idle_time_min
- the autoscaler terminates the instance
- the worker stops sending updates to app.clear.ml
- the worker is still shown in the UI with the message “Update Time a few minutes ago”
- the autoscaler thinks that this worker is still idle because it is still returned by workers.get_all
- when I enqueue task in t...
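To illustrate the stale-worker problem in the steps above, here is a minimal sketch of how worker records could be split into live and stale ones by their last report time. This is not the autoscaler's actual code: the function name `split_stale_workers`, the `stale_after` threshold and the `last_report_time` field on a plain dict are assumptions made for the example, standing in for whatever workers.get_all returns.

```python
from datetime import datetime, timedelta, timezone


def split_stale_workers(workers, now, stale_after=timedelta(minutes=2)):
    """Partition worker records into (live, stale) id lists by last report time.

    `workers` is a list of dicts with hypothetical keys "id" and
    "last_report_time"; a real API response may be shaped differently.
    """
    live, stale = [], []
    for worker in workers:
        age = now - worker["last_report_time"]
        # A worker that has not reported for longer than `stale_after`
        # is treated as gone, even if it is still listed.
        (stale if age > stale_after else live).append(worker["id"])
    return live, stale


# Example data: one worker reported 30 seconds ago, one 10 minutes ago.
now = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
workers = [
    {"id": "aws:i-aaa", "last_report_time": now - timedelta(seconds=30)},
    {"id": "aws:i-bbb", "last_report_time": now - timedelta(minutes=10)},
]
live, stale = split_stale_workers(workers, now)
print("live:", live)
print("stale:", stale)
```

With a check like this, a terminated instance whose worker entry is still listed would not be counted as an idle worker.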
@<1523701087100473344:profile|SuccessfulKoala55> Is it possible to change this parameter on app.clear.ml?
Yes. I’ve done some debugging and discovered that a process started from the user-data script doesn’t receive SIGTERM on instance termination, so the worker is unable to shut down gracefully and unregister.
So does this mean that there is no workaround for the bug described by H4dr1en when using app.clear.ml?
More investigation showed that there is a problem with cloud-init. When I connect via ssh and start the process with “nohup python … &”, everything works: the process receives SIGTERM on instance termination. A process started by cloud-init (user-data script) receives no signals on instance termination (but it does receive signals sent with kill <pid>). I’ve tried the following:
- start with nohup python3 -m clearml-agent … &
- start the agent with the --detached flag
Nothing works, so it looks like a bug.
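For anyone who wants to reproduce this, a small signal logger along these lines could be started in place of the agent, once from an ssh session and once from the user-data script, to compare which signals arrive. It is only a diagnostic sketch: the names `received` and `log_signal` are made up for the example, and the last lines send SIGTERM to the process itself just to show that the handler fires.

```python
import os
import signal
import time

received = []  # names of the signals seen so far


def log_signal(signum, frame):
    """Record and print every signal delivered to this process."""
    name = signal.Signals(signum).name
    received.append(name)
    print(f"{time.strftime('%H:%M:%S')} pid={os.getpid()} got {name}", flush=True)


# Register the same handler for the signals of interest.
for sig in (signal.SIGTERM, signal.SIGHUP, signal.SIGINT):
    signal.signal(sig, log_signal)

# Self-check: equivalent to running `kill <pid>` against this process.
os.kill(os.getpid(), signal.SIGTERM)
time.sleep(0.1)  # give the interpreter a moment to run the handler
print("received:", received)
```

For a real test, replace the self-check with an endless `while True: time.sleep(1)` loop, redirect the output to a file, and then terminate the instance.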