So does this mean that there is no workaround for the bug described by H4dr1en when using app.clear.ml?
Yes. I’ve done some debugging and discovered that a process started from the user-data script doesn’t receive SIGTERM on instance termination, so the worker is unable to shut down gracefully and unregister.
Thank you for your answer.
aws_autoscaler.py works as follows (based on my experiments):
- let’s assume that the instance and the worker are started
- no tasks run on the worker for max_idle_time_min
- the autoscaler terminates the instance
- the worker stops sending updates to app.clear.ml
- the worker is still shown in the UI with the message “Update Time a few minutes ago”
- the autoscaler thinks that this worker is still idle because it is still returned by workers.get_all
- when I enqueue task in t...
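To make the failure mode above concrete, here is a minimal sketch of how idle workers could be separated from ones that have simply gone silent. It is not the aws_autoscaler.py logic and does not call the ClearML API: the WorkerRecord shape, the split_idle_workers helper and the two-minute stale_after threshold are all hypothetical, chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class WorkerRecord:
    # Hypothetical shape, NOT the actual workers.get_all response.
    worker_id: str
    last_update: datetime
    running_task: bool


def split_idle_workers(workers, now, stale_after=timedelta(minutes=2)):
    """Separate idle workers that still report in from ones that went silent."""
    live_idle, stale = [], []
    for w in workers:
        if w.running_task:
            continue  # busy workers are neither idle nor stale
        if now - w.last_update > stale_after:
            stale.append(w.worker_id)  # listed, but no recent update
        else:
            live_idle.append(w.worker_id)
    return live_idle, stale


now = datetime(2023, 1, 1, 12, 0, tzinfo=timezone.utc)
workers = [
    WorkerRecord("fresh", now - timedelta(seconds=30), False),
    WorkerRecord("terminated", now - timedelta(minutes=10), False),
    WorkerRecord("busy", now - timedelta(seconds=5), True),
]
live_idle, stale = split_idle_workers(workers, now)
```

With a check like this, the terminated instance's worker would land in the stale list instead of being counted as an idle worker that can pick up the next task.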
@<1523701087100473344:profile|SuccessfulKoala55> Is it possible to change this parameter on app.clear.ml?
Further investigation showed that the problem is with cloud-init. When I connect via SSH and start the process with “nohup python … &”, everything works: the process receives SIGTERM on instance termination. A process started by cloud-init (the user-data script) receives no signals on instance termination, although it does receive signals sent with kill <pid>. I’ve tried the following:
- start with nohup python3 -m clearml-agent … &
- start the agent with the --detached flag
Neither works, so it looks like a bug.
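For reference, this is roughly what graceful shutdown on SIGTERM looks like in a Python process, and it can double as a small probe to check whether the signal is delivered at all. It is only a sketch using the standard signal module: run_worker and the unregister callback are hypothetical stand-ins, not clearml-agent code, and it does not fix the cloud-init delivery problem itself.

```python
import os
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    # Only set a flag here; the real cleanup happens in the main loop.
    shutdown_requested.set()


# Must be called from the main thread of the process.
signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(poll_seconds=1.0, unregister=lambda: None):
    """Loop until SIGTERM arrives, then run the (hypothetical) unregister step."""
    while not shutdown_requested.wait(poll_seconds):
        pass  # placeholder for polling a queue and running tasks
    unregister()
```

If the unregister callback writes a line to a log file, starting this script from the user-data script and then terminating the instance shows whether SIGTERM ever reaches the process.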