Hi @<1571308079511769088:profile|GentleParrot65>, ideally you shouldn't be terminating instances manually. However, do you mean that the autoscaler spins down a machine but still recognizes it as running and refuses to spin up a new one?
Yes. I’ve done some debugging and discovered that a process started from the user-data script doesn’t receive SIGTERM on instance termination, so the worker is unable to shut down gracefully and unregister.
Further investigation showed that the problem is with cloud-init. When I connect via ssh and start the process with “nohup python … &”, everything works: the process receives SIGTERM on instance termination. A process started by cloud-init (the user-data script) receives no signals on instance termination, although it does receive signals sent with kill <pid>. I’ve tried the following:
- start with nohup python3 -m clearml-agent … &
- start the agent with the --detached flag

Neither works, so it looks like a bug.
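
To narrow this down, a small signal probe could be started the same way as the agent (once from ssh, once from the user-data script) to confirm which signals actually arrive. This is only a minimal sketch: the names `install_handlers`, `record_signal` and `wait_for_signal` are made up for illustration and are not part of clearml-agent.

```python
import os
import signal
import time

# Names of the signals seen so far (hypothetical helper state, not a ClearML API).
received = []


def record_signal(signum, frame):
    # Store the signal name so it can be inspected or logged afterwards.
    received.append(signal.Signals(signum).name)


def install_handlers():
    # SIGTERM is what a graceful shutdown is expected to deliver;
    # SIGHUP is included because it is sent when a controlling terminal closes.
    signal.signal(signal.SIGTERM, record_signal)
    signal.signal(signal.SIGHUP, record_signal)


def wait_for_signal(timeout_s=1.0, poll_s=0.01):
    # Poll until a signal has been recorded or the timeout expires,
    # then return a copy of everything recorded so far.
    deadline = time.monotonic() + timeout_s
    while not received and time.monotonic() < deadline:
        time.sleep(poll_s)
    return list(received)
```

Calling `install_handlers()` followed by `wait_for_signal(timeout_s=3600)` and writing the result to a file would show whether the probe sees SIGTERM when the instance is terminated. If it only does so when launched from ssh, that points at how cloud-init runs the user-data script rather than at the agent itself.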