Ah, the 2-second grace period answers a question I had. I tried to hijack the Task's signal handler so I could do my exit cleanup and then run the Task's handler, but it didn't seem to work. I think I must have triggered the 2s cooldown and had my task terminated.
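For anyone reading along, the "do my cleanup, then run the handler that was already there" idea can be sketched with the standard library alone. This is only an illustration under assumptions: `chain_handler` and `cleanup` are hypothetical names, and nothing here is the trains implementation.

```python
import signal


def chain_handler(signum, cleanup):
    """Install a handler that runs `cleanup`, then the handler registered before it.

    `chain_handler` and `cleanup` are hypothetical names used for illustration.
    """
    previous = signal.getsignal(signum)

    def handler(received_signum, frame):
        cleanup()
        # The earlier handler may be SIG_DFL, SIG_IGN or None, which are not
        # callable, so only forward to it when it is a Python function.
        if callable(previous):
            previous(received_signum, frame)

    signal.signal(signum, handler)
    return previous
```

Given the 2-second limit discussed in this thread, the cleanup would have to finish quickly, and `signal.signal` can only be called from the main thread.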
I think I can work around this for now by running my tasks manually without trains-agent, but I'd love a way to do something on exit. AgitatedDove14 I'd be happy to create an issue. I think the solution might be a bit more involved as a callback because the signal handler might be called in the same thread that also handles the cleanup. As an example, I'm using ignite, and in the signal handler I call the terminate() function on the engine. Whatever graceful exit handler is implemented would need to handle the asynchronicity between the signal handler returning and the script terminating some time later.
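To make the asynchronicity concrete, here is a rough sketch of the pattern: the handler only requests a stop, the loop finishes its current iteration, and the save happens afterwards. `ToyEngine` is a hypothetical stand-in (not ignite's Engine), and `install_graceful_exit`, `train_with_cleanup` and `on_exit` are made-up names.

```python
import signal


class ToyEngine:
    """Hypothetical stand-in for a training engine that has a terminate() method."""

    def __init__(self):
        self.should_terminate = False
        self.completed = 0

    def terminate(self):
        # Only records the request; the loop in run() acts on it later.
        self.should_terminate = True

    def run(self, max_iters, step):
        for i in range(max_iters):
            step(i)
            self.completed += 1
            if self.should_terminate:
                break  # stop once the current iteration has finished
        return self.completed


def install_graceful_exit(engine, signums=(signal.SIGINT, signal.SIGTERM)):
    """Register a handler that asks the engine to stop; return the old handlers."""
    previous = {}

    def handler(signum, frame):
        engine.terminate()  # returns immediately; the script exits later

    for signum in signums:
        previous[signum] = signal.signal(signum, handler)
    return previous


def train_with_cleanup(engine, max_iters, step, on_exit):
    """Run the loop and always call on_exit (e.g. save a model) afterwards."""
    try:
        return engine.run(max_iters, step)
    finally:
        on_exit(engine)
```

The gap between `handler` returning and `on_exit` running is exactly the window a fixed grace period can cut short if an iteration takes longer than the limit.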
SillyPuppy19 yes, you are correct. Actually, I can promise you the callback will be called from a different thread (basically the monitoring thread), so it's on the user to make sure the callback can handle it.
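A minimal sketch of what "the callback can handle it" could look like, assuming the callback only sets a flag and the training thread does the real work. Nothing here is a trains API: `fake_monitor` is a hypothetical stand-in for the monitoring thread, and `make_abort_callback` and `training_loop` are illustrative names.

```python
import threading
import time


def make_abort_callback(stop_event):
    """Build a callback that is safe to call from another thread."""

    def on_abort():
        # Runs on the monitoring thread, so it only sets a thread-safe flag.
        stop_event.set()

    return on_abort


def fake_monitor(callback, delay):
    """Hypothetical stand-in for a monitoring thread: fires callback after delay."""
    timer = threading.Timer(delay, callback)
    timer.daemon = True
    timer.start()
    return timer


def training_loop(max_iters, stop_event, iter_seconds=0.01):
    """Simulated training loop that checks the flag after every iteration."""
    completed = 0
    for _ in range(max_iters):
        time.sleep(iter_seconds)  # simulated work for one iteration
        completed += 1
        if stop_event.is_set():
            break
    # Any checkpoint saving would go here, back on the training thread.
    return completed
```

Keeping the callback this small avoids touching model or optimizer state from two threads at once.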
How about we move this discussion to GitHub?
Sounds good, AgitatedDove14. I'll get an issue started. Thanks for the discussion!
Many thanks 🙂
SuccessfulKoala55 that's good to know. I moved the signal handler registration above the call to Task.init() as you suggested. This is what I should be seeing when the script is terminated manually:
I0526 07:46:14.391154 140262441822016 engine.py:837] Engine run starting with max_epochs=100.
I0526 07:46:14.542132 140262441822016 train_utils.py:223] Epoch Iter Loss: 0.43599218130111694
I0526 07:46:24.078526 140262441822016 train_utils.py:46] 2 signal intercepted.
I0526 07:46:24.078753 140262441822016 engine.py:635] Terminate signaled. Engine will stop after current iteration is finished.
However, what I see is the following:
I0526 07:44:15.416634 140574824470336 engine.py:837] Engine run starting with max_epochs=100.
I0526 07:44:15.517145 140574824470336 train_utils.py:223] Epoch Iter Loss: 0.43599218130111694
2020-05-26 07:44:36 User aborted: stopping task (1)
Once the task is aborted there doesn't seem to be any more log output from the script. That might be because trains is cutting off the log, but I also don't see the model file saved anywhere. I'll keep looking, but thank you for the suggestion!
SillyPuppy19 I think this is a great idea, basically having the ability to have a callback function called before aborting/exiting the process.
Unfortunately, today abort gives the process 2 seconds to quit gracefully and then kills it. It was not designed to just send an abort signal, as these, more often than not, will not actually terminate the process.
Any chance I can ask you to open a GitHub Issue and suggest the callback feature? I have a feeling a few more users would like that ability. WDYT?
AgitatedDove14 I'm definitely after a graceful abort from a long experiment. I don't necessarily want to throw the state away but I don't want to have to recover everything from checkpoints, hence the save-on-terminate. If there's another way I should be looking at it I'd love to get your thoughts.
Hi SillyPuppy19 ,
The trains-agent does call all other hooks registered for SIGINT/SIGTERM - can you make sure you register your hook before calling
SillyPuppy19 are you aborting the experiment, or are you trying to protect against a crash? Is it a callback functionality you are looking for?