That's great to know. Thank you AgitatedDove14. I might have gone wrong somewhere else, so I'll double-check.
Ah, the 2-second grace period answers a question I had. I tried to hijack the Task's signal handler so I could do my exit cleanup and then run the Task's handler, but it didn't seem to work. I think I must have triggered the 2s grace period and had my task terminated.
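For anyone reading along, the handler-chaining idea described above looks roughly like this. This is a hedged sketch, not trains/clearml code: `install_chained_handler` and the `cleanup` callback are hypothetical names, and as noted above, the agent's ~2-second grace period may still kill the process before cleanup finishes.

```python
import signal

# Hypothetical sketch: run our own cleanup first, then delegate to whatever
# handler the library (e.g. the Task) had already installed for this signal.
def install_chained_handler(signum, cleanup):
    previous = signal.getsignal(signum)  # handler installed before us, if any

    def handler(sig, frame):
        cleanup()  # our exit cleanup (hypothetical callback)
        if callable(previous):
            # delegate to the original handler
            previous(sig, frame)
        else:
            # no Python-level handler: restore default and re-raise
            signal.signal(sig, signal.SIG_DFL)
            signal.raise_signal(sig)

    signal.signal(signum, handler)
```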
I think I can work around this right now by running my tasks manually without trains-agent, but I'd love a way to do something on exit. AgitatedDove14 I'd be happy to create an issue. I think the solution might be a bit more in...
Sounds good AgitatedDove14 . I'll get an issue started. Thanks for the discussion!
AgitatedDove14 sorry if that wasn't clear. I think the issue is that when trains-agent runs the script, none of the flag values are set until the Task object is initialized. For that to happen, the task object needs to know which project/task to connect to, which I presume is via the project_name and task_name parameters. If those parameters are themselves dependent on flags, then they will be uninitialized when trains-agent runs the script, as it does not run it with any comman...
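A minimal sketch of the chicken-and-egg problem described above (the `--dataset` flag and project naming scheme are hypothetical, and the Task.init call is commented out since it depends on a trains installation): anything computed from the flags before Task.init() connects the parser only ever sees the parser's defaults when the agent runs the script.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--dataset", default="mnist")  # hypothetical flag
args = parser.parse_args([])  # under the agent, no CLI args are passed

# Problem: the project name is derived from a flag that still holds its
# local default here, because the agent injects values at Task.init() time.
project = f"experiments/{args.dataset}"  # always "experiments/mnist" under the agent

# from trains import Task
# task = Task.init(project_name=project, task_name="train")  # too late: value baked in
```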
SuccessfulKoala55 that's good to know. I moved the signal handler registration above the call to Task.init() as you suggested. This is what I see when the script is terminated manually:
```
I0526 07:46:14.391154 140262441822016 engine.py:837] Engine run starting with max_epochs=100.
I0526 07:46:14.542132 140262441822016 train_utils.py:223] Epoch[1] Iter[1] Loss: 0.43599218130111694
I0526 07:46:24.078526 140262441822016 train_utils.py:46] 2 signal intercepted.
I0526 07:46:24.078...
```
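The register-before-Task.init pattern behind that log might be sketched like this (hedged: `make_handler` and the `save_checkpoint` callback are hypothetical names, and the Task.init call is shown commented out):

```python
import signal
import sys

# Hypothetical save-on-terminate sketch: register handlers *before*
# Task.init() so they are not overwritten by the Task's own handler.
def make_handler(save_checkpoint):
    def handler(sig, frame):
        print(f"{sig} signal intercepted.")
        save_checkpoint()  # persist model/optimizer state (hypothetical callback)
        sys.exit(0)        # then exit cleanly
    return handler

signal.signal(signal.SIGTERM, make_handler(lambda: None))
signal.signal(signal.SIGINT, make_handler(lambda: None))

# from trains import Task
# task = Task.init(project_name="...", task_name="...")  # only after registration
```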
AgitatedDove14 I'm definitely after a graceful abort from a long experiment. I don't necessarily want to throw the state away but I don't want to have to recover everything from checkpoints, hence the save-on-terminate. If there's another way I should be looking at it I'd love to get your thoughts.