That's great to know. Thank you AgitatedDove14. I might have gone wrong somewhere else, so I'll double-check.
AgitatedDove14 sorry if that wasn't clear. I think the issue is that when trains-agent runs the script, none of the flag values are set until the Task object is initialized. For that to happen, the task object needs to know which project/task to connect to, which I presume is done via the project_name and task_name parameters.
If those parameters themselves depend on flags, then they will be uninitialized when trains-agent runs the script, as it does not run it with any command-line arguments...
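For concreteness, here's a minimal sketch of the pattern being described (the flag names, project name, and task name below are placeholders, not the actual script): if project_name/task_name are read from argparse flags, they come back as their defaults when trains-agent launches the script, because the agent passes no command-line arguments; flag-independent names let Task.init() connect first.

```python
import argparse
from trains import Task

parser = argparse.ArgumentParser()
# Hypothetical flags; when trains-agent launches the script it passes no
# command-line arguments, so these fall back to their defaults (None).
parser.add_argument("--project-name", default=None)
parser.add_argument("--task-name", default=None)
args = parser.parse_args()

# Problematic under trains-agent: both values are still None here, so the
# task cannot tell which project/task it is supposed to connect to.
# task = Task.init(project_name=args.project_name, task_name=args.task_name)

# Works: literal (flag-independent) names let Task.init() connect; trains
# normally hooks argparse, so the argument values recorded on the task can
# then be applied on top of the defaults.
task = Task.init(project_name="my_project", task_name="my_task")
```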
SuccessfulKoala55 that's good to know. I moved the signal handler registration above the call to Task.init() as you suggested. This is what I should be seeing when the script is terminated manually:
` I0526 07:46:14.391154 140262441822016 engine.py:837] Engine run starting with max_epochs=100.
I0526 07:46:14.542132 140262441822016 train_utils.py:223] Epoch[1] Iter[1] Loss: 0.43599218130111694
I0526 07:46:24.078526 140262441822016 train_utils.py:46] 2 signal intercepted.
I0526 07:46:24.078...
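For reference, a minimal sketch of the ordering described above (the project/task names and the save step are placeholders): register the handlers before Task.init() so they are already in place when trains installs its own hooks.

```python
import signal
import sys
from trains import Task

def on_terminate(signum, frame):
    # Placeholder cleanup: persist whatever state you need before exiting,
    # e.g. a final checkpoint, so a manual abort doesn't lose progress.
    print(f"{signum} signal intercepted, saving state before exit")
    # save_checkpoint(...)  # hypothetical save logic
    sys.exit(0)

# Register the handlers *before* Task.init(), as suggested above.
signal.signal(signal.SIGINT, on_terminate)
signal.signal(signal.SIGTERM, on_terminate)

task = Task.init(project_name="my_project", task_name="my_task")
```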
AgitatedDove14 I'm definitely after a graceful abort from a long experiment. I don't necessarily want to throw the state away but I don't want to have to recover everything from checkpoints, hence the save-on-terminate. If there's another way I should be looking at it I'd love to get your thoughts.
Sounds good AgitatedDove14. I'll get an issue started. Thanks for the discussion!
Ah, the 2-second grace period answers a question I had. I tried to hijack the Task's signal handler to see if I could do my exit cleanup and then run the Task's handler, but it didn't seem to work. I think I must have triggered the 2s cooldown and had my task terminated.
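In case it helps anyone trying the same thing, this is roughly the chaining I had in mind (a sketch only; save_checkpoint and the names are placeholders, and if the agent enforces a ~2 s grace period, a long cleanup here will still get the task killed):

```python
import signal
from trains import Task

task = Task.init(project_name="my_project", task_name="my_task")

# Whatever handler trains installed for SIGTERM during Task.init().
previous_handler = signal.getsignal(signal.SIGTERM)

def chained_handler(signum, frame):
    # Hypothetical cleanup first...
    # save_checkpoint(...)
    # ...then hand control back to the original handler so trains can still
    # run its own shutdown. If this cleanup exceeds the ~2 s grace period,
    # the agent may terminate the task anyway.
    if callable(previous_handler):
        previous_handler(signum, frame)

signal.signal(signal.SIGTERM, chained_handler)
```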
I think I can work around this right now by running my tasks manually without trains-agent, but I'd love a way to do something on exit. AgitatedDove14 I'd be happy to create an issue. I think the solution might be a bit more in...