Answered

Hi everyone! Quick question: I have a script that allows the model to be saved out in case of an early exit. At the moment the script catches the SIGINT and SIGTERM signals, ends the training and writes out the model. I understand I could use checkpoints, but I'd rather write out the model in a cleaner way on exit, to a destination of my choice. I was hoping to have this work with the trains-agent's abort function, but it seems to be killing off the script in a more permanent way, maybe with SIGKILL? Is the functionality I'm looking for compatible with the way trains-agent works? Thanks!
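
For reference, the save-on-signal pattern described in the question can be sketched roughly as follows. This is a minimal illustration and not the poster's actual script: `request_stop`, `install_handlers`, `train`, and the pickled `state` dict are hypothetical names, and the loop body is only a placeholder for a real training iteration.

```python
import pickle
import signal
import threading

# Set by the signal handler, polled by the training loop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only record the request, do no heavy work here."""
    stop_requested.set()


def install_handlers():
    """Register the handler for SIGINT and SIGTERM (main thread only)."""
    for sig in (signal.SIGINT, signal.SIGTERM):
        signal.signal(sig, request_stop)


def train(max_steps, save_path):
    """Run a placeholder loop and always write the state out on exit."""
    state = {"step": 0}
    for step in range(1, max_steps + 1):
        if stop_requested.is_set():
            break  # leave the loop early, then fall through to the save
        state["step"] = step  # stand-in for one real training iteration
    with open(save_path, "wb") as f:
        pickle.dump(state, f)  # stand-in for writing out the model
    return state
```

A script would call `install_handlers()` once at startup and then `train(...)`. Note that this only helps if the process is given enough time to finish the current iteration and write the file; a SIGKILL cannot be caught at all.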

  
  
Posted 4 years ago

Answers 10


Sounds good, AgitatedDove14. I'll get an issue started. Thanks for the discussion!

  
  
Posted 4 years ago

SillyPuppy19 yes, you are correct. Actually, I can promise you the callback will be called from a different thread (basically the monitoring thread), so it's on the user to make sure the callback can handle that.
How about we move this discussion to GitHub?
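
To illustrate what "a callback that can handle being called from another thread" might look like, here is a minimal sketch. The callback feature discussed here is only a proposal in this thread, so everything below is hypothetical: `GracefulStop`, `on_abort`, and `run_loop` are made-up names and not part of trains. The idea is that the callback does nothing except set a thread-safe flag, and the training loop on the main thread does the actual shutdown work.

```python
import threading


class GracefulStop:
    """Thread-safe stop flag shared between a callback and the main loop."""

    def __init__(self):
        self._event = threading.Event()

    def on_abort(self):
        # May run on a monitoring thread: only set the flag, touch nothing else.
        self._event.set()

    def requested(self):
        return self._event.is_set()


def run_loop(stop, max_steps, step_fn):
    """Call step_fn up to max_steps times, stopping early if asked to."""
    done = 0
    for _ in range(max_steps):
        if stop.requested():
            break
        step_fn()
        done += 1
    return done  # the caller can save the model after the loop returns
```

Keeping the callback this small avoids sharing model state across threads; whether the process is given enough time to finish the current step afterwards is a separate question.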

  
  
Posted 4 years ago

Ah, the 2-second grace period answers a question I had. I tried to hijack the Task's signal handler to see if I could do my exit cleanup and then run the Task's handler, but it didn't seem to work. I think I must have hit the 2-second limit and had my task terminated.

I think I can work around this for now by running my tasks manually without trains-agent, but I'd love a way to do something on exit. AgitatedDove14 I'd be happy to create an issue. I think the solution might be a bit more involved than a simple callback, because the signal handler might be called in the same thread that also handles the cleanup. As an example, I'm using ignite, and in the signal handler I call the terminate() function on the engine. Whatever graceful exit handler is implemented would need to handle the asynchronicity between the signal handler returning and the script terminating some time later.
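
The "run my cleanup, then run the existing handler" idea mentioned in this post can be sketched with the standard `signal` module as below. This is an assumption-laden illustration: `make_chained` and `chain_handler` are hypothetical helpers, and it is not confirmed in this thread that the Task's handler can be wrapped this way.

```python
import signal


def make_chained(cleanup, previous):
    """Build a handler that runs cleanup, then the previously installed handler."""

    def handler(signum, frame):
        cleanup(signum)
        # SIG_DFL, SIG_IGN and None are not callable, so they are skipped.
        if callable(previous):
            previous(signum, frame)

    return handler


def chain_handler(sig, cleanup):
    """Wrap whatever handler is currently installed for sig (main thread only)."""
    previous = signal.getsignal(sig)
    signal.signal(sig, make_chained(cleanup, previous))
    return previous
```

With the 2-second grace period described in this thread, `cleanup` would have to be quick (for example, only requesting that the engine terminate), since anything still running when the period ends is killed.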

  
  
Posted 4 years ago

Hi SillyPuppy19,
The trains-agent does call all other hooks registered for SIGINT/SIGTERM - can you make sure you register your hook before calling Task.init() ?

  
  
Posted 4 years ago

Many thanks 🙂

  
  
Posted 4 years ago

SillyPuppy19 I think this is a great idea, basically having the ability to have a callback function called before aborting/exiting the process.

Unfortunately, today abort gives the process 2 seconds to quit gracefully and then kills it. It was not designed to just send an abort signal, because more often than not a signal alone will not actually terminate the process.

Any chance I can ask you to open a GitHub issue and suggest the callback feature? I have a feeling a few more users would like that ability. WDYT?

  
  
Posted 4 years ago

AgitatedDove14 I'm definitely after a graceful abort from a long experiment. I don't necessarily want to throw the state away but I don't want to have to recover everything from checkpoints, hence the save-on-terminate. If there's another way I should be looking at it I'd love to get your thoughts.

  
  
Posted 4 years ago

SillyPuppy19 are you aborting the experiment, or are you trying to protect against a crash? Is it a callback-like functionality you are looking for?

  
  
Posted 4 years ago

SuccessfulKoala55 that's good to know. I moved the signal handler registration above the call to Task.init() as you suggested. This is what I should be seeing when the script is terminated manually:

I0526 07:46:14.391154 140262441822016 engine.py:837] Engine run starting with max_epochs=100.
I0526 07:46:14.542132 140262441822016 train_utils.py:223] Epoch[1] Iter[1] Loss: 0.43599218130111694
I0526 07:46:24.078526 140262441822016 train_utils.py:46] 2 signal intercepted.
I0526 07:46:24.078753 140262441822016 engine.py:635] Terminate signaled. Engine will stop after current iteration is finished.

However, what I see is the following:

I0526 07:44:15.416634 140574824470336 engine.py:837] Engine run starting with max_epochs=100.
I0526 07:44:15.517145 140574824470336 train_utils.py:223] Epoch[1] Iter[1] Loss: 0.43599218130111694
2020-05-26 07:44:36 User aborted: stopping task (1)

Once the task is aborted there doesn't seem to be any more log output from the script. That might be because trains is cutting off the log, but I also don't see the model file saved anywhere. I'll keep looking, but thank you for the suggestion!

  
  
Posted 4 years ago