Hi everyone! Quick question: I have a script that saves the model in case of an early exit. At the moment the script catches the SIGINT and SIGTERM signals, ends the training, and writes out the model. I understand I could use checkpoints, but I'd rather write out the model in a cleaner way on exit, to a destination of my choice. I was hoping to have this work with trains-agent's abort function, but it seems to kill the script in a more permanent way, maybe with SIGKILL? Is the functionality I'm looking for compatible with the way trains-agent works? Thanks!
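
For readers following along, here is a minimal sketch of the save-on-signal pattern described in the question, assuming a plain Python training loop. `StopFlag`, `save_model`, `train`, and the output path are hypothetical stand-ins and not part of trains or trains-agent; a real script would call its framework's own save function.

```python
import json
import signal
import tempfile
from pathlib import Path


class StopFlag:
    """Callable signal handler that only records that a stop was requested."""

    def __init__(self):
        self.requested = False

    def __call__(self, signum, frame):
        self.requested = True


def save_model(state, destination):
    """Hypothetical stand-in for a real model writer such as torch.save."""
    Path(destination).write_text(json.dumps(state))


def train(stop, max_steps, destination, on_step=None):
    """Train until max_steps or until a stop is requested, then save once."""
    state = {"step": 0}
    for step in range(1, max_steps + 1):
        state["step"] = step  # stand-in for one real training iteration
        if on_step is not None:
            on_step(step)
        if stop.requested:
            break  # leave the loop cleanly; saving happens below
    save_model(state, destination)
    return state


if __name__ == "__main__":
    stop = StopFlag()
    # Install the same handler for both signals mentioned in the question.
    signal.signal(signal.SIGINT, stop)
    signal.signal(signal.SIGTERM, stop)
    out = Path(tempfile.gettempdir()) / "model_on_exit.json"
    print(train(stop, max_steps=1000, destination=out))
```

The handler itself does no I/O; it only flips a flag that the loop checks between iterations. This pattern cannot help if the process receives SIGKILL, because that signal cannot be caught.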

  				
Posted 4 years ago by SillyPuppy19

Answers 10

SillyPuppy19 yes, you are correct. In fact, I can promise you the callback will be called from a different thread (basically the monitoring thread), so it's on the user to make sure the callback can handle that.
How about we move this discussion to GitHub?
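
Since such a callback would run on a monitoring thread rather than on the training thread, one thread-safe approach is for it to only signal the training loop and then wait until the loop has finished its own cleanup. A minimal sketch, assuming a hypothetical `on_abort` callback (no such hook is confirmed to exist in trains at this point) and a hand-rolled loop standing in for the real trainer:

```python
import threading
import time

stop_requested = threading.Event()  # set by the callback (monitoring thread)
cleanup_done = threading.Event()    # set by the training loop (main thread)
saved = []                          # stand-in for the model written on exit


def on_abort(timeout=5.0):
    """Hypothetical abort callback; expected to run on the monitoring thread."""
    stop_requested.set()
    # Block until the training thread reports that it has saved its state,
    # or give up after `timeout` seconds. Returns True if cleanup finished.
    return cleanup_done.wait(timeout)


def training_loop(max_steps=1_000_000):
    """Runs on the main thread and checks for a stop request every iteration."""
    step = 0
    while step < max_steps and not stop_requested.is_set():
        step += 1
        time.sleep(0.001)  # stand-in for one training iteration
    saved.append(step)     # stand-in for writing the model
    cleanup_done.set()
    return step
```

With ignite, the check inside the loop would roughly correspond to the callback calling `terminate()` on the engine and the save happening once the run returns; the two events cover the gap between the callback returning and the script actually finishing.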

  				
Posted 4 years ago by AgitatedDove14

Ah, the 2-second grace period answers a question I had. I tried to hijack the Task's signal handler so that I could do my exit cleanup and then run the Task's handler, but it didn't seem to work. I think I must have run past the 2-second grace period and had my task terminated.

I think I can work around this for now by running my tasks manually without trains-agent, but I'd love a way to do something on exit. AgitatedDove14 I'd be happy to create an issue. I think the solution might be a bit more involved than a callback, because the signal handler might be called in the same thread that also handles the cleanup. As an example, I'm using ignite, and in the signal handler I call the terminate() function on the engine. Whatever graceful exit handler gets implemented would need to handle the asynchronicity between the signal handler returning and the script terminating some time after.

  				
Posted 4 years ago by SillyPuppy19

Hi SillyPuppy19,
The trains-agent does call all other hooks registered for SIGINT/SIGTERM. Can you make sure you register your hook before calling Task.init()?
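
To illustrate why registration order can matter in general: a library that installs its own signal handler can only chain to handlers that already exist at that moment. The sketch below shows that chaining pattern with the standard library only; `library_init` is a hypothetical stand-in and is not the actual trains implementation.

```python
import signal

calls = []  # records the order in which handlers run


def user_hook(signum, frame):
    """The user's own exit hook."""
    calls.append("user")


def library_init(signum_to_wrap):
    """Hypothetical library setup: install a handler that chains to the old one."""
    previous = signal.getsignal(signum_to_wrap)

    def chained(signum, frame):
        calls.append("library")
        if callable(previous):  # skips SIG_DFL, SIG_IGN and None
            previous(signum, frame)

    signal.signal(signum_to_wrap, chained)
    return chained
```

If `user_hook` is registered with `signal.signal` before `library_init` runs, the chained handler calls it. If it is registered afterwards, it simply replaces the chained handler, unless it saves and calls the previous handler itself.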

  				
Posted 4 years ago by SuccessfulKoala55

https://github.com/allegroai/trains-agent/issues/20

  				
Posted 4 years ago by SillyPuppy19

Many thanks 🙂

  				
Posted 4 years ago by AgitatedDove14

SillyPuppy19 I think this is a great idea, basically having the ability to have a callback function called before aborting/exiting the process.

Unfortunately, today abort gives the process 2 seconds to quit gracefully and then kills it. It was not designed to just send an abort signal, since more often than not a signal alone will not actually terminate the process.

Any chance I can ask you to open a GitHub issue and suggest the callback feature? I have a feeling a few more users would like that ability. WDYT?
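
The abort behaviour described above (a short grace period followed by a hard kill) can be sketched with the standard library as follows. This only illustrates the general pattern and is not trains-agent's actual code; `abort_process` and its `grace_seconds` default are assumptions based on the 2 seconds mentioned in this post.

```python
import subprocess
import sys


def abort_process(proc, grace_seconds=2.0):
    """Ask the process to terminate, then kill it if it has not exited in time.

    Returns True if the process exited within the grace period.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX; the child cannot catch it
        proc.wait()
        return False


if __name__ == "__main__":
    # Child that ignores SIGTERM, so the grace period expires and it gets killed.
    child_code = (
        "import signal, time;"
        "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
        "print('ready', flush=True);"
        "time.sleep(60)"
    )
    child = subprocess.Popen([sys.executable, "-c", child_code], stdout=subprocess.PIPE)
    child.stdout.readline()  # wait until the child has installed its handler
    print("exited within grace period:", abort_process(child))
```

Under this pattern, any save-on-exit logic in the child has to finish inside the grace period, which would explain a model file never appearing when saving waits for the current iteration to end.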

  				
Posted 4 years ago by AgitatedDove14

AgitatedDove14 I'm definitely after a graceful abort from a long experiment. I don't necessarily want to throw the state away but I don't want to have to recover everything from checkpoints, hence the save-on-terminate. If there's another way I should be looking at it I'd love to get your thoughts.

  				
Posted 4 years ago by SillyPuppy19

SillyPuppy19 are you aborting the experiment, or are you trying to protect against a crash? Is it a callback functionality you are looking for?

  				
Posted 4 years ago by AgitatedDove14

SuccessfulKoala55 that's good to know. I moved the signal handler registration above the call to Task.init() as you suggested. This is what I should be seeing when the script is terminated manually:

I0526 07:46:14.391154 140262441822016 engine.py:837] Engine run starting with max_epochs=100.
I0526 07:46:14.542132 140262441822016 train_utils.py:223] Epoch[1] Iter[1] Loss: 0.43599218130111694
I0526 07:46:24.078526 140262441822016 train_utils.py:46] 2 signal intercepted.
I0526 07:46:24.078753 140262441822016 engine.py:635] Terminate signaled. Engine will stop after current iteration is finished.

However, what I see is the following:

I0526 07:44:15.416634 140574824470336 engine.py:837] Engine run starting with max_epochs=100.
I0526 07:44:15.517145 140574824470336 train_utils.py:223] Epoch[1] Iter[1] Loss: 0.43599218130111694
2020-05-26 07:44:36 User aborted: stopping task (1)

Once the task is aborted there doesn't seem to be any more log output from the script. That might be because trains is cutting off the log, but I also don't see the model file saved anywhere. I'll keep looking, but thank you for the suggestion!

  				
Posted 4 years ago by SillyPuppy19

Sounds good AgitatedDove14 . I'll get an issue started. Thanks for the discussion!

  				
Posted 4 years ago by SillyPuppy19
