So, using ignite I do the following:
task.phases['valid'].add_event_handler(
    Events.EPOCH_COMPLETED(every=1),
    Checkpoint(
        to_save,
        TrainsSaver(output_uri=self.checkpoint_path),
        'best',
        n_saved=1,
        score_function=lambda x: task.phases['valid'].state.metrics[self.monitor]
        if self.monitor_mode == 'max'
        else -task.phases['valid'].state.metrics[self.monitor],
        score_name=self.monitor,
    ),
)
This means, as you said, that the last model saved with this checkpointing is the best model.
But since I passed a score_function, ignite automatically appends its value as a suffix, and I end up with: checkpoint_best_acc=0.9.pt. I guess it's also implied that when you use the score_function parameter, you're saving your best model (why would you use a score otherwise?).
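For what it's worth, the sign flip in that lambda can be pulled out into a small helper. This is only a sketch in plain Python: make_score_function and the get_metrics callable are names I made up for illustration, they are not part of ignite or trains.

```python
from typing import Callable, Dict


def make_score_function(
    get_metrics: Callable[[], Dict[str, float]],
    monitor: str,
    mode: str = "max",
) -> Callable[[object], float]:
    """Build a score function where a higher return value always means 'better'.

    Hypothetical helper: get_metrics stands in for something like
    lambda: task.phases['valid'].state.metrics
    """
    if mode not in ("max", "min"):
        raise ValueError(f"mode must be 'max' or 'min', got {mode!r}")
    sign = 1.0 if mode == "max" else -1.0

    def score_function(engine: object) -> float:
        # The engine argument is ignored, as in the lambda above; the metric
        # is looked up at call time so it reflects the latest validation run.
        return sign * get_metrics()[monitor]

    return score_function
```

One thing to keep in mind: in 'min' mode the function returns the negated metric, so I'd expect that negated value to be what ends up in the filename suffix.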
However, as we discussed in the issue in the ignite repo, there is currently no way to have a checkpoint that is simply named checkpoint_best. So if I use the ignite Checkpoint to save the models, I have no way to change the suffix as an end user. And when I use it together with trains, I end up uploading to output_uri every best model ever saved, even though I set n_saved=1 (because of the issue we're discussing in ignite).
Now on top of that I also have:
task.phases['train'].add_event_handler(
    Events.EPOCH_COMPLETED(every=self.save_freq),
    Checkpoint(
        to_save,
        TrainsSaver(output_uri=self.checkpoint_path),
        'epoch',
        n_saved=5,
        global_step_transform=global_step_from_engine(task.phases['train']),
    ),
)
That saves checkpoints every 20 epochs as a backup (regardless of their accuracy).
That means I have no way of knowing what type of checkpoint the last torch.save call produced. It might be a best model, or it might be one of the periodic backups.
Based on all that discussion, I'm actually thinking that for the Trains and ignite integration there should maybe be a TrainsCheckpoint instead of a TrainsSaver, one that takes care of everything properly with respect to trains. But I'm not sure if that's something the ignite ppl would want.
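To make the idea a bit more concrete, here is a rough sketch of the behaviour I have in mind: track the best score and always write to one fixed name, so each upload overwrites the previous one. BestCheckpointTracker is a made-up name in plain Python, not an existing ignite or trains class, and the actual saving and uploading is left out.

```python
from typing import Optional


class BestCheckpointTracker:
    """Hypothetical sketch, not an ignite or trains class.

    Decides whether a new score beats the best seen so far and always
    exposes the same filename, so a new best replaces the old file.
    """

    def __init__(self, filename: str = "checkpoint_best.pt") -> None:
        self.filename = filename
        self.best_score: Optional[float] = None

    def should_save(self, score: float) -> bool:
        # Higher is better, matching the score_function convention.
        if self.best_score is None or score > self.best_score:
            self.best_score = score
            return True
        return False


tracker = BestCheckpointTracker()
# Scores from three validation epochs; only improvements trigger a save.
saved = [tracker.filename for s in (0.7, 0.9, 0.8) if tracker.should_save(s)]
```

Here the first two scores trigger a save under the same name and the third is skipped, so only one best file would ever exist at output_uri.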
WDYT?