Sorry again for those walls of text. Just thought that detailed explanation of how model naming for remote models works with Ignite handlers could be helpful to somebody in the future (because I spent quite some time trying to figure out why what was working perfectly fine locally started to overwrite one another when I added output_uri
)
@<1523701205467926528:profile|AgitatedDove14> I guess, the main issue is the lost of model name file especially in case when the model is being saved based on the metric value. As in the screenshots above, in the UI Model Name
is being just an experiment name after the first epoch, and not the name of the actual model file (which is different from the stored file name on the server, got it). So to understand from what epoch these weights were saved off you would need manually go to model file General->Description->priority and then check what was the step with this value
And the last question on top of that (sorry!), regarding the concept of OUTPUT MODELS and MODEL NAMES. For this example, I only used one saver to save off 2 last checkpoints. When model is being uploaded for the first time the MODEL NAME
in the UI is full and correct (as you can see in the first screenshot), but when it is being overwritten in the following epochs it only shows name of the experiment in the MODEL NAME
and therefore all the info which was stored in the filename (like epoch number, score value, etc. is being missed, and there is no clear way on how to restore it, except from just checking manually how many epochs there were, or, for example, on what epoch the score of the target metric was the lowest). So actually 2 questions, is it specific to ClearMLSaver()
that in OUTPUT MODELS
in the UI we have the following names {filename_prefix}_checkpoint_{n}.pt
(where n is from 0 to n_saved-1
) instead of {filename_prefix}_checkpoint_{epoch_number}.pt
? And would it be possible to keep full MODEL NAME
during the training and get it updated every time saver overwrites the model.
Hi @<1684010629741940736:profile|NonsensicalSparrow35>
however for the remote file it always creates the name with the following pattern:
{filename_prefix}checkpoint{n}.pt
..
Is this the main issue?
Notice that the model name (i.e. the entry on the Task itself) is not directly connected with the stored file name on the target file server (or S3)
Just to demonstrate the workaround I described will attach an example from the UI on how it looks at the moment. Here I used 2 savers, with n_saved=2
, and filename_prefix=str(date.today()) + "_val_neg_img_loss"
and filename_prefix=str(date.today())
, therefore there are 4 output models in total. If I wouldn't add "_val_neg_img_loss"
to one prefix there would be only 2 models, even though (as you can see in the screenshot) in the model name the _val_neg_img_loss
was used already because it is passed as score_name
parameter