Hello Everyone! A Question Regarding Uploading Model Weights As Artifacts. I Use

Answered

Hello everyone! A question regarding uploading model weights as artifacts. I use ClearMLSaver() and Checkpoint() from PyTorch Ignite and upload models to S3. I use two savers in the training script: one saves the last model, and the other saves the best model based on a metric value.

I noticed that if I use the same filename_prefix in both Checkpoint() objects, they use the same remote path and overwrite each other's model object there. This happens even though one of the Checkpoint() objects is given the score_name parameter, which is used to construct the full file name when the file is saved locally. For the remote file, however, the name always follows the pattern {filename_prefix}_checkpoint_{n}.pt, where n runs from 0 to n_saved-1 (n_saved is the Checkpoint() parameter that sets how many objects are kept).

As a result, different savers overwrite the same file unless filename_prefix is different. Is this intended behaviour, or am I using the saver incorrectly? Sorry in advance for the lengthy message and the probably confusing explanation.
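A minimal sketch of the naming behaviour described above, in plain Python. The helpers `local_name` and `remote_name` are hypothetical and only mimic the patterns reported in this question; they are not part of PyTorch Ignite or ClearML, and the exact local file name produced by Ignite may differ.

```python
# Hypothetical helpers that only mimic the naming described in the question;
# they are NOT functions from PyTorch Ignite or ClearML.

def local_name(filename_prefix, epoch, score_name=None, score=None):
    """Local file name: carries the epoch and, if given, the score."""
    name = f"{filename_prefix}_checkpoint_{epoch}"
    if score_name is not None:
        name += f"_{score_name}={score:.4f}"
    return name + ".pt"


def remote_name(filename_prefix, slot):
    """Remote object name: only the prefix and a slot index in [0, n_saved)."""
    return f"{filename_prefix}_checkpoint_{slot}.pt"


prefix = "2024-01-01"  # illustrative value, shared by both savers
n_saved = 2

# "Last" saver: no score_name. "Best" saver: score_name is set.
last_local = [local_name(prefix, epoch) for epoch in (9, 10)]
best_local = [
    local_name(prefix, epoch, "val_neg_img_loss", score)
    for epoch, score in ((4, -0.31), (7, -0.25))
]

# Remotely, both savers end up with the same set of slot names.
last_remote = [remote_name(prefix, slot) for slot in range(n_saved)]
best_remote = [remote_name(prefix, slot) for slot in range(n_saved)]

print("shared local names: ", set(last_local) & set(best_local))
print("shared remote names:", set(last_remote) & set(best_remote))
```

Under these assumptions the local names never collide because the score is part of the name, while every remote name is shared between the two savers.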

  				
Posted one year ago by NonsensicalSparrow35


Answers 5

To demonstrate the workaround I described, I am attaching an example from the UI showing how it looks at the moment. Here I used 2 savers with n_saved=2, one with filename_prefix=str(date.today()) + "_val_neg_img_loss" and the other with filename_prefix=str(date.today()), so there are 4 output models in total. If I hadn't added "_val_neg_img_loss" to one prefix, there would be only 2 models, even though (as you can see in the screenshot) _val_neg_img_loss already appears in the model name because it is passed as the score_name parameter.
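The effect of the workaround can be sketched in plain Python. The helpers `saver_prefix` and `remote_names` are hypothetical and assume the remote naming pattern {filename_prefix}_checkpoint_{n}.pt reported in this thread; they do not call Ignite or ClearML.

```python
from datetime import date


def saver_prefix(today, score_name=None):
    """Hypothetical helper: build a per-saver filename_prefix from the date."""
    prefix = str(today)
    return f"{prefix}_{score_name}" if score_name else prefix


def remote_names(filename_prefix, n_saved):
    """Assumed remote names for one saver: one per slot in [0, n_saved)."""
    return {f"{filename_prefix}_checkpoint_{slot}.pt" for slot in range(n_saved)}


today = date.today()
n_saved = 2

# Both savers share one prefix: their remote names coincide.
shared = remote_names(saver_prefix(today), n_saved) | remote_names(
    saver_prefix(today), n_saved
)

# One saver adds the score name to its prefix: the names stay apart.
distinct = remote_names(saver_prefix(today), n_saved) | remote_names(
    saver_prefix(today, "val_neg_img_loss"), n_saved
)

print(len(shared), len(distinct))
```

With n_saved=2 this gives 2 distinct remote names for the shared prefix and 4 for the separate prefixes, matching the counts described above.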

  				
Posted one year ago by NonsensicalSparrow35

Sorry again for the walls of text. I just thought that a detailed explanation of how remote model naming works with the Ignite handlers could help somebody in the future, because I spent quite some time trying to figure out why checkpoints that were saved perfectly fine locally started to overwrite one another once I added output_uri.

  				
Posted one year ago by NonsensicalSparrow35

And one last question on top of that (sorry!), regarding the concept of OUTPUT MODELS and MODEL NAMES. For this example, I used only one saver to keep the last 2 checkpoints. When a model is uploaded for the first time, the MODEL NAME in the UI is full and correct (as you can see in the first screenshot). When it is overwritten in later epochs, however, MODEL NAME shows only the name of the experiment, so all the information stored in the filename (epoch number, score value, etc.) is lost. There is no clear way to restore it other than manually checking how many epochs there were or, for example, at which epoch the target metric was lowest.

So there are actually 2 questions:

Is it specific to ClearMLSaver() that OUTPUT MODELS in the UI are named {filename_prefix}_checkpoint_{n}.pt (where n is from 0 to n_saved-1) instead of {filename_prefix}_checkpoint_{epoch_number}.pt?

Would it be possible to keep the full MODEL NAME during training and have it updated every time the saver overwrites the model?

  				
Posted one year ago by NonsensicalSparrow35

@AgitatedDove14 I guess the main issue is the loss of the model file name, especially when the model is saved based on a metric value. As in the screenshots above, after the first epoch the Model Name in the UI is just the experiment name and not the name of the actual model file (which is different from the stored file name on the server, got it). So to find out which epoch these weights were saved from, you need to go manually to the model's General->Description->priority and then check which step had this value.
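One possible way to avoid that manual lookup is to keep a small side record in the training script. The sketch below is plain Python; `PriorityLedger` is a hypothetical helper, not part of Ignite or ClearML, and it assumes that the priority shown in the model description equals the metric value recorded at validation time.

```python
class PriorityLedger:
    """Hypothetical side record: maps a checkpoint priority (score) to epochs."""

    def __init__(self, ndigits=6):
        # Round keys so that float values copied from the UI still match.
        self.ndigits = ndigits
        self._epochs = {}

    def record(self, epoch, priority):
        """Remember that `epoch` produced a checkpoint candidate with `priority`."""
        key = round(priority, self.ndigits)
        self._epochs.setdefault(key, []).append(epoch)

    def epochs_for(self, priority):
        """Return the epochs whose recorded priority matches the given value."""
        return list(self._epochs.get(round(priority, self.ndigits), []))


# Illustrative values only: one metric value per epoch, starting at epoch 1.
ledger = PriorityLedger()
for epoch, val_neg_img_loss in enumerate([-0.90, -0.50, -0.70, -0.30], start=1):
    ledger.record(epoch, val_neg_img_loss)

# Look up the epoch for a priority value read from the model description.
print(ledger.epochs_for(-0.30))
```

The ledger could be dumped to a JSON file next to the checkpoints; it only answers "which epoch had this priority" and does not change how the models are named.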

  				
Posted one year ago by NonsensicalSparrow35

Hi @NonsensicalSparrow35

"however for the remote file it always creates the name with the following pattern: {filename_prefix}_checkpoint_{n}.pt ..."

Is this the main issue?
Notice that the model name (i.e. the entry on the Task itself) is not directly connected with the stored file name on the target file server (or S3).

  				
Posted one year ago by AgitatedDove14
