Hey There, I Am A New User Of Clearml And Really Enjoying It So Far! I Noticed That My Model Checkpoints Are Saved After Each Epoch. Instead I Would Like To Only Save The Best And Last Model Checkpoint. Is That Possible? I Could Not Find Something Regardi

Answered

Hey there, I am a new user of ClearML and really enjoying it so far!
I noticed that my model checkpoints are saved after each epoch. Instead, I would like to save only the best and the last model checkpoint. Is that possible? I could not find anything about this in the docs.

  				
Posted 2 years ago by NuttyKoala57


Answers 4

Hi @NuttyKoala57! You can use wildcards in auto_connect_frameworks to filter which models get logged. Check the docs for Task.init. There is also a GH thread describing another way to do this.
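A minimal sketch of the wildcard filter mentioned above, assuming the checkpoints are written by Keras/TensorFlow and that the best and last checkpoints have file names containing "best" and "last". The project name, task name, and patterns are placeholders and not values from this thread. The would_match helper only approximates the filter locally with Python's fnmatch; ClearML's own matching may differ in detail.

```python
from fnmatch import fnmatch

# Hypothetical patterns: adjust to the file names your training code writes.
CHECKPOINT_PATTERNS = ["*best*", "*last*"]

# Assumption: Keras checkpoints are handled under the "tensorflow" key.
AUTO_CONNECT = {"tensorflow": CHECKPOINT_PATTERNS}


def would_match(filename, patterns=CHECKPOINT_PATTERNS):
    """Local approximation of the filter, handy for checking patterns."""
    return any(fnmatch(filename, pattern) for pattern in patterns)


def init_task():
    # Imported lazily so the helper above can be tried without clearml installed.
    from clearml import Task

    return Task.init(
        project_name="examples",          # placeholder
        task_name="best-and-last-only",   # placeholder
        auto_connect_frameworks=AUTO_CONNECT,
    )
```

With these patterns, a file such as model_best.h5 would pass the local check while epoch_03.h5 would not. Whether that is enough depends on how your training code names its checkpoints.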

  				
Posted 2 years ago by SmugDolphin23

Depending on the framework you're using, ClearML just hooks into the save-model operation, so a model is captured every time you save one, which will probably happen every epoch for some subset of the training. If you want to keep using the existing framework integration, you could change your checkpointing so that it only keeps a copy of the best model in memory and defers the write operation until the end of training. The risk is that if the training crashes, you'll lose your best model.
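The in-memory approach could be sketched roughly as follows. This is a framework-agnostic illustration, not ClearML or Keras code: the BestInMemory class and the toy loop are made up for this example, and a plain dict stands in for real model weights.

```python
import copy
import pickle


class BestInMemory:
    """Keeps a copy of the best weights in memory; writes to disk only on request."""

    def __init__(self):
        self.best_loss = float("inf")
        self.best_weights = None

    def update(self, loss, weights):
        # Copy only when the monitored loss improves; nothing is written yet.
        if loss < self.best_loss:
            self.best_loss = loss
            self.best_weights = copy.deepcopy(weights)
            return True
        return False

    def save(self, path):
        # Single write, intended for the end of training.
        with open(path, "wb") as f:
            pickle.dump(self.best_weights, f)


# Toy loop: a dict stands in for real model weights.
tracker = BestInMemory()
weights = {"w": 0.0}
for epoch, loss in enumerate([0.9, 0.5, 0.7]):
    weights["w"] = float(epoch)  # pretend training changed the weights
    tracker.update(loss, weights)
```

In a Keras setup, the same idea would probably live in a custom callback that calls model.get_weights() at the end of each epoch and saves once when training ends. As noted above, a crash before that final save loses the best weights.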

Alternatively, you could disable the ClearML integration with your framework and manually specify when to sync everything to the server.
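A rough sketch of that manual route, assuming Keras checkpoints fall under the "tensorflow" key of auto_connect_frameworks and that the best and last checkpoints already exist as files on disk. The project name, task name, and helper functions are placeholders invented for this example.

```python
import os


def existing_checkpoints(paths):
    """Return only the checkpoint files that are actually on disk."""
    return [p for p in paths if os.path.isfile(p)]


def init_task_without_model_autolog():
    # Imported lazily so the helper above runs without clearml installed.
    from clearml import Task

    return Task.init(
        project_name="examples",        # placeholder
        task_name="manual-upload",      # placeholder
        # Assumption: this key covers Keras model saving.
        auto_connect_frameworks={"tensorflow": False},
    )


def upload_checkpoints(task, paths):
    from clearml import OutputModel

    for path in existing_checkpoints(paths):
        model = OutputModel(task=task, name=os.path.basename(path))
        # auto_delete_file=False keeps the local copy after the upload.
        model.update_weights(weights_filename=path, auto_delete_file=False)
```

After training finishes, calling upload_checkpoints(task, ["best.h5", "last.h5"]) (hypothetical file names) would register just those two files with the task.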

I'm still a bit new to the platform, so I'd love to hear from others if there's another solution.

  				
Posted 2 years ago by EnthusiasticCow4

Yeah, it's because it's just hooking into the save operation and capturing the output, regardless of the parent call.

  				
Posted 2 years ago by EnthusiasticCow4

Thanks, I think I have identified the issue, and I opened a bug report for it.

The problem is with the Keras BackupAndRestore callback: ClearML replaces the callback's local backup storage with storage on the ClearML server. In this case, however, local storage is sufficient, since the backup is only used to resume training after an interruption.

  				
Posted 2 years ago by NuttyKoala57
