This is how I implemented it myself. It looks like ClearML's functionality is quite opinionated and requires some tweaks every time I try to replace my own stuff with it
Well, you can simply do the following:
1. Start with top 3 models named top1, top2, top3
2. Keep all 3 in disk cache during the run
3. Build logic to rate a new model during the run depending on its standing compared to the top 3
4. Decide on the new standing of the top 3
5. Perform update_weights_package on the relevant "new" top 3 models, once per model
This is only off the top of my head. I'm sure you could create something better without even needing to cache 3 models during the run
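To make the rating and standing steps above a bit more concrete, here is a rough sketch in plain Python. It only does the bookkeeping: `update_standing` and `SLOT_NAMES` are made-up names, it assumes lower loss is better and that loss values are distinct, and the actual saving and uploading is left out.

```python
from typing import List, Tuple

# Hypothetical slot names, mirroring top1/top2/top3 from the message above.
SLOT_NAMES = ["top1", "top2", "top3"]


def update_standing(losses: List[float], new_loss: float) -> Tuple[List[float], List[str]]:
    """Insert new_loss into the ranked losses (best first, at most 3 entries).

    `losses` must already be sorted ascending. Returns the new ranking and
    the slot names whose content changed, i.e. the slots that would need
    their weights written and uploaded again.
    """
    k = len(SLOT_NAMES)
    ranked = sorted(losses + [new_loss])[:k]
    changed = [
        SLOT_NAMES[i]
        for i, loss in enumerate(ranked)
        if i >= len(losses) or losses[i] != loss
    ]
    return ranked, changed


if __name__ == "__main__":
    # A new best model shifts every slot, so all three would be re-uploaded.
    print(update_standing([0.5, 0.6, 0.7], 0.4))
    # A model that only beats the current third place touches one slot.
    print(update_standing([0.5, 0.6, 0.7], 0.65))
```

Note that with rank-based names a new best checkpoint changes all three slots, which is the shifting cost raised elsewhere in this thread.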
How are you saving your models? torch.save("<MODEL_NAME>")?
if the loss is lower than the best stored loss so far, add the new checkpoint and remove the top-4th
Strictly speaking, there is only one training task, but I want to keep top-3 best checkpoints for it all the time
If I keep track of 3 OutputModels simultaneously, the weights would need to shift between them every epoch (like, updated weights for top-1, then top-1 becomes top-2, top-2 becomes top-3 etc)
Is there some sort of OutputModel.remove method? The docs say there isn't
e.g. if I want to store only top-3 running best checkpoints
You mean you would like to delete an output model of a task if other models in the task surpass it?
CostlyOstrich36 thank you for the answer! Maybe I can just delete old models along with the corresponding tasks, that seems easier
If I'm not mistaken, models reflect the file names. So if you recycle the file names you recycle the models. So if you save torch.save("top1.pt") then later torch.save("top2.pt") and even later do torch.save("top1.pt") again, you will only have 2 OutputModels, not three. This way you can keep recycling the best models 🙂
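One possible way to use the name recycling idea without shifting weights between slots is to always overwrite the worst stored checkpoint and reuse its file name, so each improvement means a single save and a single upload. This is just a sketch: `pick_filename` and the `slot0.pt` style names are made up for illustration, lower loss is assumed to be better, and whether ClearML really maps a reused file name onto the same OutputModel is taken from the message above, not verified here.

```python
from typing import Dict, Optional


def pick_filename(stored: Dict[str, float], new_loss: float, k: int = 3) -> Optional[str]:
    """Return the file name to save the new checkpoint under, or None to skip it.

    `stored` maps a file name (e.g. "slot0.pt") to the loss of the
    checkpoint currently saved under that name. Assumes k >= 1.
    """
    if len(stored) < k:
        # A slot is still free: take the first unused name.
        for i in range(k):
            name = f"slot{i}.pt"
            if name not in stored:
                return name
    # All slots are taken: find the worst stored checkpoint.
    worst = max(stored, key=stored.get)
    if new_loss < stored[worst]:
        return worst  # overwrite it, reusing its file name
    return None


if __name__ == "__main__":
    stored: Dict[str, float] = {}
    for loss in [0.9, 0.8, 0.7, 0.75, 0.6]:
        name = pick_filename(stored, loss)
        if name is not None:
            # Here one would save the weights to `name` and upload that one file.
            stored[name] = loss
    print(stored)
```

The trade-off is that the file name no longer tells you the rank, so the ranking would have to be kept separately, for example in the loss values tracked by the training loop.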
This way I would need to keep track of 3 OutputModels and call update_weights 3 times every update, and probably do 2 redundant uploads
if I just use plain boto3 to sync weights to/from S3, I just check how many files are stored in the location, and clear up the old ones
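For reference, the plain boto3 cleanup described above could look roughly like this. It is a sketch, not my exact code: `prune_checkpoints` is a made-up helper, the client is passed in (something like `boto3.client("s3")`), and it assumes fewer than 1000 objects under the prefix, since `list_objects_v2` returns at most 1000 keys per call.

```python
from typing import Any, List


def prune_checkpoints(s3_client: Any, bucket: str, prefix: str, keep: int = 3) -> List[str]:
    """Delete all but the `keep` most recently modified objects under `prefix`.

    Returns the keys that were deleted.
    """
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix)
    objects = response.get("Contents", [])  # "Contents" is absent when nothing matches
    # Newest first, so everything after the first `keep` entries is stale.
    objects = sorted(objects, key=lambda obj: obj["LastModified"], reverse=True)
    stale = [obj["Key"] for obj in objects[keep:]]
    if stale:
        s3_client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": key} for key in stale]},
        )
    return stale


# Usage (bucket and prefix are placeholders):
#   client = boto3.client("s3")
#   prune_checkpoints(client, "my-bucket", "checkpoints/", keep=3)
```

This prunes by modification time, matching "clear up the old ones"; pruning by loss instead would need the loss stored somewhere, for example in the key name.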
` import os
from clearml import OutputModel

# task, params and save_path come from the surrounding training script
clearml_name = os.path.basename(save_path)
output_model_best = OutputModel(
    task=task,
    name=clearml_name,
    tags=['running-best'])
output_model_best.update_weights(
    save_path,
    upload_uri=params.clearml_aws_checkpoints,
    target_filename=clearml_name
) `
hm, not quite clear how it is implemented. For example, this is how I do it now (explicitly)
Is there a way to simplify it with ClearML, not make it more complicated?