Thanks, I think I've identified the issue. I opened a bug here: None
The problem is with the Keras BackupAndRestore callback, where ClearML replaces the local backup storage with storage on the ClearML server. In this case, however, local storage is sufficient, since the backup is only used to resume training after an interruption.
Yeah, that's because ClearML just hooks into the save operation and captures the output, regardless of which parent call triggered it.
Depending on the framework you're using, it just hooks into the save-model operation, so an upload is triggered every time you save a model, which will probably happen every epoch for some subset of the training. If you want to stay with the existing framework integration, you could change the checkpointing so that it only clones the best model in memory and leaves the write operation for last. The risk is that if the training crashes, you'll lose your best model.
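To make the "keep the best model in memory, write once at the end" idea concrete, here's a rough, framework-agnostic sketch. `BestModelKeeper` is a hypothetical helper I made up for illustration, not part of Keras or ClearML, and the `pickle.dump` at the end is only a stand-in for your framework's real save call. With Keras I'd expect you to call `update()` from a custom callback with `model.get_weights()`, then do `model.set_weights(...)` and a single `model.save(...)` after training.

```python
import copy
import pickle


class BestModelKeeper:
    """Hypothetical helper: hold the best weights in memory, write to disk once."""

    def __init__(self, mode="min"):
        if mode not in ("min", "max"):
            raise ValueError("mode must be 'min' or 'max'")
        self.mode = mode
        self.best_metric = None
        self.best_weights = None

    def _improved(self, metric):
        # The first epoch always counts as an improvement.
        if self.best_metric is None:
            return True
        if self.mode == "min":
            return metric < self.best_metric
        return metric > self.best_metric

    def update(self, metric, weights):
        """Call once per epoch. Returns True if this epoch is the new best."""
        if self._improved(metric):
            self.best_metric = metric
            # In-memory clone only: no file is written, so no save hook fires.
            self.best_weights = copy.deepcopy(weights)
            return True
        return False

    def save(self, path):
        """The single write at the end of training.

        pickle is a stand-in here; in a real run this is where you would
        restore the weights into the model and call the framework's save.
        """
        if self.best_weights is None:
            raise RuntimeError("no weights recorded yet")
        with open(path, "wb") as f:
            pickle.dump(self.best_weights, f)
```

The trade-off is the one mentioned above: nothing reaches disk until `save()` runs, so a crash before that point loses the best weights.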
Alternatively, you could disable the ClearML integration with your framework and manually specify when to sync everything to the server.
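Something like the following is what I have in mind for the manual route. Treat it as a sketch under assumptions: the project name, task name, and `best_model.h5` path are placeholders, and I believe Keras saving is covered by the `"tensorflow"` key of `auto_connect_frameworks`, but please check that against the docs for your ClearML version. The `clearml` import lives inside the function so the snippet can be read and loaded without a configured server.

```python
def task_init_kwargs():
    """Arguments for Task.init with automatic model capture switched off.

    Assumption: the "tensorflow" key also covers Keras save hooks.
    Project and task names are placeholders.
    """
    return {
        "project_name": "examples",
        "task_name": "manual model sync",
        "auto_connect_frameworks": {"tensorflow": False},
    }


def upload_final_model(weights_path="best_model.h5"):
    """Register one model file with the server, only when you decide to.

    Requires the clearml package and a configured ClearML server.
    """
    from clearml import OutputModel, Task

    task = Task.init(**task_init_kwargs())

    # ... train here; local saves (including BackupAndRestore) stay local ...

    # Explicit, one-off upload of the file you actually care about.
    output_model = OutputModel(task=task, framework="Keras")
    output_model.update_weights(weights_filename=weights_path)
    return output_model
```

With this setup nothing is uploaded on intermediate saves; the only sync is the explicit `update_weights` call.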
I'm still a bit new to the platform, so I'd love to hear from others if there's another solution.