Hmmmm I couldn't find anything in the SDK; however, you can use the API to do it
I plan to append the checkpoint to a list, when the len(list) > N, I'll just pop out the one with the highest loss, and delete that file from clearml and storage. That's how I plan to work with it.
I ran a training code from a github repo. It saves checkpoints every 2000 iterations. Only problem is I'm training it for 3200 epochs and there's more than 37000 iterations in each epoch. So the checkpoints just added up. I've stopped the training for now. I need to delete all of those checkpoints before I start training again.
VexedCat68
Are you uploading the checkpoints manually with artifacts, or are they autologged & uploaded?
Also, why not reuse and overwrite older checkpoints?
I think it depends on your implementation. How are you currently implementing top X checkpoints logic?
shouldn't checkpoints be uploaded immediately, that's the purpose of checkpointing isn't it?
VexedCat68 , I was about to mention it myself. Maybe only keeping last few or last best checkpoints would be best in this case. I think SDK also supports this quite well 🙂
the storage is basically the machine the clearml server is on, not using s3 or anything
AgitatedDove14 CostlyOstrich36 I think that is the approach that'll work for me. I just need to be able to remove checkpoints I don't need given I know their name, from the UI and Storage.
Is there a difference? I mean my use case is pretty simple. I have a training run and it basically creates a lot of checkpoints. I just want to keep the N best checkpoints, and whenever there are more than N checkpoints, I'll delete the worst performing one. It should be deleted both locally and from the task artifacts.
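The "keep the N best" bookkeeping described above could be sketched roughly as follows. This is only a minimal local sketch: the class name TopNCheckpoints and the on_evict hook are hypothetical (not part of ClearML), and it assumes a lower loss means a better checkpoint. Removing the remote copy would have to be plugged in through the hook.

```python
import os


class TopNCheckpoints:
    """Track checkpoint files and keep only the n with the lowest loss."""

    def __init__(self, n, on_evict=None):
        self.n = n
        self.entries = []  # list of (loss, path) tuples
        # on_evict is a hypothetical hook, e.g. to also delete the remote copy
        self.on_evict = on_evict

    def add(self, loss, path):
        """Register a checkpoint; return the evicted path, or None."""
        self.entries.append((loss, path))
        if len(self.entries) <= self.n:
            return None
        # The worst checkpoint is the one with the highest loss
        worst = max(self.entries, key=lambda entry: entry[0])
        self.entries.remove(worst)
        worst_path = worst[1]
        if os.path.exists(worst_path):
            os.remove(worst_path)  # delete the local file
        if self.on_evict is not None:
            self.on_evict(worst_path)
        return worst_path
```

Calling add(loss, path) after each checkpoint is saved keeps at most N files on disk. Note that a new checkpoint that is worse than all the kept ones is evicted immediately.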
VexedCat68
delete the uploaded file, or the artifact from the Task ?
And given that I have artifacts = task.get_registered_artifacts()
AgitatedDove14 Alright I think I understand, changes made in storage will be visible in the front end directly.
basically don't want the storage to be filled up on the ClearML Server machine.
Currently every 2000 iterations, a checkpoint is saved, that's just part of the code. Since output_uri = True, it gets uploaded to the ClearML server.
I think these are the relevant methods 🙂
https://clear.ml/docs/latest/docs/references/sdk/task#register_artifact
https://clear.ml/docs/latest/docs/references/sdk/task#unregister_artifact
And later you can use
https://clear.ml/docs/latest/docs/references/sdk/task#upload_artifact
When you have a finalized version of what you want
Since that is an immediate concern for me as well.
Also I need to modify the code to only keep the N best checkpoints as artifacts and remove others.
How do I go about uploading those registered artifacts, would I just pass artifacts[i] and the name for the artifact?
Hmm, you can delete the artifact with: task._delete_artifacts(artifact_names=['my_artifact'])
However this will not delete the file itself.
To delete the file I would do:
from clearml.storage.helper import StorageHelper
remote_file = task.artifacts['delete_me'].url
h = StorageHelper.get(remote_file)
h.delete(remote_file)
task._delete_artifacts(artifact_names=['delete_me'])
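The snippet above could be wrapped in a small helper so the same steps run for any artifact name. This is a sketch, not an official interface: the function name and the get_helper parameter are hypothetical (the latter only exists so the logic can be exercised without a server), and _delete_artifacts is a private method, so its behavior may differ between ClearML versions.

```python
def delete_artifact_and_file(task, artifact_name, get_helper=None):
    """Delete the stored file behind an artifact, then the artifact entry."""
    if get_helper is None:
        # Imported lazily so the sketch can be read and tested without clearml
        from clearml.storage.helper import StorageHelper
        get_helper = StorageHelper.get
    # Look up where the artifact file was uploaded
    remote_file = task.artifacts[artifact_name].url
    helper = get_helper(remote_file)
    # Remove the file from storage first, while the URL is still known
    helper.delete(remote_file)
    # Then drop the artifact entry from the task (private API, may change)
    task._delete_artifacts(artifact_names=[artifact_name])
    return remote_file
```

The file is deleted before the artifact entry because the URL is read from the entry; doing it in the other order would lose the location of the file.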
Maybe we should have a proper interface for that? wdyt? what's the actual use case?
I need to both remove the artifact from the UI and the storage.
Will using Model.remove completely delete it from storage as well?
Correct, see the argument delete_weights_file=True
Given a situation where I want to delete an uploaded artifact from both the UI and the storage, how would I go about doing that?
VexedCat68 the remote checkpoints (i.e. Models) represent the local storage, so if you internally overwrite the files, this is exactly what will happen in the backend. So the following should work (and store the last 5 checkpoints):
epochs += 1
# model stands for whatever object you are checkpointing
torch.save(model, "model_{}.pt".format(epochs % 5))
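To see why the modulo trick caps the number of files, here is a small sketch of just the filename rotation, without torch. The function name rotating_checkpoint_path and its keep and pattern parameters are hypothetical names for illustration.

```python
def rotating_checkpoint_path(epoch, keep=5, pattern="model_{}.pt"):
    """Return a filename that cycles through `keep` slots."""
    # epoch % keep is always in 0..keep-1, so old files get overwritten
    return pattern.format(epoch % keep)


# Twelve epochs only ever touch five distinct filenames
names = [rotating_checkpoint_path(epoch) for epoch in range(1, 13)]
distinct_names = sorted(set(names))
```

Because each save reuses one of the five names, the newest checkpoint always overwrites the oldest slot, which keeps the last 5 rather than the best 5.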
Regarding deleting / getting models: Model.remove(task.models['output'][-1])
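For clearing out checkpoints that have already piled up, the Model.remove call could be applied in a loop. This is a rough sketch under assumptions: the function name prune_output_models and the remove parameter are hypothetical, and it assumes task.models['output'] is ordered oldest first (as the [-1] index above suggests), which is worth verifying on your own task before deleting anything.

```python
def prune_output_models(task, keep_last, remove=None):
    """Remove all but the last `keep_last` output models of a task."""
    if remove is None:
        # Imported lazily so the sketch can be read and tested without clearml
        from clearml import Model

        def remove(model):
            # delete_weights_file=True also deletes the stored weights file
            Model.remove(model, delete_weights_file=True)

    models = list(task.models["output"])
    # Assumption: the list is ordered oldest first, newest last
    stale = models[:-keep_last] if keep_last > 0 else models
    for model in stale:
        remove(model)
    return stale
```

Something like prune_output_models(task, keep_last=5) would then leave only the five most recent models, assuming the ordering holds.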