I Have An Issue With How Clearml Logs Checkpoints. We Have A Training Setup With Pytorch-Lightning + Clearml, Where We Use

I have an issue with how clearml logs checkpoints.

We have a training setup with pytorch-lightning + clearml, where we use lightning.pytorch.callbacks.ModelCheckpoint for model checkpointing. Now, I would like to use clearml.OutputModels for storing the model configuration and weights. However, when I just use the ModelCheckpoint callback with save_top_k=1 (saving only the best checkpoint, but with varying filenames), Lightning removes the old checkpoint (say epoch_001.ckpt) and stores the new one (epoch_002.ckpt) instead. In ClearML, these two checkpoints show up as separate OutputModels, and our fileserver gets overloaded.

I have written this simple extension of ModelCheckpoint that supports the single-checkpoint case, but I would rather it be something built into ClearML. Custom code is cool, but it creates friction when you have to override lots of hidden functions.

from lightning import LightningModule, Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from clearml import Task, OutputModel

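class ClearMLModelCheckpoint(ModelCheckpoint):
    """
    Callback that extends the functionality of the `ModelCheckpoint` callback
    for saving the best model during training using ClearML.

    Args:
        The same as `ModelCheckpoint`.

    Note:
    - Currently only supports saving a single model.
    """

    def on_train_start(self, trainer: Trainer, pl_module: LightningModule) -> None:
        super().on_train_start(trainer, pl_module)
        task: Task = Task.current_task()
        # Assumes the task has a "configuration" artifact containing a "model" entry.
        model_config = task.artifacts["configuration"].get()["model"]
        # A single OutputModel is created once and reused for every checkpoint.
        self.output_model = OutputModel(task, config_dict=model_config)

    def _save_checkpoint(self, trainer: Trainer, filepath: str) -> None:
        super()._save_checkpoint(trainer, filepath)
        # Assumed step: push the new file to the same OutputModel instead of creating a new one.
        # auto_delete_file=False leaves the local checkpoint for Lightning to manage.
        self.output_model.update_weights(weights_filename=filepath, auto_delete_file=False)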

Has anyone been working on something similar, or found a good solution for model checkpointing that doesn't involve a parallel implementation of ModelCheckpoint?

Posted 5 months ago

Answers 3

Thanks for responding, @SuccessfulKoala55. Good question! One solution could be to create a new open-source project with lightning + clearml integrations and link it to the Lightning ecosystem-ci. I believe most people use the basic tensorboard logger with ClearML, but the extended use case of a ClearML model checkpoint callback might make it valuable.

I guess one would have to disable auto-logging of pytorch checkpoints for the callback to work, so that would be a part of that solution.
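Disabling the automatic PyTorch checkpoint logging mentioned above could look roughly like the sketch below. It relies on the `auto_connect_frameworks` argument of `Task.init`, which accepts a dictionary of per-framework switches; the helper names `task_init_kwargs` and `init_task`, as well as the project and task names in the usage note, are hypothetical and only serve the illustration.

```python
from typing import Any, Dict


def task_init_kwargs(project_name: str, task_name: str) -> Dict[str, Any]:
    """Build keyword arguments for clearml.Task.init with PyTorch auto-logging off."""
    return {
        "project_name": project_name,
        "task_name": task_name,
        # Switch off only the PyTorch binding, so torch.save calls made by
        # Lightning are no longer registered as separate output models.
        # Frameworks not listed here keep their default auto-logging.
        "auto_connect_frameworks": {"pytorch": False},
    }


def init_task(project_name: str, task_name: str):
    """Create the ClearML task; a checkpoint callback would then log models itself."""
    # Imported lazily so the helper above can be inspected without clearml installed.
    from clearml import Task

    return Task.init(**task_init_kwargs(project_name, task_name))
```

Calling `init_task("examples", "lightning-checkpoints")` before building the Trainer would leave model registration entirely to the custom callback, while other integrations such as tensorboard logging stay enabled.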

It doesn't look like there is a precedent for including framework-specific loggers/callbacks within ClearML (like the pytorch-ignite logger).


Posted 5 months ago

The lightning folks won't include new loggers anymore (since mid-2022) 🙂

Posted 5 months ago

Hi @GiganticMole91, this looks interesting. How do you think you'd like to see this included in ClearML?

Posted 5 months ago