I Have An Issue With How Clearml Logs Checkpoints. We Have A Training Setup With Pytorch-Lightning + Clearml, Where We Use

Answered

I have an issue with how clearml logs checkpoints.

We have a training setup with pytorch-lightning + clearml, where we use lightning.pytorch.callbacks.ModelCheckpoint for model checkpointing. Now, I would like to use clearml.OutputModel for storing the model configuration and weights. When I'm just using the ModelCheckpoint callback with save_top_k=1 (saving only the best checkpoint, but with varying filenames), Lightning removes the old checkpoint (say epoch_001.ckpt) and stores the new one (epoch_002.ckpt) instead. In ClearML, however, these two checkpoints show up as separate OutputModels, and our fileserver gets overloaded.

I have written this simple extension of ModelCheckpoint that supports the single-checkpoint case, but I would rather it be something built into ClearML. Custom code is cool, but it creates friction when you have to override lots of hidden functions:

from lightning import LightningModule, Trainer
from lightning.pytorch.callbacks import ModelCheckpoint
from clearml import Task, OutputModel

class ClearMLModelCheckpoint(ModelCheckpoint):
    """
    Callback that extends the functionality of the `ModelCheckpoint` callback
    for saving the best model during training using ClearML.

    Args:
        The same as `ModelCheckpoint`.

    Notes:
        Currently only supports saving a single model.
    """

    def on_train_start(self, trainer: Trainer, pl_module: LightningModule) -> None:
        super().on_train_start(trainer, pl_module)
        task: Task = Task.current_task()
        # Specific to our setup: the task stores a "configuration" artifact with a "model" entry.
        model_config = task.artifacts["configuration"].get()["model"]
        # Create one OutputModel per run and reuse it for every checkpoint.
        self.output_model = OutputModel(task, config_dict=model_config)

    def _save_checkpoint(self, trainer: Trainer, filepath: str) -> None:
        # Let Lightning write the checkpoint, then update the weights of the same OutputModel.
        super()._save_checkpoint(trainer, filepath)
        self.output_model.update_weights(filepath)

Has anyone been working on something similar, or found a good solution for model checkpointing that doesn't involve a parallel implementation of ModelCheckpoint?

  				
Posted 11 months ago by GiganticMole91


Answers 3

Hi @GiganticMole91, this looks interesting. How would you like to see this included in ClearML?

  				
Posted 11 months ago by SuccessfulKoala55

Thanks for responding @SuccessfulKoala55. Good question! One solution could be to create a new open-source project with lightning + clearml integrations and link it to the Lightning ecosystem-ci. I believe most people use the basic tensorboard logger with ClearML, but the extended use case of a ClearML model checkpoint callback might make it valuable.

I guess one would have to disable auto-logging of pytorch checkpoints for the callback to work, so that would be a part of that solution.
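Disabling the auto-logging could look roughly like the sketch below. It relies on the `auto_connect_frameworks` argument of `Task.init`. The helper name `init_task` and the constant `AUTO_CONNECT_FRAMEWORKS` are made up for illustration, and I am assuming that Lightning checkpoints are picked up through ClearML's PyTorch hook, so turning off the `pytorch` key should stop them from being registered automatically.

```python
# Switch off only the PyTorch auto-logging. According to the ClearML docs,
# frameworks that are not listed in the dict keep their default (enabled).
AUTO_CONNECT_FRAMEWORKS = {"pytorch": False}


def init_task(project_name: str, task_name: str):
    """Create a ClearML task that does not auto-register PyTorch checkpoints."""
    # Imported here so the sketch can be loaded without ClearML installed.
    from clearml import Task

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        auto_connect_frameworks=AUTO_CONNECT_FRAMEWORKS,
    )
```

Calling `init_task("my_project", "my_run")` (placeholder names) before building the Trainer would then leave the checkpoint callback as the only thing that uploads weights.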

It doesn't look like there is a precedent for including framework-specific loggers/callbacks within ClearML (like the pytorch-ignite logger).

WDYT?

  				
Posted 11 months ago by GiganticMole91

The lightning folks won't include new loggers anymore (since mid-2022, see None ) 🙂

  				
Posted 11 months ago by GiganticMole91
