I have experiments training PyTorch networks on a remote compute node run by clearml-agent.
I am using the Ignite framework to train image classification networks, and I can see that it integrates nicely with the ClearML experiment: the metrics reported by the TensorBoard logger I have running to capture training metrics are logged automatically.
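For reference, the logging side of my setup looks roughly like the sketch below (the project/task names, log directory and the trainer variable are placeholders rather than my exact code):

from clearml import Task
from ignite.contrib.handlers.tensorboard_logger import TensorboardLogger, OutputHandler
from ignite.engine import Events

# Once the ClearML task is initialised, anything written by the TensorBoard
# logger gets picked up automatically and reported to clearml-server.
task = Task.init(project_name="CUB200", task_name="resnet50_baseline")

tb_logger = TensorboardLogger(log_dir="tb_logs")
tb_logger.attach(
    trainer,  # my Ignite training engine
    log_handler=OutputHandler(tag="training", output_transform=lambda loss: {"loss": loss}),
    event_name=Events.ITERATION_COMPLETED,
)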
To capture the model weights during training, I have replaced the standard Ignite checkpoint handler I was using to save the best model with the ClearML handler, ClearMLSaver(), which creates artefacts on the clearml-server experiment object.
from ignite.engine import Events
from ignite.handlers import Checkpoint, global_step_from_engine
from ignite.contrib.handlers.clearml_logger import ClearMLSaver

# Keep only the single best checkpoint, scored by validation accuracy
val_checkpointer = Checkpoint(
    {"model": self.model},
    ClearMLSaver(),
    n_saved=1,
    score_function=score_function_acc,
    score_name="val_acc",
    filename_prefix='cub200_{}_ignite_best'.format(self.config.MODEL.MODEL_NAME),
    global_step_transform=global_step_from_engine(self.train_engine),
)
self.evaluator.add_event_handler(Events.EPOCH_COMPLETED, val_checkpointer)
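For what it's worth, my reading of the docs is that the upload destination is governed by the output_uri argument to Task.init, which I have not set; here is a sketch of what I think is meant (the project/task names are placeholders):

from clearml import Task

# My understanding: with output_uri set (True = the clearml-server file server),
# the .pth files written by ClearMLSaver should be uploaded as well, rather than
# only registered as paths local to the agent machine. Please correct me if not.
task = Task.init(
    project_name="CUB200",
    task_name="resnet50_baseline",
    output_uri=True,
)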
I can see the artefact listed on the experiment object in clearml-server.
Does this mean the model weights are stored on the clearml-server file system?
Is there something I need to do to get the model weights file (.pth) actually uploaded to the server?
Or is the design that the data does not sit on the server, and the server only holds references to the file, which is why having centralised storage, such as a cloud storage container, is a good thing to implement?
I haven't set up any external storage, such as Azure Blob Storage; is this something that is recommended?
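If external storage is the recommended route, I'm guessing it would just be a matter of pointing output_uri at the blob container and adding the account credentials to clearml.conf; a sketch with made-up names, since I'm not sure of the exact URI format ClearML expects:

from clearml import Task

# Hypothetical Azure destination; credentials would presumably go in ~/clearml.conf
# under sdk.azure.storage.containers (account_name / account_key / container_name).
task = Task.init(
    project_name="CUB200",
    task_name="resnet50_baseline",
    output_uri="azure://mystorageaccount.blob.core.windows.net/clearml-models",
)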
How do other people set up the storage for the clearml-server?