I Have Got Experiments Training Pytorch Networks On A Remote Compute Run By

Answered

I have got experiments training PyTorch networks on a remote compute run by clearml-agent.

I am using the Ignite framework to train image classification networks, and can see that it integrates nicely with the ClearML experiment: the metrics reported by the TensorBoard logger I have running are logged automatically.

To capture the model weights during training, I have replaced the standard Ignite model checkpointer, which I was using to save the best model, with the ClearML handler ClearMLSaver(), which creates artefacts in the clearml-server experiment object.

val_checkpointer = Checkpoint(
    {"model": self.model},
    ClearMLSaver(),
    n_saved=1,
    score_function=score_function_acc,
    score_name="val_acc",
    filename_prefix='cub200_{}_ignite_best'.format(self.config.MODEL.MODEL_NAME),
    global_step_transform=global_step_from_engine(self.train_engine),
)
self.evaluator.add_event_handler(Events.EPOCH_COMPLETED, val_checkpointer)
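
For reference, the snippet uses score_function_acc without showing its definition. A minimal sketch of what such a function might look like, assuming the evaluator has an accuracy metric attached under the name "accuracy" (the metric name and the stand-in evaluator object are assumptions for illustration, not taken from the post):

```python
from types import SimpleNamespace


def score_function_acc(engine):
    """Return the validation accuracy; Checkpoint keeps the highest-scoring model(s)."""
    # "accuracy" is an assumed metric name; use whatever key the evaluator's metrics use.
    return engine.state.metrics["accuracy"]


# Hypothetical stand-in for an Ignite evaluator after a validation run.
fake_evaluator = SimpleNamespace(state=SimpleNamespace(metrics={"accuracy": 0.87}))
print(score_function_acc(fake_evaluator))
```

With n_saved=1, the checkpoint handler would keep only the model with the highest value returned by this function.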
I can see the artefact being stored on the clearml-server object.
Does this mean the model weights are stored on the clearml-server file system?
Is there something I need to do to get the model weights file (.pth) actually uploaded to the server?
Or is the design that the data does not sit with the server, which only holds references to the files, and that is why having a centralised storage container, such as cloud storage, is a good thing to implement?

I haven't set up any external storage, such as Azure Blobstore. Is this something that is recommended?
How do other people set up the storage for the clearml-server?

  				
Posted 3 years ago by VivaciousPenguin66


Answers 2

AgitatedDove14 Brilliant!
I will try this, thank you sir!

  				
Posted 3 years ago by VivaciousPenguin66

Does this mean the model weights are stored on the clearml-server file system?

By default they are only logged (i.e. the local path is stored, but the file is not uploaded). If you want the model to be stored automatically, pass output_uri=True to Task.init, or pass any object store / shared folder (e.g. output_uri='s3://bucket/folder'). ClearML will automatically create a subfolder for the Task and upload all models/artifacts to it.
task = Task.init(project_name='examples', task_name='ignite', output_uri=True)  # upload to the file server
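
The two options above (file server vs. an object store or shared folder) can be sketched as a small helper. The helper names task_init_kwargs and init_task are hypothetical and only for illustration; the project name, task name and bucket path are the placeholder values from the answer:

```python
def task_init_kwargs(storage_uri=None):
    """Build keyword arguments for clearml.Task.init (hypothetical helper).

    storage_uri=None                  -> output_uri=True (upload to the ClearML file server)
    storage_uri="s3://bucket/folder"  -> upload to that object store / shared folder
    """
    return {
        "project_name": "examples",
        "task_name": "ignite",
        "output_uri": True if storage_uri is None else storage_uri,
    }


def init_task(storage_uri=None):
    # Imported lazily so the helper above can be inspected without clearml installed.
    from clearml import Task

    return Task.init(**task_init_kwargs(storage_uri))


# Show the arguments that would be passed in each case.
print(task_init_kwargs())
print(task_init_kwargs("s3://bucket/folder"))
```

Calling init_task() at the start of the training script, before the Checkpoint handler is created, should be enough for the saved weights to be uploaded rather than only referenced by local path.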

  				
Posted 3 years ago by AgitatedDove14
