As AnxiousSeal95 says, the clearml server will version a dataset for you and push it to a unified storage location, as well as make it diffable between versions.
I’ve written a workshop on how to train image classifiers for the problem of bird species identification and recently I’ve adapted it to work with clearml.
There is an example notebook showing how to upload a dataset to the clearml server, in this case a directory of images. See here: https://github.com/ecm200/caltech_birds/blob/master/notebooks/clearml_add_new_dataset.ipynb
On the training script side, you need to make a local copy of the dataset before training. If you keep the same directory for cached datasets, clearml will check whether the dataset version has changed, and if not it will use the already cached version. If it has changed, or no cached copy exists, it will automatically download it. This is achieved as follows:
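For reference, the upload step can be sketched roughly as below. This is a minimal sketch, not the code from the linked notebook: the helper name `upload_image_directory` and its parameters are my own, and it assumes the standard `Dataset.create` / `add_files` / `upload` / `finalize` flow from the clearml SDK with a server already configured.

```python
from pathlib import Path


def upload_image_directory(image_dir, project, name, output_url=None):
    """Register a directory of images as a new clearml dataset version.

    Hypothetical helper for illustration; returns the new dataset ID.
    """
    image_dir = Path(image_dir)
    # Fail early, before talking to the server, if the directory is missing.
    if not image_dir.is_dir():
        raise NotADirectoryError(f"{image_dir} is not a directory")

    # Imported here so the check above works even without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(dataset_name=name, dataset_project=project)
    dataset.add_files(path=str(image_dir))
    # output_url=None uses the default storage configured for the server.
    dataset.upload(output_url=output_url)
    # Finalizing closes this version so it can be fetched with Dataset.get().
    dataset.finalize()
    return dataset.id
```

You would call it with your own image folder, project and dataset name, and optionally a storage URL if you want the files to go somewhere other than the default file server.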
` from clearml import Dataset
# Get the dataset from the clearml-server and cache locally.
print('[INFO] Getting a local copy of the CUB200 birds datasets')
# Train
train_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_train_dataset__AZURE_BLOB_VERSION')
print('[INFO] Default location of training dataset:: {}'.format(train_dataset.get_default_storage()))
train_dataset_base = train_dataset.get_local_copy()
print('[INFO] Local cached copy of training dataset:: {}'.format(train_dataset_base)) `
This code snippet caches the dataset locally.
The other thing you then need to do is pass the cached dataset location (the path returned by get_local_copy()) to your training code before executing model training.
You can find the example in this training script, which sets up a PyTorch Ignite training job on the clearml server. This can then be executed on remote compute by clearml-agents via the server queue. The script caches the dataset locally and then gets the cached dataset locations, overriding the default local locations.
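The "override the default locations" step can be sketched like this. It is only an illustration of the idea, not the code from the training script: `resolve_data_dirs` and `fetch_cached_copies` are hypothetical helper names, and the split-to-directory dicts are an assumed way of holding the config.

```python
from pathlib import Path


def resolve_data_dirs(defaults, cached_copies):
    """Return data directories, preferring cached dataset copies over defaults.

    defaults:      dict of split name -> default local directory
    cached_copies: dict of split name -> path returned by get_local_copy()
    """
    resolved = {split: Path(path) for split, path in defaults.items()}
    for split, cached in cached_copies.items():
        if cached:  # keep the default when no cached copy was returned
            resolved[split] = Path(cached)
    return resolved


def fetch_cached_copies(project, names):
    """Fetch each named dataset and return split name -> local cache path."""
    # Lazy import: this part needs clearml installed and a configured server.
    from clearml import Dataset

    return {
        split: Dataset.get(dataset_project=project, dataset_name=name).get_local_copy()
        for split, name in names.items()
    }
```

In the training script you would call `fetch_cached_copies('Caltech Birds', {'train': 'cub200_2011_train_dataset__AZURE_BLOB_VERSION'})` and feed the result into `resolve_data_dirs` together with whatever default directories your config defines. When the job runs on a remote agent, the data loaders then point at the agent's cache rather than at paths from your own machine.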