Like AnxiousSeal95 says, clearml server will version a dataset for you and push it to a unified storage place, as well as make it differenceable.
I’ve written a workshop on how to train image classifiers for the problem of bird species identification and recently I’ve adapted it to work with clearml.
There is an example workbook on how to upload a dataset to clearml server, in this a directory of images. See here: https://github.com/ecm200/caltech_birds/blob/master/notebooks/clearml_add_new_dataset.ipynb
On the training script side, you need to make a local copy of the dataset before training. If you keep the same directory for cached datasets then clearml will check to see if the dataset version has changed, and if not it will used an already cached version. If it has, or it doesn’t exist, it will automatically download it. This is achieved as follows:
` # Get the dataset from the clearml-server and cache locally.
print('[INFO] Getting a local copy of the CUB200 birds datasets')
Train
train_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_train_dataset__AZURE_BLOB_VERSION')
print('[INFO] Default location of training dataset:: {}'.format(train_dataset.get_default_storage())
train_dataset_base = train_dataset.get_local_copy()
print('[INFO] Default location of training dataset:: {}'.format(train_dataset_base)) `
This code snippet will get the dataset cached locally.
The other thing you need to do then is to get the cached dataset locations before executing model training.
You can find the example in this training script which sets up a PyTorch Ingite training job on the clearml server. This can then be executed on remote compute by clearml-agents via the server queue, and the script will cache the dataset locally and then get the cached dataset locations, overriding the default local locations.
tf datasets is able to handle batch downloading quite well.
SubstantialElk6 I was not aware of that, I was under the impression tf dataset is accessed on a file level, no?
Hi thanks for the examples! I will look into them. Quite a fair bit of my teams uses tf datasets to pull data directly from object stores, so tfrecords and stuff are heavily involved. I'm trying to figure if they should version the raw data or the tfrecords with ClearML, and if downloading entire set of data to local can be avoided as tf datasets is able to handle batch downloading quite well.
To add onto what Martin wrote, you can see here: https://clear.ml/docs/latest/docs/guides/data%20management/data_man_cifar_classification
How it's interfaced with a torch dataloader. You only replace the path for where the files come from
Hi SubstantialElk6ClearML-Data
doesn't actually "load" the data, it brings it locally and returns a folder with all your data files, from that point onward, it's up to your code to load it to the framework. Make sense ?