Hi, I Have A Question About Clearml-Data. Clearml-Data Probably Does Well On Data Versioning, But When It Comes To Actual Loading Of Data, Are There Examples Of How It Can Make Use Of Advanced Features Such That Those In

Hi, I have a question about clearml-data. Clearml-Data probably does well on data versioning, but when it comes to actually loading the data, are there examples of how it can make use of advanced features such as those in torch.utils.data.DataLoader and tf.data.Dataset?

  
  
Posted 3 years ago

Answers 5


Hi SubstantialElk6
ClearML-Data doesn't actually "load" the data. It brings the data to the local machine and returns a folder with all your data files; from that point onward, it's up to your code to load it into the framework. Make sense?
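A minimal sketch of that flow, with some assumptions: the helper names, the project and dataset names passed in, and the file suffixes are all placeholders, not something from this thread. `Dataset.get(...).get_local_copy()` hands back a folder path, and everything after that is ordinary Python.

```python
from pathlib import Path


def fetch_dataset_folder(project, name):
    """Download (or reuse the cached copy of) a ClearML dataset and return its folder."""
    # Imported lazily so list_files() below can be used without clearml installed.
    from clearml import Dataset

    return Path(Dataset.get(dataset_project=project, dataset_name=name).get_local_copy())


def list_files(folder, suffixes=(".jpg", ".png")):
    """Collect data files under the folder; ClearML is no longer involved at this point."""
    folder = Path(folder)
    return sorted(
        p for p in folder.rglob("*") if p.is_file() and p.suffix.lower() in suffixes
    )
```

The list returned by `list_files` is what you would then hand to whatever loader your framework provides.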

  
  
Posted 3 years ago

As AnxiousSeal95 says, the clearml server will version a dataset for you, push it to a unified storage location, and make it possible to diff one version against another.

I’ve written a workshop on how to train image classifiers for the problem of bird species identification and recently I’ve adapted it to work with clearml.

There is an example notebook on how to upload a dataset to the clearml server, in this case a directory of images. See here: https://github.com/ecm200/caltech_birds/blob/master/notebooks/clearml_add_new_dataset.ipynb

On the training script side, you need to make a local copy of the dataset before training. If you keep the same directory for cached datasets, clearml checks whether the dataset version has changed and, if not, uses the already cached version. If the version has changed, or no cached copy exists, it downloads the dataset automatically. This is achieved as follows:

```python
from clearml import Dataset

# Get the dataset from the clearml-server and cache it locally.
print('[INFO] Getting a local copy of the CUB200 birds datasets')

# Train
train_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_train_dataset__AZURE_BLOB_VERSION')
# Where the dataset is stored remotely
print('[INFO] Default location of training dataset:: {}'.format(train_dataset.get_default_storage()))
# Folder holding the locally cached copy
train_dataset_base = train_dataset.get_local_copy()
print('[INFO] Local copy of training dataset:: {}'.format(train_dataset_base))
```
This code snippet will get the dataset cached locally.

The other thing you need to do then is to get the cached dataset locations before executing model training.
You can find an example in this training script, which sets up a PyTorch Ignite training job on the clearml server. The job can then be executed on remote compute by clearml-agents via the server queue; the script caches the dataset locally and then gets the cached dataset locations, overriding the default local locations.

See here: https://github.com/ecm200/caltech_birds/blob/master/scripts/train_clearml_pytorch_ignite_caltech_birds.py
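The override step could look roughly like this. It is only a sketch and not the code from the linked script: the config keys (`train_dir`, `test_dir`), the paths, and the `override_data_dirs` helper are hypothetical names chosen for illustration.

```python
def override_data_dirs(config, cached_dirs):
    """Return a copy of config with data directories replaced by cached dataset folders."""
    updated = dict(config)
    for key, path in cached_dirs.items():
        if path:  # keep the default when no cached copy was fetched for this key
            updated[key] = str(path)
    return updated


# Hypothetical defaults, as they might appear in a training config.
defaults = {"train_dir": "data/train", "test_dir": "data/test", "batch_size": 32}

# In a real job these values would come from Dataset.get(...).get_local_copy().
cached = {"train_dir": "/tmp/clearml_cache/train", "test_dir": None}

config = override_data_dirs(defaults, cached)
print(config["train_dir"])
```

Returning a copy rather than mutating the defaults keeps the original configuration available, for example for logging what was overridden.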

  
  
Posted 3 years ago

Hi, thanks for the examples! I will look into them. Quite a few of my teams use tf datasets to pull data directly from object stores, so tfrecords and the like are heavily involved. I'm trying to figure out whether they should version the raw data or the tfrecords with ClearML, and whether downloading the entire dataset to local storage can be avoided, as tf datasets is able to handle batch downloading quite well.
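For the "version the tfrecords" option, one possible shape is sketched below. It assumes the shards end in `.tfrecord` and sit somewhere under the folder returned by `get_local_copy()`; the helper names are made up. Note that this still downloads the whole dataset version first, so it does not settle the streaming question.

```python
from pathlib import Path


def find_tfrecord_shards(folder):
    """Return sorted TFRecord shard paths found under a local dataset folder."""
    return sorted(str(p) for p in Path(folder).rglob("*.tfrecord"))


def make_tf_dataset(shards, batch_size=32):
    """Build a tf.data pipeline over local shards (requires TensorFlow)."""
    import tensorflow as tf  # lazy import: only needed when the pipeline is built

    # Records are still serialized here; parsing them depends on your own schema.
    dataset = tf.data.TFRecordDataset(shards)
    return dataset.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Usage would be along the lines of `make_tf_dataset(find_tfrecord_shards(local_folder))`, where `local_folder` is the path returned by `get_local_copy()`.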

  
  
Posted 3 years ago

tf datasets is able to handle batch downloading quite well.

SubstantialElk6, I was not aware of that. I was under the impression that a tf dataset is accessed at the file level, no?

  
  
Posted 3 years ago

To add to what Martin wrote, this guide shows how it interfaces with a torch DataLoader: https://clear.ml/docs/latest/docs/guides/data%20management/data_man_cifar_classification
You only replace the path the files come from.
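To illustrate the "only replace the path" point, here is a small sketch. It is not the code from the linked guide: `FolderDataset` and `make_loader` are hypothetical names, and the one-subfolder-per-class layout is an assumption. A map-style dataset only needs `__len__` and `__getitem__`, so the class itself has no torch dependency.

```python
from pathlib import Path


class FolderDataset:
    """Map-style dataset: one subfolder per class, items are (file_path, class_index)."""

    def __init__(self, root):
        root = Path(root)
        self.classes = sorted(d.name for d in root.iterdir() if d.is_dir())
        self.samples = [
            (str(f), idx)
            for idx, cls in enumerate(self.classes)
            for f in sorted((root / cls).iterdir())
            if f.is_file()
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        return self.samples[i]


def make_loader(root, batch_size=4):
    """Wrap the dataset in a torch DataLoader (requires PyTorch)."""
    from torch.utils.data import DataLoader  # lazy import

    return DataLoader(FolderDataset(root), batch_size=batch_size, shuffle=True)
```

Only `root` changes between the two setups: a hard-coded local path before, the folder returned by `Dataset.get(...).get_local_copy()` after.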

  
  
Posted 3 years ago