Answered
Is There Any Way To Get Just One Dataset Folder Of A Dataset? E.G. Only "Train" Or Only "Dev"?

Is there any way to get just one dataset folder of a Dataset? e.g. only "train" or only "dev"?

  
  
Posted 2 years ago

Answers 7


Lately I've heard of groups that do slices of datasets for distributed training, or who "stream" data.

Hmm, so maybe a "glob"-like parameter, e.g. get_local_copy(select_filter='subfolder/*')?
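Until something like that exists, a client-side workaround might look like the sketch below. It still downloads the whole dataset and only filters the local copy afterwards, so it saves no bandwidth or disk space. `select_files` is a hypothetical helper, not part of ClearML, and `select_filter` above is only a proposed parameter.

```python
from fnmatch import fnmatchcase
from pathlib import Path


def select_files(local_copy_root, pattern):
    """Return files under local_copy_root whose relative path matches a glob pattern.

    Hypothetical helper (not a ClearML API). Paths are compared in POSIX form
    relative to the root, e.g. "train/img_001.png". Note that fnmatch's "*"
    also matches "/", so "train/*" includes files in nested subfolders.
    """
    root = Path(local_copy_root)
    matches = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root).as_posix()
        if fnmatchcase(relative, pattern):
            matches.append(path)
    return matches


# Assumed usage with ClearML (the full dataset is still downloaded first):
#   from clearml import Dataset
#   root = Dataset.get(dataset_id="<your-dataset-id>").get_local_copy()
#   train_files = select_files(root, "train/*")
```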

  
  
Posted 2 years ago

Well, in my particular case the training data's got, like, 200 subfolders, each with 2,000 files. I was just curious whether it was possible to pull down just one of the subsets.

  
  
Posted 2 years ago

I suppose I could upload 200 different "datasets", rather than one dataset with 200 folders in it, but then clearml-data search would have 200 entries in it? It seemed like a good idea at the time to put them all in one.

  
  
Posted 2 years ago

Hi SmallDeer34 👋

The dataset task will download the whole dataset when using clearml-data. Do you have both folders in the same dataset?

  
  
Posted 2 years ago

Is there any way to get just one dataset folder of a Dataset? e.g. only "train" or only "dev"?

They are usually stored in the same "zip" so basically you have to download both folders anyhow, but I guess if this saves space we could add this functionality, wdyt?

  
  
Posted 2 years ago

It would certainly be nice to have. Lately I've heard of groups that do slices of datasets for distributed training, or who "stream" data.
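For what it's worth, the "slices for distributed training" idea can be approximated on the client side once a file list is available. This is only a sketch under my own assumptions: `shard_for_worker`, `rank`, and `world_size` are illustrative names, not ClearML features, and each worker would still need access to the files in its slice.

```python
def shard_for_worker(files, rank, world_size):
    """Return the slice of files assigned to one worker (round-robin split).

    Hypothetical helper: sorting first makes every worker see the same order,
    so the shards are disjoint and together cover the whole list.
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return sorted(files)[rank::world_size]
```

With five files and two workers, worker 0 would get the 1st, 3rd, and 5th file in sorted order and worker 1 the other two.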

  
  
Posted 2 years ago

Any reason to not have those as two datasets?

  
  
Posted 2 years ago