Answered

Hi, I have a future roadmap question on clearml-datasets. The current implementation works well for small datasets, but it's rather inefficient for very large ones. For example, say I have 10 million images just for the training dataset, and my training batch size is only 64. I would have to pull all 10 million images if I execute Dataset.get(dataset_id='myid').get_local_copy(). This causes two issues:

  1. Storage limits on the training server.
  2. Inefficiency: the time spent pulling the images is time when the GPU is not utilised.

Ideally, we should be able to specify the batch size that we want to download, or even better, tie this in with the training by parallelising the data download, data preprocessing, and batch training.
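To make the parallelisation idea concrete, here is a minimal plain-Python sketch (no ClearML API involved; all names are illustrative): a background thread "downloads" chunks into a bounded queue while the training loop consumes them, so download, preprocessing, and training overlap instead of running back-to-back.

```python
import queue
import threading
import time

def download_chunks(chunk_ids, q):
    """Simulate downloading dataset chunks in the background."""
    for cid in chunk_ids:
        time.sleep(0.01)          # stand-in for network I/O
        q.put(f"chunk-{cid}")     # hand the chunk to the trainer
    q.put(None)                   # sentinel: no more chunks

def train(q):
    """Consume chunks as they arrive instead of waiting for the full dataset."""
    seen = []
    while True:
        chunk = q.get()
        if chunk is None:
            break
        seen.append(chunk)        # stand-in for a training step on this chunk
    return seen

q = queue.Queue(maxsize=2)        # bounded queue caps local storage use
t = threading.Thread(target=download_chunks, args=(range(5), q))
t.start()
result = train(q)
t.join()
print(result)                     # every chunk trained on, none held all at once
```

The bounded queue is the key design choice here: it limits how many chunks sit on local disk at any time, which addresses the storage concern as well as the GPU idle time.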

  
  
Posted 3 years ago

Answers 6


Thanks AgitatedDove14 , will take a look.

  
  
Posted 3 years ago

SubstantialElk6 I just realized 3 weeks passed, wow!
So, the good news: we have some new examples:
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_functions.py
The bad news: the documentation was postponed a bit, as we are still massaging the interface (the community is constantly pushing great ideas and use cases, and they are just too good to miss out on 🙂).
We added nested components, callbacks, and metric/artifact/model auto-logging:
https://github.com/allegroai/clearml/blob/b010f775bdd72ba6729f5e1e569626692d7b18af/clearml/automation/controller.py#L454

I'm hopeful that we will be able to push an initial version next week.
Please ping if you hear nothing; we appreciate it, and it really helps with prioritizing things 🙂

  
  
Posted 3 years ago

Yes! I definitely think this is important, and hopefully we will see something there

(or at least in the docs)

Hi AgitatedDove14 , any updates in the docs to demonstrate this yet?

  
  
Posted 3 years ago

Would you have an example of this in your code blogs to demonstrate this utilisation?

Yes! I definitely think this is important, and hopefully we will see something there 🙂 (or at least in the docs)

  
  
Posted 3 years ago

This one can be solved with shared cache + pipeline step, refreshing the cache in the shared cache machine.

Would you have an example of this in your code blogs to demonstrate this utilisation?

  
  
Posted 3 years ago

Hi SubstantialElk6
Quick update: once ClearML 1.1 is out, we will push the clearml-data improvement, supporting chunks per version (i.e. packaging the changeset into multiple zip files, instead of a single one as the current version does).

Regarding (1), the storage limit on the training server:

Ideally, we should be able to specify the batch size that we want to download, or even better, tie this in with the training by parallelising the data download, data preprocessing and batch trains.

With the next version you will be able to download a partial dataset (i.e. only selected chunks), which should help with the issue.
That said, the best solution is to configure a shared cache for all instances (both the open-source and Enterprise versions support it, with some efficiency improvements in the Enterprise version).
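Once a dataset is packaged as multiple chunks, each trainer only needs its own slice of them. A minimal sketch of the chunk-assignment arithmetic (the `chunks_for_worker` helper is illustrative, not part of the clearml API; how the selected chunks are actually requested would depend on the partial-download interface mentioned above):

```python
def chunks_for_worker(num_chunks, num_workers, worker_id):
    # Round-robin assignment: each trainer downloads only its share
    # of the dataset's zip chunks instead of the whole changeset.
    return [i for i in range(num_chunks) if i % num_workers == worker_id]

# e.g. a dataset packaged into 10 chunks, pulled by 3 trainers
print(chunks_for_worker(10, 3, 0))  # chunks assigned to trainer 0
```

Every chunk is covered by exactly one worker, so together the trainers see the full dataset while each machine stores only about 1/num_workers of it.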

  2. Inefficiency: the time to pull the images is time when the GPU is not utilised.

This one can be solved with shared cache + pipeline step, refreshing the cache in the shared cache machine.
wdyt?
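For reference, the shared-cache idea boils down to a content-addressed lookup before any download. A minimal, library-free sketch (the names and layout are illustrative assumptions, not ClearML's actual cache implementation):

```python
import hashlib
import os
import shutil
import tempfile

def cached_fetch(url, cache_dir, downloader):
    """Return the artifact bytes, downloading only on a cache miss."""
    key = hashlib.sha256(url.encode()).hexdigest()  # content-addressed key
    path = os.path.join(cache_dir, key)
    if not os.path.exists(path):
        data = downloader(url)        # only hit the network on a miss
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            f.write(data)
        os.replace(tmp, path)         # atomic rename: readers never see partial files
    with open(path, "rb") as f:
        return f.read()

# Demo: the second fetch is served from the cache, not re-downloaded
calls = []
def fake_downloader(url):
    calls.append(url)
    return b"image-bytes"

cache_dir = tempfile.mkdtemp()
cached_fetch("s3://bucket/train.zip", cache_dir, fake_downloader)
data = cached_fetch("s3://bucket/train.zip", cache_dir, fake_downloader)
shutil.rmtree(cache_dir)
print(len(calls), data)
```

When the cache directory sits on storage shared by all training machines, only the first machine pays the download cost; everyone else reads locally, which is exactly what keeps the GPUs busy.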

  
  
Posted 3 years ago