Hi, I Have A Future Roadmap Question On Clearml-Datasets. The Current Implementation Works Well For Small Datasets But It's Rather Ineffective For Very Large Datasets. For Example, Let's Say I Have 10 Million Images Just For The Training Dataset, And My T


Hi SubstantialElk6
Quick update: once clearml 1.1 is out, we will push the clearml-data improvement that adds support for chunks per version (i.e. packaging the changeset into multiple zip files instead of the single zip file the current version produces).
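
To make the "chunks per version" idea concrete, here is a minimal sketch of splitting a changeset into several zip files with a size cap. This is not the clearml-data implementation; the function name `pack_in_chunks`, the `chunk_NNNN.zip` naming and the `max_chunk_bytes` parameter are all made up for illustration.

```python
import zipfile
from pathlib import Path


def pack_in_chunks(files, out_dir, max_chunk_bytes):
    """Pack files into numbered zip archives (hypothetical helper).

    A new archive is started whenever adding the next file would push the
    uncompressed size of the current archive above max_chunk_bytes.
    A single file larger than the cap still gets an archive of its own.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    chunks = []        # paths of the archives written so far
    current = None     # the ZipFile currently being filled
    current_size = 0   # uncompressed bytes already in `current`

    try:
        for f in map(Path, files):
            size = f.stat().st_size
            needs_new = current is None or (
                current_size > 0 and current_size + size > max_chunk_bytes
            )
            if needs_new:
                if current is not None:
                    current.close()
                chunk_path = out_dir / f"chunk_{len(chunks):04d}.zip"
                current = zipfile.ZipFile(chunk_path, "w", zipfile.ZIP_DEFLATED)
                chunks.append(chunk_path)
                current_size = 0
            # Assumption: file names are unique, so the base name is used
            # as the archive member name.
            current.write(f, arcname=f.name)
            current_size += size
    finally:
        if current is not None:
            current.close()
    return chunks
```

With a layout like this, a consumer can fetch only the archives it needs instead of one monolithic zip.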

Regarding (1) storage limit server.

Ideally, we should be able to specify the batch size that we want to download, or even better, tie this in with the training by parallelising the data download, data preprocessing and batch trains.

With the next version you will be able to download a partial dataset (i.e. only selected chunks), which should help with the issue.
That said, the best solution is to configure a shared cache for all instances (both the open-source and enterprise versions support it, with some efficiency improvements in the enterprise version).
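
For the "parallelise download and training" part of the question, partial downloads can be combined with a small prefetch loop. The sketch below is plain Python and does not call clearml; `prefetch_chunks`, `fetch_chunk` and `max_prefetch` are hypothetical names, and `fetch_chunk` stands in for whatever actually downloads one chunk.

```python
import queue
import threading


def prefetch_chunks(chunk_ids, fetch_chunk, max_prefetch=2):
    """Yield (chunk_id, data) pairs while fetching ahead in the background.

    fetch_chunk is a caller-supplied function (hypothetical here) that
    downloads or loads one chunk. At most max_prefetch fetched chunks are
    buffered, so local storage use stays bounded.
    """
    buffer = queue.Queue(maxsize=max_prefetch)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for cid in chunk_ids:
                buffer.put((cid, fetch_chunk(cid)))
            buffer.put(done)
        except Exception as exc:
            # Hand the error to the consumer instead of dying silently.
            buffer.put(exc)

    # Daemon thread: if the consumer stops early, the blocked producer
    # does not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

A training loop would then iterate `for cid, data in prefetch_chunks(range(n), fetch_chunk): ...`, so the next chunk downloads while the GPU works on the current one. On the ClearML side, I believe recent releases expose this kind of partial download through the `part` and `num_parts` arguments of `Dataset.get_local_copy`, but treat that as an assumption and check the docs for your version.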

Inefficiency. The time to pull the images is the time when the GPU is not utilised.

This one can be solved with a shared cache plus a pipeline step that refreshes the cache on the shared cache machine.
wdyt?

  				
Posted 4 years ago
AgitatedDove14
