
Answered

Hi, I have a future roadmap question on clearml-datasets. The current implementation works well for small datasets, but it is rather ineffective for very large ones. For example, let's say I have 10 million images just for the training dataset, and my training batch size is only, say, 64. I would have to pull all 10 million images if I execute Dataset.get(dataset_id='myid').get_local_copy(). This causes 2 issues:
1. Storage limits on the training server.
2. Inefficiency. The time to pull the images is the time when the GPU is not utilised.
Ideally, we should be able to specify the batch size that we want to download, or even better, tie this in with the training by parallelising the data download, data preprocessing and batch training.
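
One way to picture the overlap asked for here is a bounded prefetch queue: a background thread fetches the next batches while the training loop consumes the current one. The following is only a minimal sketch in plain Python, not a ClearML feature. The names prefetch_batches, fetch and max_prefetch are hypothetical, and fetch stands in for whatever downloads and preprocesses a single item.

```python
import queue
import threading
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")

_DONE = object()  # sentinel telling the consumer that no more batches will come


def prefetch_batches(
    keys: Sequence[str],
    fetch: Callable[[str], T],
    batch_size: int = 64,
    max_prefetch: int = 2,
) -> Iterator[List[T]]:
    """Yield batches while a background thread prepares the following ones.

    `fetch` is a hypothetical callable that downloads/preprocesses one item.
    """
    # Bounded queue: the producer blocks once `max_prefetch` batches are waiting,
    # so only a few batches are ever held locally.
    buffer = queue.Queue(maxsize=max_prefetch)

    def producer() -> None:
        try:
            for start in range(0, len(keys), batch_size):
                chunk = keys[start:start + batch_size]
                buffer.put([fetch(key) for key in chunk])
            buffer.put(_DONE)
        except Exception as exc:
            # Hand the error to the consumer instead of losing it in the thread.
            buffer.put(exc)

    # Daemon thread: if the consumer stops early, the blocked producer
    # will not keep the process alive.
    threading.Thread(target=producer, daemon=True).start()

    while True:
        item = buffer.get()
        if item is _DONE:
            return
        if isinstance(item, Exception):
            raise item
        yield item


if __name__ == "__main__":
    image_keys = [f"img_{i}" for i in range(10)]
    # str.upper is a stand-in for a real download + preprocessing function.
    for batch in prefetch_batches(image_keys, fetch=str.upper, batch_size=4):
        print(len(batch), batch[0])
```

Because the queue is bounded, only a handful of batches exist locally at any time, whatever the size of the full dataset, and the GPU can work on one batch while the next is being fetched.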

  				
Posted 4 years ago by SubstantialElk6


Answers 6

Thanks AgitatedDove14, will take a look.

  				
Posted 4 years ago by SubstantialElk6

Would you have an example of this in your code blogs to demonstrate this utilisation?

Yes! I definitely think this is important, and hopefully we will see something there 🙂 (or at least in the docs)

  				
Posted 4 years ago by AgitatedDove14

SubstantialElk6, I just realized 3 weeks have passed, wow!
So the good news is that we have some new examples:
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_functions.py
The bad news is that the documentation was postponed a bit, as we are still massaging the interface (the community is constantly pushing great ideas and use cases, and they are just too good to miss out on 🙂 )
We added nested components, callbacks, and metric/artifact/model auto-logging:
https://github.com/allegroai/clearml/blob/b010f775bdd72ba6729f5e1e569626692d7b18af/clearml/automation/controller.py#L454

I'm hopeful that we will be able to push an initial version next week.
Please ping if you hear nothing, we appreciate it, and it really helps with prioritizing things 🙂

  				
Posted 4 years ago by AgitatedDove14

Yes! I definitely think this is important, and hopefully we will see something there 🙂 (or at least in the docs)

Hi AgitatedDove14, any updates in the docs to demonstrate this yet?

  				
Posted 4 years ago by SubstantialElk6

This one can be solved with shared cache + pipeline step, refreshing the cache in the shared cache machine.

Would you have an example of this in your code blogs to demonstrate this utilisation?

  				
Posted 4 years ago by SubstantialElk6

Hi SubstantialElk6
Quick update: once clearml 1.1 is out, we will push the clearml-data improvement, supporting chunks per version (i.e. packaging the changeset into multiple zip files, instead of a single one as the current version does).

Regarding (1), the storage limit on the training server:

Ideally, we should be able to specify the batch size that we want to download, or even better, tie this in with the training by parallelising the data download, data preprocessing and batch training.

With the next version you will be able to download a partial dataset (i.e. only selected chunks), which should help with the issue.
That said, the best solution is to configure a shared cache for all instances (both the open-source and Enterprise versions support it, with some efficiency improvements in the Enterprise version).
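
For the partial download of selected chunks mentioned above, here is a rough sketch of how it could look from the training side. It assumes a ClearML release where Dataset.get_local_copy() accepts the part and num_parts arguments (to my knowledge, clearml 1.1 and later); please check the SDK reference for your version. The helpers parts_for_worker and fetch_part are hypothetical names made up for this illustration, and 'myid' is the placeholder dataset id from the question.

```python
from typing import List


def parts_for_worker(worker_index: int, num_workers: int, num_parts: int) -> List[int]:
    """Round-robin assignment of dataset part indices to one worker.

    Hypothetical helper: decides which parts a given training worker pulls.
    """
    if not 0 <= worker_index < num_workers:
        raise ValueError("worker_index must be in [0, num_workers)")
    return list(range(worker_index, num_parts, num_workers))


def fetch_part(dataset_id: str, part: int, num_parts: int) -> str:
    """Download only one part of the dataset and return its local folder.

    Assumes a clearml version whose get_local_copy() supports
    `part` / `num_parts`; needs a configured ClearML server to run.
    """
    from clearml import Dataset  # imported lazily so the helper above works without clearml

    dataset = Dataset.get(dataset_id=dataset_id)
    return dataset.get_local_copy(part=part, num_parts=num_parts)


if __name__ == "__main__":
    # Worker 1 out of 4, with the dataset viewed as 10 parts.
    my_parts = parts_for_worker(worker_index=1, num_workers=4, num_parts=10)
    print(my_parts)
    # With a real server and dataset, each part would then be pulled with e.g.:
    # local_folder = fetch_part("myid", part=my_parts[0], num_parts=10)
```

Each worker then only stores the parts it was assigned, rather than the full dataset.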

Inefficiency. The time to pull the images is the time when the GPU is not utilised.

This one can be solved with a shared cache + a pipeline step that refreshes the cache on the shared cache machine.
wdyt?

  				
Posted 4 years ago by AgitatedDove14
