Hi, Can I Ask How I Can Make Clearml-Datasets In Comparison With Pytorch Datasets/Dataloader? In Particular, Pytorch Dataloaders Would Be Able To Batch Pull And Then Preprocess Data Using Multi-Cpus, Feed It Into The Training Loop And Achieve As High Util

Answered

Hi, can I ask how ClearML Datasets compare with PyTorch datasets/dataloaders? In particular, PyTorch dataloaders can pull data in batches, preprocess it using multiple CPUs, and feed it into the training loop, keeping CPU and GPU utilisation as high as possible at the same time. When using clearml-datasets, what is the best practice for achieving this? Either with PyTorch or with anything that's built into ClearML?

  				
Posted 2 years ago by SubstantialElk6


Answers 5

Hi SubstantialElk6,

That's an interesting idea. If you want to preprocess a lot of data, I think the best approach would be to use multiple datasets (one per process) or different versions of a dataset. I think you can also pull specific chunks of a dataset, in which case a single dataset would be enough, but I'm not sure about that last point.

What do you think?

  				
Posted 2 years ago by CostlyOstrich36

Although I think you can also pull specific chunks of dataset

How do you do that with clearml-data?

  				
Posted 2 years ago by SubstantialElk6

SubstantialElk6, I think this is what you're looking for:
https://clear.ml/docs/latest/docs/references/sdk/dataset#get_local_copy
Dataset.get_local_copy(..., part=X)
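A minimal sketch of how this could be split across several loader processes, assuming the `part` and `num_parts` arguments behave as described in the linked reference. `parts_for_worker` and `fetch_parts` are hypothetical helper names, not part of ClearML, and the dataset ID is something you would supply yourself:

```python
from typing import List


def parts_for_worker(worker_id: int, num_workers: int, num_parts: int) -> List[int]:
    """Round-robin assignment of part indices to one worker (hypothetical helper)."""
    return list(range(worker_id, num_parts, num_workers))


def fetch_parts(dataset_id: str, parts: List[int], num_parts: int) -> List[str]:
    """Download only the requested parts and return the local folder of each."""
    # Imported lazily so parts_for_worker can be used without clearml installed.
    from clearml import Dataset

    dataset = Dataset.get(dataset_id=dataset_id)
    # part is zero-based; num_parts is the total number of parts to split into.
    return [dataset.get_local_copy(part=p, num_parts=num_parts) for p in parts]
```

Inside a PyTorch `DataLoader` worker, `torch.utils.data.get_worker_info()` exposes `id` and `num_workers`, which could be passed to `parts_for_worker` so that each worker only downloads and preprocesses its own share. I haven't benchmarked this, so treat it as a starting point.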

  				
Posted 2 years ago by CostlyOstrich36

Thanks CostlyOstrich36, how do I know how the parts are indexed in the first place? Or rather, how are chunks and parts defined? Say, in the context of images, videos, text documents, etc.

  				
Posted 2 years ago by SubstantialElk6

https://clear.ml/docs/latest/docs/references/sdk/dataset/#get_num_chunks
I think this might also be helpful. Skim through the functions available in the documentation; I think you might find what you're looking for 🙂
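As far as I understand the docs, chunks are the compressed archives created when the dataset is uploaded (their size is set by the `chunk_size` argument of `Dataset.upload`, in MB), so they are not tied to individual images, videos, or documents. One way to see what actually ends up in each part is to download the parts one by one and list their files. This is a rough sketch under the assumption that each part is extracted into its own folder; `list_files` and `inspect_parts` are hypothetical helper names:

```python
from pathlib import Path
from typing import Dict, List


def list_files(folder: str) -> List[str]:
    """Sorted relative paths (POSIX style) of every file under folder."""
    root = Path(folder)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file())


def inspect_parts(dataset_id: str) -> Dict[int, List[str]]:
    """Map each part index to the files found in its local copy."""
    # Imported lazily so list_files can be used without clearml installed.
    from clearml import Dataset

    dataset = Dataset.get(dataset_id=dataset_id)
    num_chunks = dataset.get_num_chunks()
    # One part per chunk; note that this downloads the whole dataset, piece by piece.
    return {
        part: list_files(dataset.get_local_copy(part=part, num_parts=num_chunks))
        for part in range(num_chunks)
    }
```

Running `inspect_parts` once on a small dataset should show how files were grouped at upload time, which is probably the quickest way to confirm the indexing for your own data.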

  				
Posted 2 years ago by CostlyOstrich36

