ClearML FAQ | Hey, Is There Some Way / Workaround To Speed Up Working With Datasets With Large Number Of Files? Getting A Local Copy Of One Of Our Dataset With 70K Files Already Takes Longer Than Expected, But Working With A Dataset Of Around 100K Files That Has Multip

Answered

Hey, is there some way or workaround to speed up working with datasets that have a large number of files? Getting a local copy of one of our datasets with 70k files already takes longer than expected, but working with a dataset of around 100k files that has multiple parents is just unusable. Should we just avoid merging datasets with this many files? The datasets themselves are small; they're just split into a large number of files.

  				
Posted 2 years ago by UpsetCrow72

Answers

Hello, I am a data engineer but new to ClearML.
If you train in batches, you should only need access to the current batch of documents out of those 100k, not all of them at once. You could store the files in S3 and implement the fetch in the get_item method :)
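
A rough sketch of that idea, in case it helps. It assumes "get_item" means the `__getitem__` method of a map-style dataset class (as in PyTorch), and that the files live in an S3 bucket reachable through boto3. The names `LazyFileDataset`, `make_s3_fetch`, and the `fetch` callback are made up for illustration and are not part of ClearML. Each file is downloaded only the first time it is requested and then served from a local cache directory:

```python
import tempfile
from pathlib import Path
from typing import Callable, Sequence


class LazyFileDataset:
    """Map-style dataset that downloads a file only when it is first requested."""

    def __init__(
        self,
        keys: Sequence[str],
        fetch: Callable[[str, Path], None],
        cache_dir: Path,
    ) -> None:
        self.keys = list(keys)            # remote object keys, one per sample
        self.fetch = fetch                # hypothetical callback: (key, destination path)
        self.cache_dir = Path(cache_dir)  # local directory used as a cache

    def __len__(self) -> int:
        return len(self.keys)

    def __getitem__(self, index: int) -> bytes:
        key = self.keys[index]
        local_path = self.cache_dir / key
        if not local_path.exists():
            # First access: create the parent folders and download this one file.
            local_path.parent.mkdir(parents=True, exist_ok=True)
            self.fetch(key, local_path)
        return local_path.read_bytes()


def make_s3_fetch(bucket: str) -> Callable[[str, Path], None]:
    """Build a fetch callback backed by S3 (assumes boto3 is installed and configured)."""
    import boto3  # imported here so the rest of the sketch runs without boto3

    client = boto3.client("s3")

    def fetch(key: str, dest: Path) -> None:
        client.download_file(bucket, key, str(dest))

    return fetch


if __name__ == "__main__":
    # Offline demo with a fake fetcher standing in for S3.
    def fake_fetch(key: str, dest: Path) -> None:
        dest.write_bytes(f"contents of {key}".encode())

    with tempfile.TemporaryDirectory() as tmp:
        dataset = LazyFileDataset(["docs/a.txt", "docs/b.txt"], fake_fetch, Path(tmp))
        print(dataset[0])
```

With a real bucket you would pass `make_s3_fetch("your-bucket-name")` as the `fetch` argument. Wrapping the class in a PyTorch `DataLoader` with several workers would let downloads overlap with training, though whether that beats pulling a full local copy depends on file sizes and network latency.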

  				
Posted 2 years ago by EagerGiraffe33

