Hi All, I Have A Newbie Question About Clear-Ml Data. I Have Four Data Sources That Get Combined To Train A Model. I Have Put Each Of These Datasets Into Clear Ml So That I Can Track Their Versions, And Then Create The Fifth 'Combined' Dataset Using The I

Answered

Hi all,
I have a newbie question about clearml-data. I have four data sources that get combined to train a model. I have put each of these datasets into ClearML so that I can track their versions, and then created a fifth 'combined' dataset using the ids from the others. It works great.

My question is about the correct way to update that fifth dataset if one of the other datasets changes. Say, for example, I create a new version of dataset1: what is the correct method for creating the updated version of combined_dataset? The options I can see are:

1. Use the ids from each of the datasets the same way I did the first time (Dataset.create(..., parent_datasets=[dataset1.id, dataset2.id, ...])). In that case, will this actually be a second version of combined_dataset?
2. Do the same as above, but include the id of the previous version of the combined dataset as a parent as well?
3. Just do Dataset.create(..., parent_datasets=[combined_dataset.id]) and assume that ClearML will take care of the rest?

  				
Posted 2 years ago by PerfectSwan93


Answers 2

Hello @PerfectSwan93, I tend to agree with you: option one is the best given your use case. If you keep the same name and project, it will result in a version bump on the combined dataset, but the new version will not point to the previous combined dataset as a parent.

  				
Posted 2 years ago by EnthusiasticShrimp49

After playing around in a test project, I'm pretty sure option 1 is right. There's no need to include the previous combined dataset's id, because the new version doesn't inherit anything from that dataset. Happy to be corrected though.
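
For anyone landing here later, option 1 can be sketched roughly as below. This is only a minimal sketch under stated assumptions: the helper names `swap_parent` and `create_combined` and the placeholder ids are hypothetical, and only the `Dataset.create(..., parent_datasets=[...])` call comes from the question itself. The `clearml` import sits inside the function so the id bookkeeping can be tried without a configured server.

```python
from typing import List


def swap_parent(parent_ids: List[str], old_id: str, new_id: str) -> List[str]:
    """Return a copy of parent_ids with old_id replaced by new_id (hypothetical helper)."""
    return [new_id if pid == old_id else pid for pid in parent_ids]


def create_combined(parent_ids: List[str], name: str, project: str):
    """Create a new combined dataset whose parents are the given source dataset ids."""
    # Imported here so swap_parent stays usable without clearml installed.
    from clearml import Dataset

    combined = Dataset.create(
        dataset_name=name,        # same name and project as before, so (per the
        dataset_project=project,  # answer above) this shows up as a version bump
        parent_datasets=parent_ids,
    )
    combined.upload()    # no new files are added here; content comes from the parents
    combined.finalize()
    return combined


# Placeholder ids, for illustration only.
old_parents = ["dataset1-v1-id", "dataset2-id", "dataset3-id", "dataset4-id"]
new_parents = swap_parent(old_parents, "dataset1-v1-id", "dataset1-v2-id")
print(new_parents)
```

With a server configured, `create_combined(new_parents, name=..., project=...)` would then be called with the combined dataset's existing name and project. The previous combined dataset's id is deliberately left out of the parent list, matching option 1.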

  				
Posted 2 years ago by PerfectSwan93

