Hi SubstantialElk6
but in terms of data provenance, it's not clear how I can associate the data versions with the processes that created them.
I think DeliciousBluewhale87's approach is what we are aiming for, but with code.
So using clearml-data from the CLI is basically storing/versioning of files (with diff-based storage etc., but still).
What you are after (I think) is to use the programmatic Dataset class in your preprocessing code, i.e. create the Dataset from code. This gives you the storage and versioning capabilities, but also couples the data version with the preprocessing code itself, for provenance and automation.
The base assumption is that a Dataset is always a Task (with artifacts and a fancy interface), but a Task nonetheless. That gives you all the capabilities of a Task, such as adding metrics/stats on the data and automation with pipelines, but also the ability to later retrieve the data with a simple CLI call or from code.
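Very rough sketch of what I mean (project names and paths are just placeholders, adjust to your setup):
```
from clearml import Dataset, Task

# the preprocessing script is a regular ClearML task, so the code (and git commit)
# that produced the data is recorded automatically
task = Task.init(project_name="data-pipeline", task_name="preprocess images")

# ... load the raw images, run your transformation, write the output to ./processed ...

# create a new dataset version from code, right next to the preprocessing logic
dataset = Dataset.create(dataset_project="data-pipeline", dataset_name="images")
dataset.add_files(path="./processed")
dataset.upload()     # push the files to the storage backend
dataset.finalize()   # close this version so it can be used / extended later
```
Since the Dataset is itself a Task, it shows up next to the preprocessing run, and you can later fetch it with Dataset.get() from code or with the clearml-data CLI.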
wdyt?
Hi Erez, I think I would want to reference the code that transformed the data. Take for example: I received 10k images, performed some transformation, and saved the result as a new version before splitting it up for my ML training. Some time later, I receive a new set of 10k images and want to apply the same transformation and then append it to the previous 10k as another version. clearml-data does well for the data-versioning part, but in terms of data provenance, it's not clear how I can associate the data versions with the processes that created them.
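To make the versioning side concrete, the part that already works for me looks roughly like this (project names and paths are made up):
```
from clearml import Dataset

# new batch of 10k images arrives -> run the same transformation, then
# append the result to the previous version as a child dataset
previous = Dataset.get(dataset_project="my-project", dataset_name="images")  # latest version
new_version = Dataset.create(
    dataset_project="my-project",
    dataset_name="images",
    parent_datasets=[previous.id],
)
new_version.add_files(path="./transformed_batch_2")
new_version.upload()
new_version.finalize()
```
What's missing for me is the link from this new version back to the transformation code/run that produced it.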
Hi, a workaround I thought of... Btw, I haven't tried it. AnxiousSeal95, your comments?
1) Attach a clearml-task id to each new dataset-id.
So in the future, when new data comes in, get the last data commit from the Dataset project and get the clearml-task for it, then clone that clearml-task and pass in the new data (rough sketch below). The only downside is the need to clone the clearml-task.
Or alternatively
2) Attach the git sha of the processing code to each new dataset-id.
This won't give the exact code, but at least it records which snapshot of the codebase was used.
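Something like this for (1), assuming each dataset version gets tagged at creation time with the id of the task that produced it. Haven't tried it, and the Args/raw_data_path parameter is just a placeholder for whatever the preprocessing script actually exposes:
```
from clearml import Dataset, Task

# assumes every version was created with something like
#   Dataset.create(..., dataset_tags=["creator_task:" + preprocess_task.id])

# new raw data arrives: grab the last data commit and find the task behind it
latest = Dataset.get(dataset_project="my-project", dataset_name="images")
ds_task = Task.get_task(task_id=latest.id)            # a Dataset is also a Task
creator_tag = [t for t in ds_task.get_tags() if t.startswith("creator_task:")][0]
creator_task_id = creator_tag.split(":", 1)[1]

# clone the preprocessing task that produced that version, point it at the new data, enqueue it
cloned = Task.clone(source_task=creator_task_id)
cloned.set_parameter("Args/raw_data_path", "/data/new_batch")   # placeholder parameter
Task.enqueue(cloned, queue_name="default")
```
This also roughly covers (2), since the cloned task already carries the repo and commit (git sha) of the processing code.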
Hi Jax, I'm working on a few more examples of how to use clearml-data. They should be released in a few weeks (with some other documentation updates). These, however, don't include the use case you're talking about. Would you care to elaborate more on that? Are you looking to store the code that created the data in the execution part of the task that saves the data itself?