Hello all, I wanted to get the advice of the people here about data versioning and tracking using ClearML. Many of the dataset we work with are generated by SQL query. It’s not necessary to generate them every time but I’m trying to get advice on how to

Answered

Hello all,

I wanted to get the advice of the people here about data versioning and tracking using ClearML. Many of the datasets we work with are generated by SQL queries. It’s not necessary to regenerate them every time, but I’m trying to get advice on how to manage data versioning given that the dataset isn’t loaded from a file but generated by a query. Do people typically store the query results for data versioning? What are people’s suggestions/experience doing something similar?

For larger datasets how economical is it to use ClearML vs a cloud storage provider?

  				
Posted 
	2 years ago

					EnthusiasticCow4
				

Answers

Hi @EnthusiasticCow4

> Many of the dataset we work with are generated by SQL query.

The main question in these scenarios is whether the DB is stable.
By that I mean, generally speaking, DBs serve applications, and from time to time they undergo migrations (i.e. changes in schema, more/less data, etc.).
The most stable way is to create a script that runs the SQL query and creates a ClearML Dataset from it (that script becomes part of the Dataset, so you have full traceability).
This means that creating a new Dataset version is basically running this script (or even a pipeline),
and the code itself always interacts with the "frozen" Dataset version.
From a user perspective, this means DB access is limited to the script (a lot less dangerous), the data is immutable (so we are certain nothing changed under our feet), and the data itself is cached (i.e. accessing the same Dataset on the same machine will not need any additional network/compute).
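
To make the idea concrete, here is a rough sketch of what such a script could look like. It is only an illustration under a few assumptions: the table name `samples`, the query, the file names, and the helper names `snapshot_query` / `publish_to_clearml` are all hypothetical, and the connection is any DB-API connection (SQLite is used here only because it ships with Python). Saving the query text next to the data is my own addition so the snapshot stays self-describing.

```python
import csv
import hashlib
import sqlite3  # stand-in DB driver; swap for your own DB-API driver
from pathlib import Path

# Hypothetical query; ORDER BY keeps the row order (and the hash) stable.
QUERY = "SELECT id, label FROM samples ORDER BY id"


def snapshot_query(conn, query, out_dir):
    """Run `query` and freeze the result as CSV, with the query text beside it.

    Returns the SHA-256 hex digest of the CSV so two snapshots can be compared.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    cursor = conn.cursor()
    cursor.execute(query)
    header = [col[0] for col in cursor.description]

    data_path = out_dir / "data.csv"
    with open(data_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(cursor.fetchall())

    # Keep the exact query with the data for traceability.
    (out_dir / "query.sql").write_text(query)
    return hashlib.sha256(data_path.read_bytes()).hexdigest()


def publish_to_clearml(out_dir, project, name, output_url=None):
    """Register the snapshot folder as a new, immutable ClearML Dataset version."""
    from clearml import Dataset  # imported lazily; needs a configured ClearML setup

    dataset = Dataset.create(dataset_project=project, dataset_name=name)
    dataset.add_files(path=str(out_dir))
    # output_url may point at your own object storage bucket (assumed value).
    dataset.upload(output_url=output_url)
    dataset.finalize()  # after this the version can no longer change
    return dataset.id
```

Training code would then never touch the DB and would instead fetch the frozen copy, e.g. with `Dataset.get(dataset_project=..., dataset_name=...).get_local_copy()`, which also gives you the local caching mentioned above.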

> For larger datasets how economical is it to use ClearML vs a cloud storage provider?

Do you mean something like DB as a service, or storing the data on object storage?
FYI: a ClearML Dataset can store the "frozen copy" on your own cloud object storage.

wdyt?

  				
Posted 
	2 years ago

					AgitatedDove14
				
