
Hello all,

I wanted to get the advice of the people here about data versioning and tracking using ClearML. Many of the datasets we work with are generated by SQL queries. It’s not necessary to regenerate them every time, but I’m trying to get advice on how to manage data versioning given that the dataset isn’t loaded from a file but generated by a query. Do people typically store the query results for data versioning? What are people’s suggestions/experiences doing something similar?

For larger datasets, how economical is it to use ClearML vs. a cloud storage provider?

  
  
Posted 12 months ago

Answers


Hi @EnthusiasticCow4

Many of the datasets we work with are generated by SQL queries.

The main question in these scenarios is: are those DBs stable?
By that I mean, generally speaking, DBs serve applications, and from time to time they undergo migrations (i.e. schema changes, more/less data, etc.).
The most stable way is to create a script that runs the SQL query and creates a ClearML Dataset from it (that script becomes part of the Dataset, giving you full traceability).
This means that creating a new Dataset version is basically just running this script (or even a pipeline).
And the code itself always interacts with the "frozen" Dataset version.
This means that, from a user perspective, DB access is limited to the script (a lot less dangerous), the data is immutable (so we are certain nothing changed under our feet), and the data itself is cached (i.e. accessing the same Dataset on the same machine will not need any additional network/compute).
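
For illustration, here's a rough sketch of such a script (the connection string, query, and dataset/file names are just placeholders, and I'm assuming pandas + SQLAlchemy for the query part):

```python
import pandas as pd
from sqlalchemy import create_engine
from clearml import Dataset

# DB access lives only in this script (placeholder connection + query)
engine = create_engine("postgresql://user:pass@db-host/mydb")
df = pd.read_sql("SELECT * FROM transactions", engine)
df.to_parquet("snapshot.parquet")

# Freeze the query result as a new, immutable Dataset version
dataset = Dataset.create(
    dataset_name="transactions-snapshot",
    dataset_project="data-versioning",
)
dataset.add_files("snapshot.parquet")
dataset.upload()    # push the snapshot to the configured storage
dataset.finalize()  # lock this version

# Downstream code only ever sees the frozen (and locally cached) version
local_path = Dataset.get(
    dataset_name="transactions-snapshot",
    dataset_project="data-versioning",
).get_local_copy()
```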

For larger datasets, how economical is it to use ClearML vs. a cloud storage provider?

You mean like a DB as a service? Or storing the data on object storage?
FYI: a ClearML Dataset can store its "frozen copy" on your own cloud object storage.
wdyt?

  
  
Posted 12 months ago