Many of the datasets we work with are generated by SQL queries.
The main question in these scenarios is whether those DBs are stable.
By that I mean: generally speaking, DBs serve applications, and from time to time they undergo migrations (i.e. schema changes, more/less data, etc.).
The most stable approach is to create a script that runs the SQL query and creates a ClearML Dataset from the result (that script becomes part of the Dataset, for full traceability).
This means that creating a new Dataset version is basically running this script (or even a pipeline),
and the code itself always interacts with the "frozen" Dataset version.
From a user perspective this means: DB access is limited to the script (a lot less dangerous), the data is immutable (so we are certain nothing changed under our feet), and the data itself is cached (i.e. accessing the same Dataset again on the same machine needs no additional network/compute).
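A minimal sketch of what that script could look like, assuming the real ClearML `Dataset` API (`Dataset.create` / `add_files` / `upload` / `finalize`); the table, query, project, and dataset names are made up for illustration:

```python
import csv
import sqlite3


def export_query(conn, query, out_path):
    """Run the SQL query and dump the full result set to a CSV file."""
    cur = conn.execute(query)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cur.description])  # column names as header
        writer.writerows(cur.fetchall())
    return out_path


def publish_snapshot(csv_path, project="data", name="users-snapshot"):
    """Register the exported file as a new, frozen ClearML Dataset version."""
    from clearml import Dataset  # imported lazily: needs a configured ClearML server

    ds = Dataset.create(dataset_project=project, dataset_name=name)
    ds.add_files(str(csv_path))
    ds.upload()    # push the frozen copy to storage
    ds.finalize()  # lock the version: from here on it is immutable
    return ds.id
```

Downstream code then never touches the DB; it pulls the frozen version with `Dataset.get(dataset_project=..., dataset_name=...).get_local_copy()`, which also gives you the local caching mentioned above.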
For larger datasets, how economical is it to use ClearML vs a cloud storage provider?
You mean like DB-as-a-service? Or storing the data on object storage?
FYI: a ClearML Dataset can store the "frozen copy" on your own cloud object storage.
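A hedged sketch of that, assuming `upload()` accepts an `output_url` pointing at your bucket (the bucket URL, project, and dataset names here are hypothetical; this needs clearml installed and credentials for that storage configured):

```python
def publish_to_own_bucket(local_dir, bucket="s3://my-bucket/datasets"):
    """Create a Dataset version whose frozen copy lives in your own bucket."""
    from clearml import Dataset  # imported lazily: needs a live ClearML setup

    ds = Dataset.create(dataset_project="data", dataset_name="users-snapshot")
    ds.add_files(local_dir)
    ds.upload(output_url=bucket)  # frozen copy lands in your object storage
    ds.finalize()
    return ds.id
```

Only the metadata goes to the ClearML server; the bytes stay in your bucket, so storage cost is whatever your cloud provider charges.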