- "Transform feature engineering and data processing code into recurring data ingestion workflows. Start building data stores, develop, automate, and schedule complex data processing jobs."
- "Create immutable and differentiable versions on-prem or in the cloud with our data agnostic solution."
- "Share data across R&D teams with searchable data catalogs available on any environment."
The first is probably done using pipeline controllers, and the second using Datasets or HyperDatasets. It's not very clear how the last one is achieved, especially the searchable data catalogs.
Hi SubstantialElk6
Generally speaking, the idea is that actual code creates a Dataset (i.e. the Dataset class is used from code), plus you can add some metric reporting (like table reporting) to create a preview of the stored data for better visibility, or maybe compute some statistics as part of the data ingest script. Then this ingest code can be relaunched / automated. The created Dataset itself can be tagged, renamed, and given key/value metadata for better cataloging. wdyt?
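Something like this, as a minimal sketch (the dataset/project names, folder path, and CSV file are made up for illustration; set_metadata is available in recent clearml releases):
```python
import pandas as pd
from clearml import Dataset

# Create a new dataset version from the ingest code itself
ds = Dataset.create(
    dataset_name="ingest-demo",           # hypothetical name
    dataset_project="datasets/examples",  # hypothetical project
)
ds.add_files(path="./processed_data")     # assumes the ingest script wrote files here

# Optional: report a preview table so the stored data is visible in the UI
df = pd.read_csv("./processed_data/sample.csv")
ds.get_logger().report_table(
    title="preview", series="head", iteration=0, table_plot=df.head()
)

# Tag and attach key/value style metadata for cataloging
ds.add_tags(["ingest", "daily"])
ds.set_metadata(df.describe())  # assumes a recent clearml version

ds.upload()    # upload the files to the configured storage
ds.finalize()  # close this version; it is now immutable
```
For the "relaunched / automated" part, the same script can be cloned and enqueued from the UI, or run on a schedule, e.g. with clearml's TaskScheduler.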
Yeah that'll cover the first two points, but I don't see how it'll end up as a dataset catalogue as advertised.
Creating the Dataset on ClearML is the catalog: you can move datasets around, put them in sub-folders, add tags, add meta-data, search, etc. I think this qualifies as a dataset catalog, no?
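For example, a minimal sketch of searching that catalog from code (project name, partial name, and tags are the hypothetical ones from the ingest sketch above; this assumes list_datasets returns one dict per version with id/name fields, as in recent clearml releases):
```python
from clearml import Dataset

# List dataset versions in the catalog by project / partial name / tags
matches = Dataset.list_datasets(
    dataset_project="datasets/examples",  # hypothetical project
    partial_name="ingest",
    tags=["daily"],
)
for entry in matches:
    print(entry["id"], entry["name"])

# Fetch a specific version and download a local copy of its files
ds = Dataset.get(dataset_project="datasets/examples", dataset_name="ingest-demo")
local_path = ds.get_local_copy()
print(local_path)
```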
I see. Is there a more elaborate code example that demonstrates the interactions described above?