OK, then maybe it can still be used as a data versioning solution, except that I have to manually track the task IDs (of the tasks that generate the artifacts) for versioning myself.
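To make the "track the task id myself" part concrete, here is a minimal sketch of a hand-rolled registry that maps a dataset version label to the ID of the task that produced the artifact. None of this is part of Trains: the file name `dataset_versions.json` and the helpers `register_version` / `lookup_task_id` are made up for illustration.

```python
import json
from pathlib import Path

# Hypothetical registry file; any shared, versioned location would do.
REGISTRY = Path("dataset_versions.json")


def register_version(dataset, version, task_id, registry=REGISTRY):
    """Record which task produced a given version of a dataset."""
    entries = json.loads(registry.read_text()) if registry.exists() else {}
    entries.setdefault(dataset, {})[version] = task_id
    registry.write_text(json.dumps(entries, indent=2, sort_keys=True))


def lookup_task_id(dataset, version, registry=REGISTRY):
    """Return the task ID recorded for a dataset version (KeyError if unknown)."""
    return json.loads(registry.read_text())[dataset][version]
```

Committing the registry file next to the code would keep the code version and the data version tied together, at the cost of updating it by hand after every upload.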
I am interested in machine learning experiment management tools.
I understand Trains already handles a lot of things on the model side, i.e. hyperparameters, logging, metrics, and comparing two experiments.
I also want it to help with reproducibility. To achieve that, I need code, data, and configuration all tracked.
For code and configuration I am happy with the current Trains solution, but I am not sure about the data versioning.
So if you have more details about the dataset versioning with the enterprise offer, I am interested to know more.
EnviousStarfish54 let's refine the discussion: are you looking at structured data (tables etc.) or unstructured data (audio, images etc.)?
Hi EnviousStarfish54
The Enterprise edition extends Trains functionality.
It adds security, scale and full data management (data management and versioning being the key difference).
You can get it as a SaaS solution or on-prem.
If you need more information, you can leave contact details on the website, I'm sure sales will be happy to help :)
EnviousStarfish54 data versioning in the open-source version leverages the artifact, storage and caching capabilities of Trains.
A simple workflow:
- Upload data: https://github.com/allegroai/events/blob/master/odsc20-east/generic/dataset_artifact.py
- Preprocessing data: https://github.com/allegroai/events/blob/master/odsc20-east/generic/process_dataset.py
- Using data: https://github.com/allegroai/events/blob/master/odsc20-east/scikit-learn/sklearn_jupyter.ipynb
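For readers who do not want to open the links, the upload and consume steps of that workflow can be sketched roughly as below. This is a hedged outline, not a copy of the linked scripts: the project name, task name, and the artifact name `"dataset"` are placeholders, and the code assumes the `trains` package is installed and pointed at a reachable Trains server.

```python
def upload_dataset(csv_path, project="datasets", name="upload dataset"):
    """Create a task and attach a local file to it as an artifact."""
    # Imported lazily: the call below needs a configured Trains server.
    from trains import Task

    task = Task.init(project_name=project, task_name=name)
    # "dataset" is an arbitrary artifact name chosen for this sketch.
    task.upload_artifact("dataset", artifact_object=csv_path)
    return task.id  # keep this ID: it identifies this version of the data


def fetch_dataset(task_id):
    """Return a local path to the artifact stored by an earlier task."""
    from trains import Task

    source_task = Task.get_task(task_id=task_id)
    return source_task.artifacts["dataset"].get_local_copy()
```

The two functions are meant to run in separate scripts (one per task): the task ID returned by `upload_dataset` is what the consuming script passes to `fetch_dataset`.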
For the open-source version, if I use an artifact and I already have a local file, does it know to skip the download, or will it always replace the file? My dataset is large (~100 GB), so I cannot afford to have it re-downloaded every time.
Also, while we are at it, EnviousStarfish54, can I just make sure: you meant this page, right?
https://allegro.ai/enterprise/
I wonder what extra features are offered in the enterprise solution, though.
I need to check something for you EnviousStarfish54 , I think one of our upcoming versions should have something to "write home about" in that regard
Do you know what the "dataset management" is for the open-source version?
EnviousStarfish54 that is the intention, it is cached. But you might need to manage your cache settings if you have many of those, since there is an initial sane setting for the cache size. Hope this helps.
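As a rough illustration of what "it is cached" means in practice, here is a generic fetch-once sketch. It is not how Trains implements its cache; `get_local_copy` here is a standalone hypothetical helper, and the cache key is derived only from the source location, so it would not notice if the content behind the same location changed.

```python
import hashlib
import shutil
from pathlib import Path


def get_local_copy(source, cache_dir, fetch=shutil.copyfile):
    """Return a cached copy of `source`, fetching it only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the cache entry on the source location, not on the file content.
    key = hashlib.sha256(str(source).encode("utf-8")).hexdigest()[:16]
    target = cache_dir / f"{key}_{Path(source).name}"
    if not target.exists():
        fetch(source, target)  # the expensive step, skipped on a cache hit
    return target
```

A real cache also needs an eviction policy, which is where the cache size setting mentioned above comes in: with ~100 GB datasets, a cache smaller than the dataset would evict the data and force a fresh download.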
EnviousStarfish54 I recognize this table 😉 I'm glad you are already talking with the right person. I hope you will get all your questions answered.
For the most common workflow, I may have some CSV files, which may be updated from time to time.
Potentially both, but let's just say structured data first, like CSV, pickle (may not be a table, could be any Python object), feather, parquet, and other common data formats.
EnviousStarfish54 first of all, thanks for taking the time to explore our enterprise offering.
- Indeed Trains is completely standalone. The enterprise offering adds the necessary infrastructure for end-to-end integration etc. with a huge emphasis on computer vision related R&D.
- The data versioning is actually more than just data versioning, because it adds an additional abstraction over the "dataset" concept. Well, this is something that the marketing guys should talk about... unless you want to hear more about how I view it, in which case just DM me here or on Twitter https://twitter.com/LSTMeow
GrumpyPenguin23 yes, those features seem to be related to other infrastructure, not Trains (ML experiment management).
AgitatedDove14
Is the data versioning completely different from the Trains artifact/storage solution, or is it an enhanced feature?
As I wrote before, these are more geared towards unstructured data, and since this is a community channel I would feel more comfortable if you continued your conversation with the enterprise rep. If you wish to take this thread to a more private channel, I'm more than willing.