I like the direction this is taking, let me add some more complexity.
My thinking is that if we have “input datasets”, I'd also like to be able to clone the Task and automagically change them (without the need to expose the dataset_id as an argument). Basically I'm thinking:
```python
train = Dataset.get('aabbcc1', name='train')
valid = Dataset.get('aabbcc2', name='validation')
custom = Dataset.get('aabbcc3', name='custom')
```
Then you end up with a HyperParameter section: "Input Datasets":
And then you can clone the Task in the UI, edit the dataset ID, and relaunch it; now, without changing the code, you are changing the dataset your code is using.
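To make the clone-and-edit idea concrete, here is a minimal sketch of the mechanism being proposed. This is not the real ClearML API for this (the feature is only being discussed); `FakeTask`, its `connect` method, and the "Input Datasets" section name are stand-ins that simulate UI overrides with plain dictionaries.

```python
# Hypothetical sketch: dataset IDs exposed as an editable parameter section,
# so a cloned task can point the same code at different data.
# FakeTask and its override mechanism are illustrative assumptions,
# not the real clearml Task implementation.

class FakeTask:
    """Stands in for a task whose parameter section the UI can edit."""
    def __init__(self, ui_overrides=None):
        self._overrides = ui_overrides or {}

    def connect(self, params, name):
        # Values edited in the UI win over the defaults hard-coded in the script
        for key, override in self._overrides.items():
            if key in params:
                params[key] = override
        return params

def run(task):
    # Defaults written in the code; a clone edits them in the UI instead
    datasets = {"train": "aabbcc1", "validation": "aabbcc2", "custom": "aabbcc3"}
    return task.connect(datasets, name="Input Datasets")

original = run(FakeTask())
# Cloned task: the UI override swaps the training data, the code is untouched
clone = run(FakeTask(ui_overrides={"train": "ddeeff9"}))
```

The point of the pattern: the script only ever reads the connected dictionary, so swapping datasets is a UI edit on the clone, not a code change.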
I think that in principle, if you “intercept” the calls to Model.get() or Dataset.get() from within a task, you can collect the IDs and do various stuff with them. You can store and visualize them for lineage, or expose them as another hyper parameter, I suppose.
You’ll just need the user to name them as part of loading them in the code (in case they are loading multiple datasets/models).
CostlyOstrich36 Lineage information for datasets - oversimplifying, but bear with me:
Task should have a section called “input datasets”
each time I do a Dataset.get() inside a current_task, add the dataset ID to this section
Same can work with InputModel()
This way you can have a full lineage graph (also queryable/visualizable)
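One way to see why the “input datasets” section gives you a queryable graph: once every task stores the dataset IDs it consumed, you can invert that mapping into dataset → consuming tasks. A minimal sketch, assuming hypothetical task IDs and section contents (nothing here is a real clearml API):

```python
# Sketch of the lineage idea: each task's "input datasets" section,
# inverted into a graph you can query (which tasks used dataset X?).
# Task IDs and dataset IDs below are made-up examples.

from collections import defaultdict

tasks = {
    "task-1": {"train": "aabbcc1", "validation": "aabbcc2"},
    "task-2": {"train": "aabbcc1", "custom": "aabbcc3"},
}

def build_lineage(task_inputs):
    graph = defaultdict(list)
    for task_id, section in task_inputs.items():
        for _name, dataset_id in sorted(section.items()):
            graph[dataset_id].append(task_id)
    return graph

lineage = build_lineage(tasks)
consumers = lineage["aabbcc1"]  # every task that loaded dataset aabbcc1
```

The same inverted index is what a UI would walk to render the lineage graph, and InputModel IDs could be folded into the same structure.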
RoughTiger69 , regarding the dataset loading, we are actually thinking of adding it as another "hyper parameter" section, and the idea has come up a few times in the last month, so we should definitely do that. The question is how do we support multiple entries (i.e. two datasets loaded)? Should we force users to "name" the dataset when they "get" it?
Regarding cloning, we had a lot of internal discussions on it. "Parent" is a field on a Task, so the information can be easily stored; the question is always, is a clone a child version of the parent? What happens if the parent has its own parent, are they siblings now? wdyt?