Sure, but was wondering if it has more of a “first class citizen” status for tracking… e.g. something you can visualize in the UI or query via API
I mean, if it’s not tracked, I think it would be a good feature!
Re. “which task did I clone from” - to my understanding the “parent” field is used for the “runtime parent”, i.e. what task started me.
This is not the same as “which task was I cloned from”
I think that in principle, if you “intercept” the calls to Model.get() or Dataset.get() from within a task, you can collect the IDs and do various stuff with them. You can store and visualize them for lineage, or expose them as another hyper parameter I suppose.
You’ll just need the user to name them as part of loading them in the code (in case they are loading multiple datasets/models).
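A rough sketch of that "intercept and name" idea, just to make it concrete. The `Dataset` class here is a hypothetical stand-in and not the real ClearML class, `tracked_get` and `input_datasets` are made-up names, and the IDs are placeholders:

```python
class Dataset:
    """Hypothetical stand-in for a dataset class; not the real ClearML API."""

    def __init__(self, dataset_id):
        self.id = dataset_id

    @classmethod
    def get(cls, dataset_id):
        return cls(dataset_id)


# name -> dataset ID, collected for the current task (hypothetical registry)
input_datasets = {}


def tracked_get(dataset_id, name):
    """Load a dataset and record its ID under a user-chosen name."""
    if name in input_datasets and input_datasets[name] != dataset_id:
        raise ValueError(f"name {name!r} is already used for another dataset")
    dataset = Dataset.get(dataset_id)
    input_datasets[name] = dataset.id
    return dataset


train = tracked_get("aabbcc1", name="train")
valid = tracked_get("aabbcc2", name="validation")
print(input_datasets)
```

The name is what keeps two datasets loaded in the same task apart, so the recorded mapping can later be stored or shown per task.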
Exactly! (and yes UI visualization is coming 🙂 )
RoughTiger69 So basically (if I follow your example), the question is whether ClearML "knows" that "Task B" is a clone of "Task A"?
And if the loaded Dataset Y, is somehow registered on Task X?
Is that correct?
Hi RoughTiger69
I like the direction this is taking, let me add some more complexity.
My thinking is that if we have “input datasets”, I'd also like to be able to clone the Task and automagically change them (without the need to export the dataset_id as an argument). Basically I'm thinking:
train = Dataset.get('aabbcc1', name='train')
valid = Dataset.get('aabbcc2', name='validation')
custom = Dataset.get('aabbcc3', name='custom')
Then you end up with a HyperParameter section “Input Datasets”:
train: aabbcc1
validation: aabbcc2
custom: aabbcc3
And then you can clone the Task in the UI, and edit the dataset ID and relaunch it, when now (without changing the code) you are changing the dataset your code is using.
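A minimal sketch of how that override could behave, assuming the edited “Input Datasets” section reaches the code as a plain dict. The `Dataset` stand-in, `get_input_dataset`, the `overrides` argument and the ID `ddeeff9` are all hypothetical, not existing ClearML API:

```python
class Dataset:
    """Hypothetical stand-in for a dataset class; not the real ClearML API."""

    def __init__(self, dataset_id):
        self.id = dataset_id

    @classmethod
    def get(cls, dataset_id):
        return cls(dataset_id)


def get_input_dataset(default_id, name, overrides):
    """Return the dataset for `name`, preferring an ID edited on the cloned task."""
    dataset_id = overrides.get(name, default_id)
    return Dataset.get(dataset_id)


# Original run: nothing was edited, so the ID written in the code is used.
original = get_input_dataset("aabbcc1", "train", overrides={})

# Cloned run: "train" was edited, so the code loads the new dataset unchanged.
cloned = get_input_dataset("aabbcc1", "train", overrides={"train": "ddeeff9"})

print(original.id, cloned.id)
```

The ID in the code acts only as a default, which is what lets the clone switch datasets without a code change.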
wdyt?
RoughTiger69, regarding the dataset loading, we are actually thinking of adding it as another "hyper parameter" section. The idea came up a few times in the last month, so we should definitely do that. The question is how do we support multiple entries (i.e. two datasets loaded)? Should we force users to "name" the dataset when they "get" it?
Regarding cloning, we had a lot of internal discussions on it. "Parent" is a field on a Task, so the information can be easily stored. The question is always: is a clone a child version of the parent? What happens if the parent has its own parent, are they siblings now? wdyt?
Task B is a clone of Task A. Does B store the information that it was cloned from A somewhere?
You can add any user properties you like to any task, so maybe “origin” : <task_id> will do the work?
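To illustrate keeping “origin” apart from the runtime parent, here is a small in-memory sketch. `TaskRecord`, `clone` and `clone_chain` are hypothetical names for illustration only and do not touch ClearML itself:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TaskRecord:
    """Hypothetical in-memory task record; not a ClearML class."""
    task_id: str
    parent: Optional[str] = None  # runtime parent: the task that started this one
    user_properties: Dict[str, str] = field(default_factory=dict)


def clone(source: TaskRecord, new_id: str) -> TaskRecord:
    """Copy a task and record where it was cloned from in an "origin" property."""
    properties = {**source.user_properties, "origin": source.task_id}
    return TaskRecord(task_id=new_id, user_properties=properties)


def clone_chain(task_id: str, registry: Dict[str, TaskRecord]) -> List[str]:
    """Follow "origin" back to the first task, returning the ancestor IDs."""
    chain: List[str] = []
    current = registry[task_id].user_properties.get("origin")
    while current is not None and current not in chain:
        chain.append(current)
        current = registry[current].user_properties.get("origin")
    return chain


task_a = TaskRecord("A")
task_b = clone(task_a, "B")
task_c = clone(task_b, "C")
registry = {t.task_id: t for t in (task_a, task_b, task_c)}
print(clone_chain("C", registry))
```

Because the clone source lives in its own property, a clone of a clone stays a simple chain and the "parent" field remains free for "what task started me".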
CostlyOstrich36 Lineage information for datasets - oversimplifying but bear with me:
Task should have a section called “input datasets”
each time I do a Dataset.get() inside a current_task, add the dataset ID to this section
Same can work with InputModel()
This way you can have a full lineage graph (also queryable/visualizable)
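As a sketch of what "queryable" could mean here, assuming each task's “input datasets” section can be read as a name-to-ID mapping. The task IDs, the `task_inputs` data and `consumers_by_dataset` are made up for illustration:

```python
from collections import defaultdict

# task ID -> {name: dataset ID}, as each task's "input datasets" section might look
task_inputs = {
    "task_x": {"train": "aabbcc1", "validation": "aabbcc2"},
    "task_y": {"train": "aabbcc1"},
}


def consumers_by_dataset(task_inputs):
    """Build a reverse index: dataset ID -> set of task IDs that loaded it."""
    index = defaultdict(set)
    for task_id, inputs in task_inputs.items():
        for dataset_id in inputs.values():
            index[dataset_id].add(task_id)
    return index


index = consumers_by_dataset(task_inputs)
print(sorted(index["aabbcc1"]))
```

The same per-task records answer both directions: which datasets a task used, and which tasks used a given dataset.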