Two Simple Lineage Related Questions:

Answered

Two simple lineage-related questions:
1. Task B is a clone of Task A. Does B store the information that it was cloned from A somewhere?
2. Training task X loads Dataset Y using ds = Dataset.get(dataset_id) followed by ds.get_local_copy(). Does http://clear.ml understand this as a dependency and track it as some sort of lineage? Or do I need to report it somehow for the info to show up?

  				
Posted 3 years ago by RoughTiger69


Answers 14

so I think it will just be confusing

  				
Posted 3 years ago by RoughTiger69

Task B is a clone of Task A. Does B store the information that it was cloned from A somewhere?

You can add any user properties you like to any task, so maybe “origin” : <task_id> will do the work?
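As a rough sketch of this suggestion, the helper below clones a task and stores the source ID in an "origin" user property. It assumes the ClearML SDK's `Task.clone` and `Task.set_user_properties`; the helper name `clone_with_origin` and the property name are made up for illustration, and the task class is passed in as an argument so the sketch can be read and tried without a server.

```python
def clone_with_origin(task_cls, source_task_id, name=None):
    """Clone a task and tag the clone with the ID it was cloned from.

    task_cls is expected to behave like clearml.Task (assumption):
    a clone() classmethod and a set_user_properties() method.
    """
    cloned = task_cls.clone(source_task=source_task_id, name=name)
    # "origin" is an arbitrary property name chosen for this sketch.
    cloned.set_user_properties(origin=source_task_id)
    return cloned
```

With ClearML installed and configured, this would presumably be called as `clone_with_origin(Task, "<task_id>")` after `from clearml import Task`.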

  				
Posted 3 years ago by TimelyPenguin76 (Administrator)

I mean, if it’s not tracked, I think it would be a good feature!

  				
Posted 3 years ago by RoughTiger69

Hi RoughTiger69
I like the direction this is taking, let me add some more complexity.
My thinking is that if we have “input datasets”, I'd also like to be able to clone the Task and automagically change them (with the need to export the dataset_id as an argument). Basically I'm thinking:
train = Dataset.get('aabbcc1', name='train')
valid = Dataset.get('aabbcc2', name='validation')
custom = Dataset.get('aabbcc3', name='custom')
Then you end up with a HyperParameter section “Input Datasets”:
train: aabbcc1
validation: aabbcc2
custom: aabbcc3
And then you can clone the Task in the UI, edit the dataset ID, and relaunch it, so that (without changing the code) you change the dataset your code is using.
wdyt?
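One way to approximate this idea is to connect a dictionary of dataset IDs as its own configuration section. This is a minimal sketch, assuming `Task.connect(mutable, name=...)` from the ClearML SDK, which registers the dictionary under the given section and, when the task is executed by an agent, returns it with any values edited in the UI. The function name `resolve_input_datasets` and the IDs are hypothetical.

```python
def resolve_input_datasets(task, defaults, section="Input Datasets"):
    """Expose dataset IDs as an editable section and return the effective IDs.

    task is expected to behave like a clearml.Task (assumption): connect()
    returns the connected dict, possibly with values overridden from the UI.
    """
    # Work on a copy so the caller's defaults are never modified.
    effective = task.connect(dict(defaults), name=section)
    return dict(effective)
```

In a training script this might be used as `ids = resolve_input_datasets(Task.current_task(), {"train": "aabbcc1"})` followed by `Dataset.get(dataset_id=ids["train"])`, so a cloned task picks up whatever ID was edited in the UI.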

  				
Posted 3 years ago by AgitatedDove14

I think that in principle, if you “intercept” the calls to Model.get() or Dataset.get() from within a task, you can collect the IDs and do various things with them. You can store and visualize them for lineage, or expose them as another hyperparameter, I suppose.

You’ll just need the user to name them as part of loading them in the code (in case they are loading multiple datasets/models).
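The interception idea could look roughly like the wrapper below. It is a sketch only: `record_inputs` is a hypothetical helper, not part of ClearML. It wraps any getter function that takes an ID as its first argument and requires the caller to pass a name for each loaded object.

```python
import functools


def record_inputs(getter, registry):
    """Wrap getter so that each call stores name -> ID in registry."""

    @functools.wraps(getter)
    def wrapper(object_id, *args, name, **kwargs):
        # Refuse to silently reuse a name for a different object.
        if name in registry and registry[name] != object_id:
            raise ValueError(
                f"input name {name!r} already used for {registry[name]}"
            )
        registry[name] = object_id
        return getter(object_id, *args, **kwargs)

    return wrapper
```

For example, `get_dataset = record_inputs(Dataset.get, inputs)` followed by `get_dataset("aabbcc1", name="train")` would leave `inputs` holding the name-to-ID mapping, which can then be stored on the task in whatever form is preferred.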

  				
Posted 3 years ago by RoughTiger69

yep

  				
Posted 3 years ago by RoughTiger69

Re. “which task did I clone from”: to my understanding, the “parent” field is used for the “runtime parent”, i.e. which task started me.
This is not the same as “which task was I cloned from”.

  				
Posted 3 years ago by RoughTiger69

CostlyOstrich36 Lineage information for datasets (oversimplifying, but bear with me):
A Task should have a section called “input datasets”.
Each time I do a Dataset.get() inside a current_task, add the dataset ID to this section.

Same can work with InputModel()

This way you can have a full lineage graph (also queryable/visualizable)
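To illustrate why recording inputs per task makes lineage queryable, here is a small self-contained sketch. It does not use the ClearML API; the two dictionaries stand in for whatever an “input datasets” section and a “produced by” record would hold, and all IDs are invented.

```python
def upstream(node, inputs_of, produced_by):
    """Return every task and dataset ID that node depends on, transitively.

    inputs_of:   task ID -> list of dataset/model IDs it loaded
    produced_by: dataset/model ID -> task ID that created it
    """
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        # A task points to its inputs; an artifact points to its producer.
        parents = list(inputs_of.get(current, []))
        if current in produced_by:
            parents.append(produced_by[current])
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Given `inputs_of = {"task_x": ["dataset_y"], "task_prep": ["dataset_raw"]}` and `produced_by = {"dataset_y": "task_prep"}`, calling `upstream("task_x", inputs_of, produced_by)` walks back through the dataset, the task that produced it, and that task's own input.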

  				
Posted 3 years ago by RoughTiger69

Sure, but was wondering if it has more of a “first class citizen” status for tracking… e.g. something you can visualize in the UI or query via API

  				
Posted 3 years ago by RoughTiger69

RoughTiger69 So basically (if I follow your example), the question is whether ClearML "knows" that "Task B" is a clone of "Task A"?
And whether the loaded Dataset Y is somehow registered on Task X?
Is that correct?

  				
Posted 3 years ago by CostlyOstrich36

RoughTiger69, regarding the dataset loading, we are actually thinking of adding it as another "hyper parameter" section, and I think the idea came up a few times in the last month, so we should definitely do that. The question is how do we support multiple entries (i.e. two datasets loaded)? Should we force users to "name" the dataset when they "get it"?

Regarding cloning, we had a lot of internal discussions on it. "Parent" is a field on a Task, so the information can be easily stored. The question is always: is a clone a child version of the parent? What happens if the parent has its own parent, are they siblings now? wdyt?

  				
Posted 3 years ago by CostlyOstrich36

You’ll just need the user to name them as part of loading them in the code (in case they are loading multiple datasets/models).

Exactly! (and yes UI visualization is coming 🙂 )

  				
Posted 3 years ago by AgitatedDove14

👍

  				
Posted 3 years ago by RoughTiger69

RoughTiger69 thanks for the input 🙂

  				
Posted 3 years ago by CostlyOstrich36
