Hello. I Have Several Questions Regarding The Pipeline Components Of Clearml. I Have Read The Docs, But I Still Don'T Have A Clear Picture Of The Interplay Between Them. As I Know A Little Bit Better Luigi And Kedro, I Will Try To Explain How Are They Rel

Answered

Hello. I have several questions regarding the pipeline components of ClearML. I have read the docs, but I still don't have a clear picture of the interplay between them. As I know a little bit better Luigi and Kedro, I will try to explain how are they related and, if someone that knows ClearML could explain how they relate to the ClearML infraestructre, that would be very nice.

Let's start from the definitions. A pipeline is a DAG, in which:
Each node is a task. A task has inputs, made some operations on the inputs and generates outputs. A task can have parameters, that may modify the behaviour of the task. There are dependences among the tasks, that are represented as the arrows in the DAG.
How does Luigi solves that?

Well, Tasks are classes that implement three methods:
Requires, to set explicitly the dependences on previous tasks. Run, that is the method in which the task does its operations. Output, that is the method in which the task saves every output to be saved (let's think files in a filesystem). And also tasks have parameters, that are special classes of Luigi that under the hood behaves as kwargs of the init method of the class.

Inputs and outputs are instances of a class called Target. A Target is a class that implements, among others, two methods:
Save, to save whatever to disk. Open, to open whatever.
Luigi handles parameters very well in my opinion. It does several interesting things:
Each task has a signature, that is a hash of its parameters. So two instances of the same Task, with the same parameters, are "like" the same instance. It's similar to the singleton design pattern. The name of the outputs of a task can easily be made dependent on the signature. So, for example, if you had a Task that applies PCA to a dataframe, having the number of components as a parameter, then, if you run the task with n_components=10, and then with n_components=20, the signatures of those two tasks will be different, and so the outputs will be (if wanted) named different, and there will be no collisions. The parameters are inherited, so that if a task depends on previous tasks, it will inherit (in general, it can be avoided if needed) the parameters of all previous tasks that depends on it. If a task has its outputs generated, then it won't run, but it will pass the output to the next task. By doing so, re-computations are avoided (it is not that clever in the sense that it does not take into account data and code versioning...).
So, to make this more understable, let's think in a sequence of two tasks:
A -> B A generates an output oA, B reads oA and transforms it to oB A has a parameter n, with default value 0, and B has a parameter m, with default value 1.
Let's suppose they have not run yet. If I write this:

b = B()
b.run()

It will first run A with n=0, because B depends on A and A with n=0 has not run. Then it will run B with m=1.

Then if I write:

b = B(n=1)
b.run()

It will run A with n=1, because it has not run before. And then, it will run B with m=1, because, although it was run before, the input is now different because A has run with a different parameter. In this way, Luigi does it very good.

Finally:

b = B(n=1, m=2)

It won't run A with n=1, because it previously was run (outputs are persistent). But it will run B with parameter m=2.

I think is not a bad handling of dependences, inputs, outputs and parameters, although as I said it has its limitations. Mainly, no versioning of data and code.

How does Kedro solves that?

I don't want to make this too long to read. Basically there are nodes (like tasks), pipelines and a catalog of datasets. Nodes are pure functions, with inputs and outputs, that can depend on parameters (they are other inputs). The main difference is that Kedro decouples the node from the read/write operation, so to speak, and from the dependence. The node is only a function. It is in the pipeline definition where you specify what nodes you have, what elements of the catalog will be mapped to their inputs and outputs and that's it. In fact, de dependences are implicit; Kedro deduces them from the relationship between inputs and outputs. The catalog is like a dictionary: each dataset has a name, and the name is the key in the catalog, where the value is the actual instance of the object (datasets can be dataframes, files, images, models...).

So I find this decoupling valuable, as nodes are more reusable, and your project is, I think, better structured. However, there are benefits of Luigi that are lost. Mainly, the "signature", to make different parameters configuration have different outputs, and the thing with the reruning (kedro does run the pipeline from the beginning, wether the objects of each node are generated or not). It is possible, using Kedro Hooks, to adapt the behaviour to how Luigi operates (I have done it).

I like that Kedro has a lot of structure in the yaml files where you define the parameters, the configuration of the project and the catalog of datasets. This makes possible to have a nice visualization of the pipelines.

And pretty much that's it. How does ClearML handle all of this? Tasks, parameters, inputs, outputs, datasets, reruns... I want to understand it better!

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ShinyWhale52
				
					0
					 × 1

Votes Newest

Answers 6

Thank you so much, Martin. You are really nice. 🙂

One of them told me they translated complete pipelines of Luigi to tasks in ClearML. It can be a way of working, but you lack the flexibility of running tasks by themselves.

In my case, I need to understand ClearML better in order to make a decision (I mean whether using ClearML with another tool for designing pipelines or not).

And regarding the possible approach, I will say the same: I will try to understand ClearML better and then maybe I can articulate exactly the need.

In plain english, I would say: as long as data+code+parameters are versioned, let only rerun what really needs to be rerun, and save as much computing time as possible. A task in a pipeline (or by itself) should be rerun only if:
Input data has changed. Code has changed. Parameters have changed. Output does not exist. Note that "input data has changed" takes into account dependences: in a way it is a recursive check, that should be evaluated with care at the time of pipeline definition (or else be checked dynamically, in the course of the run). Well, I don't know if this makes sense for you... This is how I think pipelines should work, ideally, but I am not an expert at all!

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ShinyWhale52
				
					0
					 × 1

Thank you very much, Martin. Yes, it makes sense and I can see this approach has also some benefits when compared to Kedro (for example, if you want to run only one node, you'd have to create one pipeline having only this node, and define inputs and outputs...). This kind of information is very valuable! And regarding the other stuff of Luigi I mentioned, how does ClearML address that? I mean: if the same Task is run with different parameters, do their artifacts collide or not? And on rerun of a pipeline, all tasks inside the pipeline are run again, or only the ones that should (because of changes in data, in code or in parameters...)?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ShinyWhale52
				
					0
					 × 1

If the same Task is run with different parameters...

ShinyWhale52 sorry, I kind of missed that in the explanation
The pipeline will always* create a new copy (clone) of the original Task (step), then modify the step's inputs etc.
The idea is that you have the experiment management (read execution management) to create full transparancy into the pipelines and steps. Think of it as the missing part in a lot of pipelines platforms where after you executed the pipeline you need to further analyze the results compare to previous pipelines etc. Since ClearML has a built-in experiment manager we just use the same UI for that, meaning every Task in the pipeline is an "experiment" with full logging inputs outputs etc, meaning you can compare two pipeline steps and have the UI present the exact difference in configurations / inputs and results. It also allows you to manage the pipelines post execution, e.g. rename them move them into dedicated folders/projects, add tags etc. This means they are also searchable from the UI of programmatically 🙂

I think the comparison to Luigi (and Kedro) is a very interesting idea, since they present two different levels of automation.
It could be nice is to think of a scenario where they (Luigi / Kedro) could be combined with ClearML to offer the benefit of both worlds. wdyt?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Thanks a lot. Yes, in this week I have met at least two people combining ClearML with other tools (one with Kedro and the other with luigi). In the beginning, I would rather prefer sticking with ClearML alone, so that I won't need to learn more than one tool. But I don't discard trying this integration in the future if I find some benefits.

I am sorry but I did not fully understand your answer. Well, from what you say it seems that everything is very flexible and programmable, which is something that I like a lot!! But I have the remaining doubt of the skipping steps on reruns. Let's think we have a pipeline composed by taskA and taskB. TaskB takes the output of taskA to do further transformations. If I run the pipeline twice, changing only parameters or code of taskB, will taskA be run again or not? And if the default behaviour is to run taskA, do you think it will be easy to modify programmatically the behaviour, by extending the pipeline class for example? When tasks are long to compute I find this very convenient...

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					ShinyWhale52
				
					0
					 × 1

in this week I have met at least two people combining ClearML with other tools (one with Kedro and the other with luigi)

I would love to hear how/what is the use case 🙂

If I run the pipeline twice, changing only parameters or code of taskB, ...

I'll start at the end, yes you can clone a pipeline in the UI (or from code) and instruct it to reuse previous runs.
Let's take our A+B example, Let's say I have a pipeline P, and it executed A and then B (which relies on A's outputs).
Now I have B' (Same B with newer code, for example), I can clone the orinial Pipeline execution P, and set it to "continue_pipeline: True" so it will reuse previously executed Tasks (i.e. plug them) and only run the New Tasks.
Does that make sense?

do you think it will be easy to modify programmatically the behaviour, by extending the pipeline class for example?

Sure, that is is the intention, and if this is something you think will be useful I'm all for PR-ing such features.
I'm assuming what we are saying is, the "add_step" is checking weather the step was already executed (not sure how exactly, of what i the logic), then if it is already "executed", plugin the Task.id of the executed Task. Is this what you had in mind?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Hi ShinyWhale52
Luigi's approach is basically an extension of a functional dag, where each node is a single function. Let's think of Kedro as extension of this approach.
With both the assumption is that a node is a single function (sometimes it really is) and we just want to create a meta execution path (i.e. the execution dag, quite similar to TF v1).
ClearML pipelines are a different story (in a way).
The main difference is that with ClearML each node is a Task, not a function. That means we assume it has an entire setup that needs to be created/stored, it has configuration and parameters, and it's output is stored as artifacts of the execution.
As a derivative each node is a stand-alone process, that already exists in the system (think debugging session or writing the code as the creating process of the Task). A pipeline is only responsible to create copies of the original Tasks and pass parameters between one to another (I'm a bit oversimplifying to make a point)
The underlying assumption is that each task is not seconds long but minutes and some time hours long.
Does that make sense to you?

  				
Posted 
	4 years ago

					More
				  		
  Report
		
					AgitatedDove14
				
					0
					 × 1

Write your answer

2K Views

6 Answers

4 years ago

2 years ago