Thank you so much, Martin. You are really nice. 🙂
One of them told me they translated complete pipelines of Luigi to tasks in ClearML. It can be a way of working, but you lack the flexibility of running tasks by themselves.
In my case, I need to understand ClearML better in order to make a decision (I mean whether or not to use ClearML together with another tool for designing pipelines).
And regarding the possible approach, I will say the same: I will try to understand ClearML better and then maybe I can articulate exactly the need.
In plain English, I would say: as long as data+code+parameters are versioned, let's only rerun what really needs to be rerun, and save as much computing time as possible. A task in a pipeline (or by itself) should be rerun only if:
Input data has changed. Code has changed. Parameters have changed. Output does not exist. Note that "input data has changed" takes dependencies into account: in a way it is a recursive check that should be evaluated with care at the time of pipeline definition (or else checked dynamically, in the course of the run). Well, I don't know if this makes sense to you... This is how I think pipelines should work, ideally, but I am not an expert at all!
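The rerun rule above can be sketched in plain Python. This is only an illustration under assumptions, not how ClearML decides anything: `fingerprint` and `should_rerun` are hypothetical names, and code is represented as a plain string. Each step's fingerprint hashes its code, its parameters, and the fingerprints of its upstream steps, which is what makes the "input data has changed" check recursive.

```python
import hashlib
import json
from typing import List, Optional


def fingerprint(code: str, params: dict, upstream: List[str]) -> str:
    """Hash a step's code, parameters, and the fingerprints of its inputs."""
    payload = json.dumps(
        {"code": code, "params": params, "upstream": upstream},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_rerun(current: str, previous: Optional[str], output_exists: bool) -> bool:
    """Rerun if the step never ran, anything feeding it changed, or its output is gone."""
    return previous is None or current != previous or not output_exists


# First run: taskA feeds taskB.
a_v1 = fingerprint("def a(): ...", {"rows": 100}, [])
b_v1 = fingerprint("def b(x): ...", {"lr": 0.1}, [a_v1])

# Second run: only taskB's parameters change.
a_v2 = fingerprint("def a(): ...", {"rows": 100}, [])
b_v2 = fingerprint("def b(x): ...", {"lr": 0.01}, [a_v2])

print(should_rerun(a_v2, a_v1, output_exists=True))  # False: taskA can be skipped
print(should_rerun(b_v2, b_v1, output_exists=True))  # True: taskB must run again
```

Because taskB's fingerprint includes taskA's, a change to taskA would also change taskB's fingerprint and force both to rerun.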
in this week I have met at least two people combining ClearML with other tools (one with Kedro and the other with luigi)
I would love to hear how/what is the use case 🙂
If I run the pipeline twice, changing only parameters or code of taskB, ...
I'll start at the end, yes you can clone a pipeline in the UI (or from code) and instruct it to reuse previous runs.
Let's take our A+B example. Let's say I have a pipeline P, and it executed A and then B (which relies on A's outputs).
Now I have B' (same B with newer code, for example). I can clone the original pipeline execution P and set it to "continue_pipeline: True", so it will reuse previously executed Tasks (i.e. plug them in) and only run the new Tasks.
Does that make sense?
do you think it will be easy to modify programmatically the behaviour, by extending the pipeline class for example?
Sure, that is the intention, and if this is something you think will be useful I'm all for PR-ing such features.
I'm assuming what we are saying is that "add_step" checks whether the step was already executed (not sure how exactly, or what the logic is), and then, if it is already "executed", plugs in the Task.id of the executed Task. Is this what you had in mind?
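A toy model of that idea, to make the discussion concrete. Everything here is hypothetical: `ToyPipeline`, its `add_step`, and the `cache_key` argument are stand-ins invented for illustration, not the ClearML `PipelineController` API, and the "already executed" logic is reduced to a dictionary lookup.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyPipeline:
    # Maps a cache key to the id of the Task that already ran for it.
    # Passing the same dict to several pipelines lets later runs see earlier ones.
    executed: Dict[str, str] = field(default_factory=dict)
    # (step name, task id, reused?) for every step added to this pipeline.
    steps: List[Tuple[str, str, bool]] = field(default_factory=list)

    def add_step(self, name: str, cache_key: str) -> str:
        """Plug in an existing Task id when the key is known, else register a new one."""
        reused = cache_key in self.executed
        if not reused:
            self.executed[cache_key] = f"task-{len(self.executed) + 1}"
        task_id = self.executed[cache_key]
        self.steps.append((name, task_id, reused))
        return task_id


store: Dict[str, str] = {}

first = ToyPipeline(executed=store)
first.add_step("A", cache_key="A:code-v1")
first.add_step("B", cache_key="B:code-v1")

# Second run: A is unchanged, B has newer code (B').
second = ToyPipeline(executed=store)
second.add_step("A", cache_key="A:code-v1")
second.add_step("B", cache_key="B:code-v2")

print(second.steps)
# [('A', 'task-1', True), ('B', 'task-3', False)]
```

How the cache key is built (code, parameters, input data) is exactly the open question in the message above; the sketch only shows where such a check would plug in.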
Thanks a lot. Yes, in this week I have met at least two people combining ClearML with other tools (one with Kedro and the other with luigi). In the beginning, I would prefer to stick with ClearML alone, so that I won't need to learn more than one tool. But I don't rule out trying this integration in the future if I find some benefits.
I am sorry but I did not fully understand your answer. Well, from what you say it seems that everything is very flexible and programmable, which is something that I like a lot!! But I still have a doubt about skipping steps on reruns. Let's say we have a pipeline composed of taskA and taskB. TaskB takes the output of taskA to do further transformations. If I run the pipeline twice, changing only parameters or code of taskB, will taskA be run again or not? And if the default behaviour is to run taskA, do you think it will be easy to modify programmatically the behaviour, by extending the pipeline class for example? When tasks take a long time to compute I find this very convenient...
If the same Task is run with different parameters...
ShinyWhale52 sorry, I kind of missed that in the explanation
The pipeline will always* create a new copy (clone) of the original Task (step), then modify the step's inputs etc.
The idea is that you have the experiment management (read: execution management) to create full transparency into the pipelines and steps. Think of it as the missing part in a lot of pipeline platforms, where after you execute the pipeline you need to further analyze the results, compare to previous pipelines, etc. Since ClearML has a built-in experiment manager, we just use the same UI for that, meaning every Task in the pipeline is an "experiment" with full logging, inputs, outputs, etc., so you can compare two pipeline steps and have the UI present the exact difference in configurations / inputs and results. It also allows you to manage the pipelines post execution, e.g. rename them, move them into dedicated folders/projects, add tags, etc. This means they are also searchable from the UI or programmatically 🙂
I think the comparison to Luigi (and Kedro) is a very interesting idea, since they present two different levels of automation.
It could be nice to think of a scenario where they (Luigi / Kedro) could be combined with ClearML to offer the best of both worlds. wdyt?
Thank you very much, Martin. Yes, it makes sense and I can see this approach also has some benefits when compared to Kedro (for example, if you want to run only one node, you'd have to create one pipeline having only this node, and define inputs and outputs...). This kind of information is very valuable! And regarding the other stuff of Luigi I mentioned, how does ClearML address that? I mean: if the same Task is run with different parameters, do their artifacts collide or not? And on rerun of a pipeline, are all tasks inside the pipeline run again, or only the ones that should be (because of changes in data, in code or in parameters...)?
Hi ShinyWhale52
Luigi's approach is basically an extension of a functional DAG, where each node is a single function. Let's think of Kedro as an extension of this approach.
With both, the assumption is that a node is a single function (sometimes it really is) and we just want to create a meta execution path (i.e. the execution DAG, quite similar to TF v1).
ClearML pipelines are a different story (in a way).
The main difference is that with ClearML each node is a Task, not a function. That means we assume it has an entire setup that needs to be created/stored, it has configuration and parameters, and its output is stored as artifacts of the execution.
As a consequence, each node is a stand-alone process that already exists in the system (think of a debugging session or writing the code as the process that creates the Task). A pipeline is only responsible for creating copies of the original Tasks and passing parameters from one to another (I'm oversimplifying a bit to make a point).
The underlying assumption is that each task is not seconds long but minutes and sometimes hours long.
Does that make sense to you?
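The "clone the original Task, override its parameters, pass outputs along" idea can be illustrated with a small self-contained model. This is a sketch under assumptions, not ClearML code: `ToyTask`, `clone`, and `execute` are hypothetical names, and artifacts are kept in a plain dictionary.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Callable, Dict


@dataclass
class ToyTask:
    name: str
    run: Callable[[Dict[str, Any]], Dict[str, Any]]  # the Task's code
    params: Dict[str, Any] = field(default_factory=dict)
    artifacts: Dict[str, Any] = field(default_factory=dict)

    def clone(self, **overrides: Any) -> "ToyTask":
        """Copy the template with overridden parameters and no artifacts."""
        return replace(
            self,
            name=f"{self.name} (clone)",
            params={**self.params, **overrides},
            artifacts={},
        )

    def execute(self) -> Dict[str, Any]:
        """Run the Task's code on its parameters and store the result as artifacts."""
        self.artifacts = self.run(self.params)
        return self.artifacts


# Templates: Tasks that already exist in the system.
template_a = ToyTask("A", run=lambda p: {"data": list(range(p["n"]))}, params={"n": 3})
template_b = ToyTask("B", run=lambda p: {"total": sum(p["data"]) * p["scale"]},
                     params={"scale": 1})

# The "pipeline": clone each template, then pass A's output into B's parameters.
step_a = template_a.clone()
out_a = step_a.execute()
step_b = template_b.clone(data=out_a["data"], scale=10)
out_b = step_b.execute()

print(out_b)  # {'total': 30}
```

The templates are left untouched, so each pipeline run leaves behind its own cloned steps that can be inspected or compared afterwards.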