No these are 3 different ways of building pipelines.
That is what I meant to say 🙂 , sorry for the confusion, @<1523701205467926528:profile|AgitatedDove14> .
@<1523701083040387072:profile|UnevenDolphin73> , your point is a strong one. What are clear situations in which pipelines can only be built from tasks, and not in one of the other ways? One idea: the tasks are created from all kinds of more or less unrelated projects, where the code that describes the pipeline does not have access to the code of (some of) the tasks. Is that a valid scenario, where one has to use "pipelines from tasks"?
What are clear points where one would not be able to use one or multiple of the three approaches?
I think -
- Creating a pipeline from tasks is useful when you already ran some of these tasks in a given format, and you want to replicate the exact behaviour (ignoring any new code changes for example), while potentially changing some parameters.
- From decorators - when the pipeline logic is very straightforward and you'd like to mostly leverage pipelines for parallel execution of computation graphs
- From functions - as I described earlier :)
Also, creating from functions allows dynamic pipeline creation without requiring the tasks to pre-exist in ClearML, which is IMO the strongest point to make about it
Scenario 1 & 2 are essentially the same from a caching perspective (the fact that B != B' means they have different caching hashes, but in both cases they are cached).
Scenario 3 is basically removing the cache flag from those components.
Not sure if I'm missing something.
Back to @<1523701083040387072:profile|UnevenDolphin73> :
From decorators - when the pipeline logic is very straightforward ...
Actually I would disagree: decorators should be used when the pipeline logic is not a DAG. The component itself can be extremely complex, and the decorated function is just a way to start the "main" of the component, which can rely on a totally different codebase. The main difference is that with both Tasks & functions the pipeline logic is actually a DAG, whereas with decorators the logic is free python code! This is really a game changer when you think about the capabilities: you can check results before deciding to continue, you can have adjustable loops and parallelization depending on arguments, etc.
Last point on component caching: what I suggest is actually providing users the ability to control the cache "function". Right now (a bit simplified, but probably accurate), this is equivalent to hashing the following dict:
{"code": "code here", "container": "docker image", "container args": "docker args", "hyper-parameters": "key/value"}
We could allow users to add a function that gets this dict and returns a new dict that will be used for hashing. This way we enable removing or changing fields, like ignoring the code or some of the arguments, and we add the ability to introduce new custom fields.
wdyt?
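To make it concrete, here is a pure-Python sketch of the idea (my own illustration, not the actual ClearML internals; the field names just follow the simplified dict above):

```python
import hashlib
import json

def default_cache_dict(component):
    # Simplified stand-in for what gets hashed today (per the dict above)
    return {
        "code": component["code"],
        "container": component["container"],
        "container args": component["container_args"],
        "hyper-parameters": component["hyper_parameters"],
    }

def cache_hash(component, transform=None):
    """Hash the cache dict, optionally passing it through a user transform."""
    cache_dict = default_cache_dict(component)
    if transform is not None:
        cache_dict = transform(cache_dict)  # user may drop, alter, or add fields
    payload = json.dumps(cache_dict, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

# A user transform that ignores code changes (scenario 3 in this thread):
def ignore_code(cache_dict):
    cache_dict = dict(cache_dict)
    cache_dict.pop("code")
    return cache_dict

step = {"code": "def b(): ...", "container": "python:3.10",
        "container_args": "", "hyper_parameters": {"lr": 0.1}}
refactored = dict(step, code="def b():  # refactored\n    ...")

# Same hash despite a code change, so the cached result would be reused
assert cache_hash(step, ignore_code) == cache_hash(refactored, ignore_code)
# Without the transform, the refactoring changes the hash and forces a rerun
assert cache_hash(step) != cache_hash(refactored)
```

The returned hash is what would be looked up against previous runs.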
Also full disclosure - I'm not part of the ClearML team and have only recently started using pipelines myself, so all of the above is just learnings from my own trials 🙂
@<1523701083040387072:profile|UnevenDolphin73> , you wrote
Well, I would say then that in the second scenario it's just rerunning the pipeline, and in the third it's not running it at all
(I mean, in both the code may have changed; the only difference is that in the second case you're interested in measuring it, and in the third,
you're not, so it sounds like a user-specific decision).
Well, I would hope that in the second scenario step A is not rerun. Yes, in the third scenario, nothing is rerun. Your text in parenthesis is correct.
In any case, while I understand now what Martin meant, I still feel the function-based pipelines are the strongest option, because it sounds like you're looking for a way to dynamically build your pipeline.
I think you might be mistaken, because @<1523701205467926528:profile|AgitatedDove14> referred in None to the decorator approach, at the point where he rewrote the code. In terms of creating dynamic pipelines and cyclic graphs, the decorator approach seems the most powerful to me.
What does
- The component code still needs to be self-composed (or, function component can also be quite complex)
Well, it can address the additional repo (it will be automatically added to the PYTHONPATH), and you can add auxiliary functions (as long as they are part of the initial pipeline script) by passing them to
helper_functions
mean? Is it not possible that I call code that is somewhere else on my local computer and/or in my code base? That makes things a bit complicated if my current repository is not somehow available to the agent.
@<1523701205467926528:profile|AgitatedDove14> , you wrote
- Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase
No, you can specify a different code base, see here:
Is the code in this "other" repo downloaded to the agent's machine? Or is the component's code pushed to the machine on which the repository is?
If the second case is true: how is the other machine (on which the other repo lies) turned into an agent?
I guess it depends on what you'd like to configure.
Since we let the user choose parents, component name, etc - we cannot use the decorators. We also infer required packages at runtime (the autodetection based on import statements fails with a non-trivial namespace) and need to set that to all components, so the decorators do not work for us.
Heh, my bad, the term "user" is very much ingrained in our internal way of working. You can think of it as basically any technically-inclined person in your team or company.
Indeed the options in the WebUI are too limited for our use case, so we've developed "apps" that take a yaml configuration file and build a matching pipeline.
With that, our users do not need to code directly, and we can offer much more fine control over the pipeline.
As for the imports, what I meant is that I encountered an issue with pigar (I think that's the name of the package) - a package that ClearML uses to infer the requirements for each component automagically.
In our case, we have a monorepo with a common namespace for all the modules. Pigar fails to properly identify those, so our Pipeline Generator© captures the packages available at runtime and forces all components to have the same requirements.
This results in a bit slower startup time, of course.
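Roughly, the runtime capture looks like this (a sketch using only the stdlib importlib.metadata, not our actual generator code):

```python
from importlib import metadata

def installed_requirements():
    """Snapshot every package in the current environment as pinned pip requirements."""
    return sorted(
        f"{dist.metadata['Name']}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    )

reqs = installed_requirements()
# Each entry is a pip-style pin that can be forced onto every component,
# sidestepping the per-component import autodetection entirely
assert all("==" in r for r in reqs)
```

The resulting list can then be set as the packages requirement of each component, which is why all our components end up with identical (and somewhat heavy) requirements.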
@<1523701083040387072:profile|UnevenDolphin73> : I am not sure who you mean by "user"? I am not aware that we are building an app... 🙂 Do you mean a person that reruns the entire pipeline, but with different parameters, from the Web UI? But here, we are not able to let the "user" configure all those things.
Is there some other way - that does not require any coding - to build pipelines (I am not aware)?
Also, when I build pipelines via tasks, the (same) imports had to be done in each task as well.
mean? Is it not possible that I call code that is somewhere else on my local computer and/or in my code base? That makes things a bit complicated if my current repository is not somehow available to the agent.
I guess you can ignore this argument for the sake of simple discussion. If you need access to extra files/functions, just make sure you point the repo
argument to their repo, and the agent will make sure your code is running from the repo root, with all the repo files under it. Make sense ?
Is the code in this "other" repo downloaded to the agent's machine? Or is the component's code pushed to the machine on which the repository is?
Yes this repo is downloaded into the agent, so your code has access to it
In terms of creating dynamic pipelines and cyclic graphs, the decorator approach seems the most powerful to me.
Yes that is correct, the decorator approach is the most powerful one, I agree.
@<1523701205467926528:profile|AgitatedDove14> : In general: if I do not build a package out of my local repository/project, I cannot reference anything from the local project/repository directly, right? I must make a package out of it, or I must reference it with the repo argument, or I must reference the respective functions using the helper_functions argument. Did I get this right?
The first scenario is your standard "the code stays the same, the configuration changes" for the second step. Here, I want the result of the unchanged first step to be reused.
The second and third scenarios are "the configuration stays the same, the code changes"; this is the case, e.g., if code is refactored but effectively does the same as before.
@<1523701083040387072:profile|UnevenDolphin73> , you wrote
About the third scenario I'm not sure. If the configuration has changed, shouldn't the relevant steps (the ones where the configuration changed and their dependent steps) be rerun?
I think this is a misunderstanding of my scenario.
In the second scenario I want a rerun, in the third not. For example,
- in the second scenario, I might have not changed the results of the step, but my refactoring changed the speed considerably and this is something I measure.
- in the third scenario, I might have not changed the results of the step and my refactoring just cleaned the code, but besides that, nothing substantially was changed. Thus I do not want a rerun.
@<1523701205467926528:profile|AgitatedDove14> , you wrote
Scenario 1 & 2 are essentially the same from a caching perspective (the fact that B != B' means they have different caching hashes, but in both cases they are cached).
Scenario 3 is basically removing the cache flag from those components.
I am not sure, but I think this is exactly not what I meant 🙂 . Scenario 3 is lenient regarding when to reuse old results. Did my explanation in this post clarify what I meant?
Ah, you meant "free python code" in that sense. Sure, I see that. The repo argument also exists for functions, though.
Sorry for hijacking your thread @<1523704157695905792:profile|VivaciousBadger56>
@<1523701083040387072:profile|UnevenDolphin73> : No, I love it ❤ . Now, I just have to read everything 🙂 .
Last point on component caching: what I suggest is actually providing users the ability to control the cache "function". Right now (a bit simplified, but probably accurate), this is equivalent to hashing the following dict:
{"code": "code here", "container": "docker image", "container args": "docker args", "hyper-parameters": "key/value"}
We could allow users to add a function that gets this dict and returns a new dict that will be used for hashing. This way we enable removing or changing fields, like ignoring the code or some of the arguments, and we add the ability to introduce new custom fields.
@<1523701205467926528:profile|AgitatedDove14> : Is the idea here the following? You want to use inversion-of-control such that I provide a function f to a component that takes the above dict as an input. Then I can do whatever I like inside the function f and return a different dict as output. If the output dict of f changes, the component is rerun; otherwise, the old output of the component is used?
I would like to add, but maybe, this is what you meant all along:
It would be great if you could search - among previously executed tasks - for a task which has the same f-output as my component's f-output, and use that old task's result; then, no new task is created from the component definition. Only if you cannot find such a task is the component rerun as a new task. In other words, f is like a query for a task.
This would be an awesome and pretty streamlined feature. I like it not only because of its flexibility, but because you could get rid of other caching rules. I like it a lot when an idea/concept is more general than other concepts, and at the same time removes those other concepts because of its generality.
What is helper_functions
in
By default the pipeline step function has no access to any of the other functions, by specifying additional functions here, the remote pipeline step could call the additional functions.
Does this mean that within component or add_function_step I cannot use any code of my current directories code base, only code from external packages that are imported - unless I add my code with helper_functions
?
@<1523701205467926528:profile|AgitatedDove14> Is it true that, when using the "pipeline from tasks" approach, the Python environment in which the pipeline is programmed does not need to know any of the code with which the tasks have been programmed, and still the respective pipeline would be executed just fine?
My current approach with pipelines basically looks like a GH CICD yaml config btw, so I give the user a lot of control on which steps to run, why, and how, and the default simply caches all results so as to minimize the number of reruns.
The user can then override and choose exactly what to do (or not do).
@<1523704157695905792:profile|VivaciousBadger56>
Is the idea here the following? You want to use inversion-of-control such that I provide a function f to a component that takes the above dict as an input. Then I can do whatever I like inside the function f and return a different dict as output. If the output dict of f changes, the component is rerun; otherwise, the old output of the component is used?
Yes exactly! This way you can say "the code stayed the same", i.e. either ignore it when you compare/hash previous steps, or have a string that represents the change and is invariant to the refactoring (not sure how one would do that; if I remember correctly this is NP-complete 🙂 ). Anyhow, just to explain how it works: the new returned dict is then hashed, and that hash is used to look for previous runs, hence cached execution. Make sense?
I am writing quite a bit of documentation on the topic of pipelines. I am happy to share the article here, once my questions are answered and we can make a pull request for the official documentation out of it.
Amazing please share once done, I will make sure we merge it into the docs!
Does this mean that within component or add_function_step I cannot use any code of my current directories code base, only code from external packages that are imported - unless I add my code with
helper_functions
?
Yes, I'll try to improve the docstring there.
- It is important to realize that each decorated function will end up packaged in a separate script file, and that script file will be running on the remote machine
- To the above script you can add a repo, so that the script file is running inside the repo.
- But let's assume that in the first script we want more than just the decorated function; aha! We add the additional functions in the helper_functions argument, and these functions will also be part of the standalone script file with our component. Does that make sense @<1523704157695905792:profile|VivaciousBadger56> ?
If I do not build a package out of my local repository/project, I cannot reference anything
No need to build a package from the repo, just pass it as the repo argument.
So for example:
@PipelineDecorator.component(return_values=['accuracy'], cache=True, task_type=TaskTypes.qc, repo="...")
def step_four(model, X_data, Y_data):
    print("yey")
What will happen is: the agent will pull the " None " repo into a target folder (say ~/code), then it will create a new file called "step_four.py" and add it to the same ~/code folder.
Then it will run something like: cd ~/code && PYTHONPATH=~/code python step_four.py
Make sense ?
- in the second scenario, I might have not changed the results of the step, but my refactoring changed the speed considerably and this is something I measure.
- in the third scenario, I might have not changed the results of the step and my refactoring just cleaned the code, but besides that, nothing substantially was changed. Thus I do not want a rerun.
Well, I would say then that in the second scenario it's just rerunning the pipeline, and in the third it's not running it at all 🙂
(I mean, in both the code may have changed; the only difference is that in the second case you're interested in measuring it, and in the third, you're not, so it sounds like a user-specific decision).
In any case, while I understand now what Martin meant, I still feel the function-based pipelines are the strongest option, because it sounds like you're looking for a way to dynamically build your pipeline.
So caching results for steps with the same arguments is trivial. Ultimately I would say you can combine the task-based pipeline with a function-based pipeline to achieve such dynamic control as you specified in the first two scenarios.
About the third scenario I'm not sure. If the configuration has changed, shouldn't the relevant steps (the ones where the configuration changed and their dependent steps) be rerun?
In any case, I think if you stay away from the decorators, at the cost of a bit more coding, you can achieve your wishes.
-
Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase
No, you can specify a different code base, see here:
None -
The component code still needs to be self-composed (or, function component can also be quite complex)
Well, it can address the additional repo (it will be automatically added to the PYTHONPATH), and you can add auxiliary functions (as long as they are part of the initial pipeline script) by passing them to
helper_functions
None -
Decorators do not allow any dynamic build, because you must know how the components are connected at decoration time
Well, this is like any other python code: you define the functions before you use them, but you do not have to use them (it is the pipeline logic itself driving this). Like any other python code, if you do not call a (decorated) function, it will not be executed.
With that said, it could be that the provided examples are overly simplistic.
For sure!
check results before deciding to continue, ... have adjustable loops and parallelization depending on arguments
None
Rephrased as:
X_train, X_test, y_train, y_test, some_value = step_two(data_frame)
if int(some_value) > 1337:
    print("this is something special here, let's train another model")
    model = step_four(X_train*2, y_train*2)
else:
    print('launch step three')
    model = step_three(X_train, y_train)
This code will be executed just like a regular python function, only the return values are deferred: when the code casts to int (here explicitly, so it is easier to see), the code execution waits for the function (component) to complete execution (on another machine), fetches the return value, tests the result and decides what to do.
Does that make sense? Basically python execution on multi-node in a transparent way (at scale).
Hi @<1523704157695905792:profile|VivaciousBadger56>
No these are 3 different ways of building pipelines.
Creating from decorators is recommended when each component can be easily packaged into a single function (every function can have an accompanying repository).
Here the idea is that it is very easy to write complex execution logic; basically the automagic does serialization/deserialization, so you can write pipelines like you would code python.
Creating from Tasks is a good match if you need to just run Tasks in a DAG manner, where input and outputs are connected, and the execution logic is basically the DAG. It does mean the components (i.e. Tasks) are aware of the way inputs/outputs are passed (i.e. hyper parameters, artifacts / models etc.)
Creating from functions is a middle ground between these two approaches: it is DAG execution where each component is a standalone function (again, you can attach a git repository and a base docker image per component).
Does that help?
I'm not sure how the decorators achieve that; from the available examples and trials I've done, it seems that:
- Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase
- The component code still needs to be self-composed (or, function component can also be quite complex)
- Decorators do not allow any dynamic build, because you must know how the components are connected at decoration time
With that said, it could be that the provided examples are overly simplistic. At the moment I do not see how one can "check results before deciding to continue, ... have adjustable loops and parallelization depending on arguments"; at least the latter half is more easily doable with the non-decorator approach.
Can you provide a more realistic code example @<1523701205467926528:profile|AgitatedDove14> ? I'd love to simplify and extend our usage of pipelines, but I haven't seen this functionality at all.
@<1523701083040387072:profile|UnevenDolphin73> : A big point for me is to reuse/cache those artifacts/datasets/models that need to be passed between the steps, but have been produced by colleagues' executions at some earlier point. So for example, let the pipeline be A(a) -> B(b) -> C(c), where A,B,C are steps and their code, excluding configurations/parameters, and a,b,c are the configurations/parameters. Then I might have the situation, that my colleague ran the pipeline A(a) -> B(b) -> C(c).
- Scenario 1: I run A(a) -> B(b') -> C(c) and I want that A(a) is not rerun, but its result is reused/cached, and only B(b') -> C(c) is run.
- Scenario 2: I run A(a) -> B'(b) -> C(c) and I want that A(a) is not rerun, but its result reused/cached and only B'(b) -> C(c) is run.
- Scenario 3: I run A(a) -> B'(b) -> C(c) and I want that nothing is rerun. Here, I only want changes to the configuration, but not to the code, to be considered.
Which of the pipeline approaches can be used for which scenario?
(Yes, they are not academic hypothetical, I have those cases in real life.)