ClearML Pipelines Can Be Built From Tasks, Functions, And Decorated Functions, According To The Examples In

ClearML pipelines can be built from tasks, functions, and decorated functions, according to the examples. I am guessing there was a certain way of doing things that allegro.ai started with, and then there was a pain point that the other approaches solved. So, what are the rules of thumb for when to use which approach?

  
  
Posted one year ago

Answers 31


Hi @<1523704157695905792:profile|VivaciousBadger56>
No, these are 3 different ways of building pipelines.
Creating from decorators is recommended when each component can be easily packaged into a single function (every function can have an accompanying repository).
Here the idea is that it is very easy to write complex execution logic: the automagic handles serialization/deserialization, so you can write pipelines the way you would write Python code.

Creating from Tasks is a good match if you need to just run Tasks in a DAG manner, where input and outputs are connected, and the execution logic is basically the DAG. It does mean the components (i.e. Tasks) are aware of the way inputs/outputs are passed (i.e. hyper parameters, artifacts / models etc.)

Creating from functions is a middle ground between these two approaches: it is DAG execution where each component is a standalone function (again, you can attach a git repository and a base docker image per component).

Does that help?
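As a rough sketch of the two DAG-style approaches described above (from functions and from Tasks), assuming the public `PipelineController` API of the `clearml` package. The project, step, task, and parameter names are hypothetical placeholders, and the task-based variant assumes that Tasks with those names already exist on the server:

```python
def load_numbers(count):
    """Standalone step function: produce some data (runs as its own Task)."""
    return list(range(int(count)))


def sum_numbers(numbers):
    """Standalone step function: consume the previous step's return value."""
    return sum(numbers)


def build_function_pipeline():
    """DAG of standalone functions; no pre-existing Tasks are required."""
    from clearml import PipelineController  # lazy import; needs `pip install clearml`

    pipe = PipelineController(name="demo-from-functions", project="demo", version="0.0.1")
    pipe.add_parameter(name="count", default=10)
    pipe.add_function_step(
        name="load",
        function=load_numbers,
        function_kwargs=dict(count="${pipeline.count}"),
        function_return=["numbers"],
        cache_executed_step=True,
    )
    pipe.add_function_step(
        name="sum",
        function=sum_numbers,
        function_kwargs=dict(numbers="${load.numbers}"),  # output of the "load" step
        function_return=["total"],
        cache_executed_step=True,
    )
    return pipe


def build_task_pipeline():
    """DAG of existing Tasks; the Task names below are hypothetical."""
    from clearml import PipelineController  # lazy import; needs `pip install clearml`

    pipe = PipelineController(name="demo-from-tasks", project="demo", version="0.0.1")
    pipe.add_step(
        name="stage_data",
        base_task_project="demo",
        base_task_name="prepare data",
        cache_executed_step=True,
    )
    pipe.add_step(
        name="stage_train",
        parents=["stage_data"],
        base_task_project="demo",
        base_task_name="train model",
        # The Task itself must know that it reads this hyper-parameter.
        parameter_override={"General/dataset_url": "${stage_data.artifacts.dataset.url}"},
    )
    return pipe
```

With a configured ClearML server, something like `build_function_pipeline().start_locally(run_pipeline_steps_locally=True)` should run the function-based variant in the local process for debugging.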

  
  
Posted one year ago

Also, creating from functions allows dynamic pipeline creation without requiring the tasks to pre-exist in ClearML, which is IMO the strongest point to make about it

  
  
Posted one year ago

No these are 3 different ways of building pipelines.

That is what I meant to say 🙂 , sorry for the confusion, @<1523701205467926528:profile|AgitatedDove14> .

@<1523701083040387072:profile|UnevenDolphin73> , your point is a strong one. What are clear situations in which pipelines can only be built from tasks, and not in one of the other ways? An idea would be if the tasks are created from all kinds of - kind of - unrelated projects where the code that describes the pipeline does not have access to the code of (some of) the tasks. Is that a valid scenario, where one has to use "pipelines from tasks"?

What are clear points where one would not be able to use one or multiple of the three approaches?

  
  
Posted one year ago

I think -

  • Creating a pipeline from tasks is useful when you already ran some of these tasks in a given format, and you want to replicate the exact behaviour (ignoring any new code changes for example), while potentially changing some parameters.
  • From decorators - when the pipeline logic is very straightforward and you'd like to mostly leverage pipelines for parallel execution of computation graphs
  • From functions - as I described earlier :)
  
  
Posted one year ago

@<1523701083040387072:profile|UnevenDolphin73> : A big point for me is to reuse/cache those artifacts/datasets/models that need to be passed between the steps, but have been produced by colleagues' executions at some earlier point. So for example, let the pipeline be A(a) -> B(b) -> C(c), where A,B,C are steps and their code, excluding configurations/parameters, and a,b,c are the configurations/parameters. Then I might have the situation, that my colleague ran the pipeline A(a) -> B(b) -> C(c).

  • Scenario 1: I run A(a) -> B(b') -> C(c) and I want A(a) not to be rerun, but its result reused/cached, so that only B(b') -> C(c) is run.
  • Scenario 2: I run A(a) -> B'(b) -> C(c) and I want A(a) not to be rerun, but its result reused/cached, so that only B'(b) -> C(c) is run.
  • Scenario 3: I run A(a) -> B'(b) -> C(c) and I want nothing to be rerun. Here, I want only changes to the configuration, but not to the code, to be considered.

Which of the pipeline approaches can be used for which scenario?
(Yes, these are not academic hypotheticals; I have those cases in real life.)
  
  
Posted one year ago

So caching results for steps with the same arguments is trivial. Ultimately I would say you can combine the task-based pipeline with a function-based pipeline to achieve such dynamic control as you specified in the first two scenarios.

About the third scenario I'm not sure. If the configuration has changed, shouldn't the relevant steps (the ones where the configuration changed and their dependent steps) be rerun?

In any case, I think if you stay away from the decorators, at the cost of a bit more coding, you can achieve your wishes.

  
  
Posted one year ago

Also full disclosure - I'm not part of the ClearML team and have only recently started using pipelines myself, so all of the above is just learnings from my own trials 😅

  
  
Posted one year ago

Scenario 1 & 2 are essentially the same from a caching perspective (the fact that B != B' means they have different caching hashes, but in both cases the steps are cached).
Scenario 3 is basically removing the cache flag from those components.

Not sure if I'm missing something.

Back to @<1523701083040387072:profile|UnevenDolphin73>:

From decorators - when the pipeline logic is very straightforward ...

Actually, I would disagree. The decorators should be used when the pipeline logic is not a DAG. The component itself can be extremely complex, and the decorated function is just a way to start the "main" of the component, which can rely on a totally different codebase. The main difference is that with both Tasks and functions the pipeline logic is actually a DAG, whereas with decorators the logic is free Python code! This is really a game changer when you think about the capabilities: you can check results before deciding to continue, you can have adjustable loops and parallelization depending on arguments, etc.

Last point on component caching, what I suggest is actually providing users the ability to control the cache "function". Right now (a bit simplified but probably accurate), this is equivalent to hashing of the following dict:

{"code": "code here", "container": "docker image", "container args": "docker args", "hyper-parameters": "key/value"}

We could allow users to add a function that gets this dict and returns a new dict that will be used for hashing. This way we will enable removing or changing fields, like ignoring code or some of the arguments, and adding new custom fields.
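To make the proposal concrete, here is a minimal pure-Python sketch. It is not an existing ClearML API: `cache_key`, `ignore_code`, and the exact hashing scheme (SHA-256 over sorted JSON) are assumptions made for illustration only.

```python
import hashlib
import json


def cache_key(step, transform=None):
    """Hash a step description; `transform` may drop, change, or add fields first."""
    fields = dict(step)
    if transform is not None:
        fields = transform(fields)
    payload = json.dumps(fields, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def ignore_code(fields):
    """Hypothetical user hook: treat refactored code as unchanged."""
    return {key: value for key, value in fields.items() if key != "code"}


original = {
    "code": "def b(x): return x + 1",
    "container": "python:3.10",
    "container args": "",
    "hyper-parameters": {"lr": 0.1},
}
# Same behaviour, different source text.
refactored = dict(original, code="def b(value):\n    return value + 1")

print(cache_key(original) == cache_key(refactored))                            # False
print(cache_key(original, ignore_code) == cache_key(refactored, ignore_code))  # True
```

With the default key, any code change invalidates the cache; with the user-supplied hook, only the container and the hyper-parameters decide whether an earlier result can be reused.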

wdyt?

  
  
Posted one year ago

I'm not sure how the decorators achieve that; from the available examples and trials I've done, it seems that:

  • Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase
  • The component code still needs to be self-composed (or, function component can also be quite complex)
  • Decorators do not allow any dynamic build, because you must know how the components are connected at decoration time

With that said, it could be that the provided examples are overly simplistic. At the moment I do not see how one can "check results before deciding to continue, ... have adjustable loops and parallelization depending on arguments"; at least the latter half is more easily doable with the non-decorator approach.

Can you provide a more realistic code example @<1523701205467926528:profile|AgitatedDove14> ? I'd love to simplify and extend our usage of pipelines, but I haven't seen this functionality at all.

  
  
Posted one year ago

  • Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase

No, you can specify a different code base for a component through the repo argument.

  • The component code still needs to be self-composed (or, function component can also be quite complex)

Well, it can address the additional repo (it will be automatically added to the PYTHONPATH), and you can add auxiliary functions (as long as they are part of the initial pipeline script) by passing them to helper_functions.

  • Decorators do not allow any dynamic build, because you must know how the components are connected at decoration time

Well, this is like any other Python code: you define the functions before you use them, but you do not have to use them (the pipeline logic itself drives that). Like any other Python code, if you do not call a (decorated) function, it will not be executed.

With that said, it could be that the provided examples are overly simplistic.

For sure!

check results before deciding to continue, ... have adjustable loops and parallelization depending on arguments

The decorator example, rephrased as:

    X_train, X_test, y_train, y_test, some_value = step_two(data_frame)

    if int(some_value) > 1337:
        print("this is something special here, let's train another model")
        model = step_four(X_train*2, y_train*2)
    else:
        print('launch step three')
        model = step_three(X_train, y_train)

This code will be executed just like a regular Python function, only the return values are deferred. When the code casts to int (here explicitly, so it is easier to see), execution waits for the function (component) to complete on another machine, fetches the return value, tests against the result, and decides what to do.
Does that make sense? Basically Python execution on multiple nodes in a transparent way (at scale).
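The deferred-return-value behaviour can be imitated in plain Python to see the control flow. This is only an analogy using a local thread pool, not how ClearML implements it; `Deferred` and `component` are hypothetical names invented for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


class Deferred:
    """Stand-in for a lazily resolved step result."""

    def __init__(self, future):
        self._future = future

    def __int__(self):
        # Casting forces a wait until the background step has finished.
        return int(self._future.result())


def component(func):
    """Hypothetical decorator: launch the call in the background, return a proxy."""
    def launch(*args, **kwargs):
        return Deferred(_pool.submit(func, *args, **kwargs))
    return launch


@component
def step_two(x):
    return x * 2


def pipeline_logic(x):
    some_value = step_two(x)        # returns immediately with a proxy
    if int(some_value) > 1337:      # blocks here until step_two is done
        return "special branch"
    return "regular branch"


print(pipeline_logic(1000))  # special branch
print(pipeline_logic(10))    # regular branch
```

The branch taken depends on a value that only exists once the step has completed, which is what a fixed DAG cannot express.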

  
  
Posted one year ago

Ah, you meant “free python code” in that sense. Sure, I see that. The repo arguments also exist for functions though.

Sorry for hijacking your thread @<1523704157695905792:profile|VivaciousBadger56>

  
  
Posted one year ago

@<1523701083040387072:profile|UnevenDolphin73> : No, I love it ❤ . Now, I just have to read everything 😄 .

  
  
Posted one year ago

The first scenario is your standard "the code stays the same, the configuration changes" for the second step. Here, I want
The second and third scenarios are "the configuration stays the same, the code changes"; this is the case, e.g., if code is refactored but effectively does the same as before.

@<1523701083040387072:profile|UnevenDolphin73> , you wrote

About the third scenario I'm not sure. If the configuration has changed, shouldn't the relevant steps (the ones where the configuration changed and their dependent steps) be rerun?

I think this is a misunderstanding of my scenario.

In the second scenario I want a rerun, in the third not. For example,

  • in the second scenario, I might have not changed the results of the step, but my refactoring changed the speed considerably and this is something I measure.
  • in the third scenario, I might have not changed the results of the step and my refactoring just cleaned the code, but besides that, nothing substantially was changed. Thus I do not want a rerun.

@<1523701205467926528:profile|AgitatedDove14> , you wrote

Scenario 1 & 2 are essentially the same from a caching perspective (the fact that B != B' means they have different caching hashes, but in both cases the steps are cached).
Scenario 3 is basically removing the cache flag from those components.

I am not sure, but I think this is exactly not what I meant 😄 . Scenario 3 is lenient regarding when to reuse old results. Did my explanation in this post clarify what I meant?

  
  
Posted one year ago

  • in the second scenario, I might have not changed the results of the step, but my refactoring changed the speed considerably and this is something I measure.
  • in the third scenario, I might have not changed the results of the step and my refactoring just cleaned the code, but besides that, nothing substantially was changed. Thus I do not want a rerun.

Well, I would say then that in the second scenario it’s just rerunning the pipeline, and in the third it’s not running it at all 😄
(I mean, in both the code may have changed, the only difference is that in the second case you’re interested in measuring it, and in the third, you’re not, so it sounds like a user-specific decision).

In any case, while I understand now what Martin meant, I still feel the function-based pipelines are the strongest option, because it sounds like you’re looking for a way to dynamically build your pipeline.

  
  
Posted one year ago

My current approach with pipelines basically looks like a GH CICD yaml config btw, so I give the user a lot of control on which steps to run, why, and how, and the default simply caches all results so as to minimize the number of reruns.

The user can then override and choose exactly what to do (or not do).
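A configuration-driven builder along these lines might look like the sketch below. The config schema, the `STEP_FUNCTIONS` registry, and the `RecordingController` test double are all invented for illustration; the only ClearML call assumed is `add_function_step` with its documented keyword arguments, and a plain dict stands in for the parsed YAML file.

```python
def clean(rows):
    """Example step: drop missing rows."""
    return [row for row in rows if row is not None]


def count(rows):
    """Example step: count the remaining rows."""
    return len(rows)


# Hypothetical registry mapping config names to step functions.
STEP_FUNCTIONS = {"clean": clean, "count": count}

# Stand-in for a parsed YAML file; the schema is made up for this sketch.
CONFIG = {
    "steps": [
        {"name": "clean", "function": "clean", "parents": [], "cache": True,
         "kwargs": {"rows": "${pipeline.rows}"}, "returns": ["rows"]},
        {"name": "count", "function": "count", "parents": ["clean"], "cache": False,
         "kwargs": {"rows": "${clean.rows}"}, "returns": ["n"]},
    ]
}


def add_steps(pipe, config, registry=STEP_FUNCTIONS):
    """Register every configured step on a PipelineController-like object."""
    for step in config["steps"]:
        pipe.add_function_step(
            name=step["name"],
            function=registry[step["function"]],
            function_kwargs=dict(step.get("kwargs", {})),
            function_return=list(step.get("returns", [])),
            parents=list(step.get("parents", [])),
            cache_executed_step=step.get("cache", True),  # cache by default
        )
    return pipe


class RecordingController:
    """Test double that records calls instead of talking to a server."""

    def __init__(self):
        self.steps = []

    def add_function_step(self, **kwargs):
        self.steps.append(kwargs)


recorder = add_steps(RecordingController(), CONFIG)
print([step["name"] for step in recorder.steps])  # ['clean', 'count']
```

Passing a real `PipelineController` instead of the recorder should register the same steps, with caching on by default and overridable per step.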

  
  
Posted one year ago

Last point on component caching, what I suggest is actually providing users the ability to control the cache "function". Right now (a bit simplified but probably accurate), this is equivalent to hashing of the following dict:

{"code": "code here", "container": "docker image", "container args": "docker args", "hyper-parameters": "key/value"}

We could allow users to add a function that gets this dict and returns a new dict that will be used for hashing. This way we will enable removing or changing fields, like ignoring code or some of the arguments, and adding new custom fields.

@<1523701205467926528:profile|AgitatedDove14> : Is the idea here the following? You want to use inversion-of-control such that I provide a function f to a component that takes the above dict as an input. Then I can do whatever I like inside the function f and return a different dict as output. If the output dict of f changes, the component is rerun; otherwise, the old output of the component is used?

I would like to add, but maybe, this is what you meant all along:

It would be great if you could search - among previously executed tasks - for a task which has the same f-output as my component's f-output and use that old task's result; then, no new task is created from the component definition. Only if you cannot find such a task is the component rerun as a new task. In other words, f is like a query for a task.
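The "f is like a query" idea can be sketched with an in-memory stand-in for the store of previously executed tasks. Everything here (`StepCache`, `fingerprint`, `config_only`) is hypothetical and only illustrates the lookup logic, not an existing ClearML feature.

```python
import hashlib
import json


def fingerprint(fields):
    """Stable hash of a dict of step fields."""
    payload = json.dumps(fields, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StepCache:
    """In-memory stand-in for searching previously executed tasks."""

    def __init__(self, f):
        self._f = f            # user-supplied function: dict -> dict
        self._results = {}     # fingerprint of f-output -> stored result
        self.executions = 0

    def run(self, step_fields, execute):
        key = fingerprint(self._f(step_fields))
        if key not in self._results:   # no matching earlier task: run the step
            self.executions += 1
            self._results[key] = execute()
        return self._results[key]      # otherwise reuse the old result


def config_only(fields):
    """Hypothetical f: only the configuration decides whether a rerun is needed."""
    return {"hyper-parameters": fields["hyper-parameters"]}


cache = StepCache(config_only)
first = cache.run({"code": "v1", "hyper-parameters": {"b": 1}}, lambda: "result-1")
second = cache.run({"code": "v2 (refactored)", "hyper-parameters": {"b": 1}}, lambda: "result-2")
third = cache.run({"code": "v2 (refactored)", "hyper-parameters": {"b": 2}}, lambda: "result-3")
print(first, second, third, cache.executions)  # result-1 result-1 result-3 2
```

The second call corresponds to the lenient scenario (refactored code, same configuration, no rerun), while the third call changes the configuration and therefore executes again.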

This would be an awesome and pretty streamlined feature. I like it not only because of its flexibility, but also because you could get rid of other caching rules. I like it a lot when an idea/concept is more general than other concepts and, because of its generality, makes those other concepts unnecessary.

  
  
Posted one year ago

@<1523701205467926528:profile|AgitatedDove14> , you wrote

  • Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase

No, you can specify a different code base, see here:

Is the code in this "other" repo downloaded to the agent's machine? Or is the component's code pushed to the machine on which the repository is?
If the second case is true: How is the other machine (on which the other repo resides) turned into an agent?

  
  
Posted one year ago

What does

  • The component code still needs to be self-composed (or, function component can also be quite complex)

Well, it can address the additional repo (it will be automatically added to the PYTHONPATH), and you can add auxiliary functions (as long as they are part of the initial pipeline script) by passing them to helper_functions

mean? Is it not possible that I call code that is somewhere else on my local computer and/or in my code base? That makes things a bit complicated if my current repository is not somehow available to the agent.

  
  
Posted one year ago

@<1523701083040387072:profile|UnevenDolphin73> , you wrote

Well, I would say then that in the second scenario it’s just rerunning the pipeline, and in the third it’s not running it at all

(I mean, in both the code may have changed, the only difference is that in the second case you’re interested in measuring it, and in the third, you’re not, so it sounds like a user-specific decision).

Well, I would hope that in the second scenario step A is not rerun. Yes, in the third scenario, nothing is rerun. Your text in parenthesis is correct.


In any case, while I understand now what Martin meant, I still feel the function-based pipelines are the strongest option, because it sounds like you’re looking for a way to dynamically build your pipeline.

I think you might be mistaken, because @<1523701205467926528:profile|AgitatedDove14> referred to the decorator approach at the point where he rewrote the code. In terms of creating dynamic pipelines and cyclic graphs, the decorator approach seems the most powerful to me.

  
  
Posted one year ago

I guess it depends on what you'd like to configure.
Since we let the user choose parents, component name, etc - we cannot use the decorators. We also infer required packages at runtime (the autodetection based on import statements fails with a non-trivial namespace) and need to set that to all components, so the decorators do not work for us.

  
  
Posted one year ago

@<1523701083040387072:profile|UnevenDolphin73> : I am not sure who you mean by "user"? I am not aware that we are building an app... 😄 Do you mean a person that reruns the entire pipeline but with different parameters from the Web UI? But here, we are not able to let the "user" configure all those things.

Is there some other way - that does not require any coding - to build pipelines (I am not aware)?

Also, when I build pipelines via tasks, the (same) imports had to be done in each task as well.

  
  
Posted one year ago

Heh, my bad, the term "user" is very much ingrained in our internal way of working. You can think of it as basically any technically-inclined person in your team or company.

Indeed, the options in the WebUI are too limited for our use case, so we've developed "apps" that take a yaml configuration file and build a matching pipeline.
With that, our users do not need to code directly, and we can offer much more fine control over the pipeline.

As for the imports, what I meant is that I encountered an issue with pigar (I think that's the name of the package) - a package that ClearML uses to infer the requirements for each component automagically.
In our case, we have a monorepo with a common namespace for all the modules. Pigar fails to properly identify those, so our Pipeline Generator© captures the packages available at runtime and forces all components to have the same requirements.
This results in a bit slower startup time, of course.
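Capturing the packages available at runtime can be done with the standard library alone. The sketch below is one possible way to do it, not the poster's actual generator; the function name is hypothetical.

```python
from importlib import metadata


def installed_requirements():
    """Return pinned 'name==version' strings for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip broken installs with incomplete metadata
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


requirements = installed_requirements()
print(len(requirements), "packages captured")
```

As far as I know, both `add_function_step` and `PipelineDecorator.component` accept a `packages` argument, so a list like this could be passed to every component to bypass the automatic detection; treat that wiring as an assumption to verify against your ClearML version.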

  
  
Posted one year ago

@<1523704157695905792:profile|VivaciousBadger56>

Is the idea here the following? You want to use inversion-of-control such that I provide a function f to a component that takes the above dict as an input. Then I can do whatever I like inside the function f and return a different dict as output. If the output dict of f changes, the component is rerun; otherwise, the old output of the component is used?

Yes, exactly! This way you can say "the code stayed the same", i.e. either ignore it when you compare/hash previous steps, or have a string that represents the change and is invariant to the refactoring (not sure how one would do that; if I remember correctly this is NP-complete 😛 ). Anyhow, just to explain how it works: the new returned dict is then hashed, and that hash is used to look for previous runs, hence cached execution. Make sense?

  
  
Posted one year ago

Is the code in this "other" repo downloaded to the agent's machine? Or is the component's code pushed to the machine on which the repository is?

Yes, this repo is downloaded onto the agent's machine, so your code has access to it.

  
  
Posted one year ago

mean? Is it not possible that I call code that is somewhere else on my local computer and/or in my code base? That makes things a bit complicated if my current repository is not somehow available to the agent.

I guess you can ignore this argument for the sake of a simple discussion. If you need access to extra files/functions, just make sure you point the repo argument to their repo, and the agent will make sure your code is running from the repo root, with all the repo files under it. Make sense?
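A minimal sketch of how the repo argument and helper_functions might be combined on a function step, assuming the `add_function_step` keyword arguments `helper_functions`, `repo`, and `repo_branch`. The repository URL, project, and function names are hypothetical.

```python
def normalize(values):
    """Auxiliary function defined in the same script as the pipeline."""
    top = max(values)
    return [value / top for value in values]


def scale_step(values):
    """Step function; it calls a helper that must be shipped along with it."""
    return normalize(values)


def build_pipeline():
    from clearml import PipelineController  # lazy import; needs `pip install clearml`

    pipe = PipelineController(name="demo-helpers", project="demo", version="0.0.1")
    pipe.add_function_step(
        name="scale",
        function=scale_step,
        function_kwargs=dict(values=[1, 2, 4]),
        function_return=["scaled"],
        helper_functions=[normalize],  # make normalize() available to the remote step
        repo="https://github.com/example/shared-code.git",  # hypothetical URL
        repo_branch="main",
    )
    return pipe


# Locally the step is an ordinary function call.
print(scale_step([1, 2, 4]))  # [0.25, 0.5, 1.0]
```

Here the helper travels with the step because it is listed explicitly, while anything else the step imports would have to come from the referenced repository or from installed packages.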

  
  
Posted one year ago

In terms of creating dynamic pipelines and cyclic graphs, the decorator approach seems the most powerful to me.

Yes that is correct, the decorator approach is the most powerful one, I agree.

  
  
Posted one year ago

@<1523701205467926528:profile|AgitatedDove14> Is it true that, when using the "pipeline from tasks" approach, my Python environment in which the pipeline is programmed, does not need to know any of the code with which the tasks have been programmed and still the respective pipeline would be executed just fine?

  
  
Posted one year ago

What is helper_functions good for?

I do not find any example. The description says

By default the pipeline step function has no access to any of the other functions, by specifying additional functions here, the remote pipeline step could call the additional functions.

Does this mean that within component or add_function_step I cannot use any code of my current directory's code base, only code from external packages that are imported - unless I add my code with helper_functions?

  
  
Posted one year ago

@<1523701205467926528:profile|AgitatedDove14> : I am writing quite a bit of documentation on the topic of pipelines. I am happy to share the article here, once my questions are answered and we can make a pull request for the official documentation out of it.

  
  
Posted one year ago

@<1523701205467926528:profile|AgitatedDove14> : In general: If I do not build a package out of my local repository/project, I cannot reference anything from the local project/repository directly, right? I must make a package out of it, or I must reference it with the repo argument, or I must reference the respective functions using the helper_functions argument. Did I get this right?

  
  
Posted one year ago