ClearML Pipelines Can Be Built From Tasks, Functions, And Decorated Functions, According To The Examples In

I think -

Creating a pipeline from tasks is useful when you already ran some of these tasks in a given format, and you want to replicate the exact behaviour (ignoring any new code changes for example), while potentially changing some parameters.
From decorators - when the pipeline logic is very straightforward and you'd like to mostly leverage pipelines for parallel execution of computation graphs
From functions - as I described earlier :)

  				
Posted 2 years ago by UnevenDolphin73

So caching results for steps with the same arguments is trivial. Ultimately I would say you can combine the task-based pipeline with a function-based pipeline to achieve such dynamic control as you specified in the first two scenarios.

About the third scenario I'm not sure. If the configuration has changed, shouldn't the relevant steps (the ones where the configuration changed and their dependent steps) be rerun?

In any case, I think if you stay away from the decorators, at the cost of a bit more coding, you can achieve what you want.

  				
Posted 2 years ago by UnevenDolphin73

Scenario 1 & 2 are essentially the same from a caching perspective (the fact that B != B' means they have different caching hashes, but in both cases they are cached).
Scenario 3 is basically removing the cache flag from those components.

Not sure if I'm missing something.

Back to @UnevenDolphin73:

From decorators - when the pipeline logic is very straightforward ...

Actually I would disagree. Decorators should be used when the pipeline logic is not a DAG. The component itself can be extremely complex; the decorated function is just a way to start the "main" of the component, which can rely on a totally different codebase. The main difference is that with both tasks and functions the pipeline logic is actually a DAG, whereas with decorators the logic is free Python code! This is really a game changer when you think about the capabilities: you can check results before deciding to continue, you can have adjustable loops and parallelization depending on arguments, etc.

Last point on component caching, what I suggest is actually providing users the ability to control the cache "function". Right now (a bit simplified but probably accurate), this is equivalent to hashing of the following dict:

{"code": "code here", "container": "docker image", "container args": "docker args", "hyper-parameters": "key/value"}

We could allow users to add a function that gets this dict and returns a new dict that will be used for hashing. This would enable removing or changing fields, like ignoring the code or some of the arguments, and adding new custom fields.
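To make the idea concrete, here is a minimal sketch in plain Python. It is not the actual ClearML implementation: the `cache_key` and `ignore_code` names, the use of SHA-256, and the JSON serialization are all assumptions made for illustration. It only shows how a user-supplied function could reshape the dict before it is hashed.

```python
import hashlib
import json
from typing import Callable, Optional


def cache_key(step: dict, transform: Optional[Callable[[dict], dict]] = None) -> str:
    """Hash a step description, optionally after a user-supplied transform (hypothetical hook)."""
    fields = transform(dict(step)) if transform else step
    # sort_keys makes the hash independent of dict insertion order
    payload = json.dumps(fields, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def ignore_code(step: dict) -> dict:
    """Example transform: drop the code field so a refactoring keeps the same hash."""
    return {key: value for key, value in step.items() if key != "code"}


original = {
    "code": "def b(x): return x + 1",
    "container": "docker image",
    "container args": "docker args",
    "hyper-parameters": {"lr": 0.1},
}
# Same configuration, refactored code
refactored = dict(original, code="def b(value):\n    return value + 1")

default_changed = cache_key(original) != cache_key(refactored)
lenient_same = cache_key(original, ignore_code) == cache_key(refactored, ignore_code)
```

With the default hashing the refactored step gets a new hash, so it would be rerun; with `ignore_code` both versions hash to the same value, so the earlier result could be reused.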

wdyt?

  				
Posted 2 years ago by AgitatedDove14

@AgitatedDove14: I am writing quite a bit of documentation on the topic of pipelines. I am happy to share the article here once my questions are answered, and we can make a pull request for the official documentation out of it.

  				
Posted 2 years ago by VivaciousBadger56

Is the code in this "other" repo downloaded to the agent's machine? Or is the component's code pushed to the machine on which the repository is?

Yes, this repo is downloaded to the agent's machine, so your code has access to it.

  				
Posted 2 years ago by AgitatedDove14

My current approach with pipelines basically looks like a GH CICD yaml config btw, so I give the user a lot of control on which steps to run, why, and how, and the default simply caches all results so as to minimize the number of reruns.

The user can then override and choose exactly what to do (or not do).
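Roughly, the planning part of such a setup could look like the sketch below. It is a simplified illustration in plain Python, not our actual code: the `CONFIG` layout, the `plan` function and the `force_rerun` override are made-up names, and the config is a dict here instead of a YAML file.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical step configuration: parents define the graph, cache is on by default
CONFIG = {
    "load": {"parents": [], "cache": True},
    "train": {"parents": ["load"], "cache": True},
    "eval": {"parents": ["train"], "cache": True},
}


def plan(config, cached_steps, force_rerun=()):
    """Return the steps that must run, in dependency order."""
    graph = {name: spec["parents"] for name, spec in config.items()}
    to_run = []
    for name in TopologicalSorter(graph).static_order():
        spec = config[name]
        # A step must rerun if any of its parents reruns
        stale_parent = any(parent in to_run for parent in spec["parents"])
        reusable = spec["cache"] and name in cached_steps and name not in force_rerun
        if stale_parent or not reusable:
            to_run.append(name)
    return to_run


everything_cached = {"load", "train", "eval"}
nothing_to_do = plan(CONFIG, everything_cached)
user_override = plan(CONFIG, everything_cached, force_rerun={"train"})
```

By default nothing is rerun when all results are cached; forcing `train` reruns `train` and everything downstream of it.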

  				
Posted 2 years ago by UnevenDolphin73

@VivaciousBadger56

Is the idea here the following? You want to use inversion-of-control such that I provide a function f to a component that takes the above dict as an input. Then I can do whatever I like inside the function f and return a different dict as output. If the output dict of f changes, the component is rerun; otherwise, the old output of the component is used?

Yes, exactly! This way you can say "the code stayed the same", i.e. either ignore the code when you compare/hash previous steps, or use a string that represents the change and is invariant to refactoring (not sure how one would do that; if I remember correctly this is NP-complete 😛). Anyhow, just to explain how it works: the newly returned dict is hashed, and that hash is used to look for previous runs, hence cached execution. Make sense?

  				
Posted 2 years ago by AgitatedDove14

Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase

No, you can specify a different code base with the repo argument.

The component code still needs to be self-composed (or, function component can also be quite complex)

Well, it can address the additional repo (it will be automatically added to the PYTHONPATH), and you can add auxiliary functions (as long as they are part of the initial pipeline script) by passing them to helper_functions.

Decorators do not allow any dynamic build, because you must know how the components are connected at decoration time

Well, this is like any other Python code: you define the functions before you use them, but you do not have to use them (the pipeline logic itself drives that). Like any other Python code, if you do not call a (decorated) function, it will not be executed.

With that said, it could be that the provided examples are overly simplistic.

For sure!

check results before deciding to continue, ... have adjustable loops and parallelization depending on arguments

Rephrased as:

    X_train, X_test, y_train, y_test, some_value = step_two(data_frame)

    if int(some_value) > 1337:
        print("this is something special here, let's train another model")
        model = step_four(X_train*2, y_train*2)
    else:
        print('launch step three')
        model = step_three(X_train, y_train)

This code is executed just like a regular Python function, only the return values are deferred. When the code casts to int (done explicitly here so it is easier to see), execution waits for the function (component) to complete on another machine, fetches the return value, tests the result, and decides what to do.
Does that make sense? Basically, Python execution across multiple nodes in a transparent way (at scale).
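The deferred-return behaviour can be imitated locally with a small sketch. This is not how ClearML implements it: the `component` decorator and the `Deferred` class below are hypothetical stand-ins, and a thread pool plays the role of the remote machines.

```python
import time
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)  # stands in for remote agents


class Deferred:
    """Placeholder for a step result that may not be computed yet."""

    def __init__(self, future):
        self._future = future

    def result(self):
        # Blocks until the step has finished
        return self._future.result()

    def __int__(self):
        # Casting is what forces the caller to wait
        return int(self.result())


def component(func):
    """Hypothetical decorator: launch the step in the background and return at once."""
    def launch(*args):
        return Deferred(_pool.submit(func, *args))
    return launch


@component
def step_two(seed):
    time.sleep(0.05)  # pretend this is slow remote work
    return seed * 2


@component
def step_three():
    return "regular model"


@component
def step_four():
    return "special model"


def pipeline_logic(seed):
    some_value = step_two(seed)   # returns immediately
    if int(some_value) > 1337:    # waits here for step_two to finish
        model = step_four()
    else:
        model = step_three()
    return model.result()
```

Calling `pipeline_logic(1000)` takes the `step_four` branch and `pipeline_logic(1)` takes the `step_three` branch; a step that is never called is never launched.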

  				
Posted 2 years ago by AgitatedDove14

What does

The component code still needs to be self-composed (or, function component can also be quite complex)

Well it can address the additional repo (it will be automatically added to the PYTHONPATH), and you can add auxiliary functions (as long as they are part of the initial pipeline script), by passing them to helper_functions

mean? Is it not possible that I call code that is somewhere else on my local computer and/or in my code base? That makes things a bit complicated if my current repository is not somehow available to the agent.

  				
Posted 2 years ago by VivaciousBadger56

Ah, you meant “free python code” in that sense. Sure, I see that. The repo arguments also exist for functions though.

Sorry for hijacking your thread, @VivaciousBadger56

  				
Posted 2 years ago by UnevenDolphin73

Also full disclosure - I'm not part of the ClearML team and have only recently started using pipelines myself, so all of the above is just learnings from my own trials 😅

  				
Posted 2 years ago by UnevenDolphin73

@UnevenDolphin73, you wrote

Well, I would say then that in the second scenario it’s just rerunning the pipeline, and in the third it’s not running it at all

(I mean, in both the code may have changed, the only difference is that in the second case you’re interested in measuring it, and in the third, you’re not, so it sounds like a user-specific decision).

Well, I would hope that in the second scenario step A is not rerun. Yes, in the third scenario, nothing is rerun. Your text in parentheses is correct.

In any case, while I understand now what Martin meant, I still feel the function-based pipelines are the strongest option, because it sounds like you’re looking for a way to dynamically build your pipeline.

I think you might be mistaken, because @AgitatedDove14 referred to the decorator approach at the point where he rewrote the code. In terms of creating dynamic pipelines and cyclic graphs, the decorator approach seems the most powerful to me.

  				
Posted 2 years ago by VivaciousBadger56

No these are 3 different ways of building pipelines.

That is what I meant to say 🙂, sorry for the confusion, @AgitatedDove14.

@UnevenDolphin73, your point is a strong one. What are clear situations in which pipelines can only be built from tasks, and not in one of the other ways? One idea would be if the tasks are created from all kinds of (more or less) unrelated projects, where the code that describes the pipeline does not have access to the code of (some of) the tasks. Is that a valid scenario, where one has to use "pipelines from tasks"?

What are clear points where one would not be able to use one or multiple of the three approaches?

  				
Posted 2 years ago by VivaciousBadger56

I'm not sure how the decorators achieve that; from the available examples and trials I've done, it seems that:

Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase
The component code still needs to be self-composed (or, function component can also be quite complex)
Decorators do not allow any dynamic build, because you must know how the components are connected at decoration time

With that said, it could be that the provided examples are overly simplistic. At the moment I do not see how one can check results before deciding to continue, ... have adjustable loops and parallelization depending on arguments; at least the latter half is more easily doable with the non-decorator approach.

Can you provide a more realistic code example, @AgitatedDove14? I'd love to simplify and extend our usage of pipelines, but I haven't seen this functionality at all.

  				
Posted 2 years ago by UnevenDolphin73

Also, creating from functions allows dynamic pipeline creation without requiring the tasks to pre-exist in ClearML, which is IMO the strongest point to make about it

  				
Posted 2 years ago by UnevenDolphin73

What is helper_functions good for?
I do not find any example. The description says

By default the pipeline step function has no access to any of the other functions, by specifying additional functions here, the remote pipeline step could call the additional functions.

Does this mean that within component or add_function_step I cannot use any code of my current directory's code base, only code from external packages that are imported - unless I add my code with helper_functions?

  				
Posted 2 years ago by VivaciousBadger56

@AgitatedDove14 Is it true that, when using the "pipeline from tasks" approach, the Python environment in which the pipeline is programmed does not need to know any of the code with which the tasks have been programmed, and the respective pipeline would still be executed just fine?

  				
Posted 2 years ago by VivaciousBadger56

The first scenario is your standard "the code stays the same, the configuration changes" for the second step. Here, I want
The second and third scenarios are "the configuration stays the same, the code changes". This is the case, e.g., if code is refactored but effectively does the same as before.

@UnevenDolphin73, you wrote

About the third scenario I'm not sure. If the configuration has changed, shouldn't the relevant steps (the ones where the configuration changed and their dependent steps) be rerun?

I think this is a misunderstanding of my scenario.

In the second scenario I want a rerun, in the third not. For example,

in the second scenario, I might not have changed the results of the step, but my refactoring changed the speed considerably and this is something I measure.
in the third scenario, I might not have changed the results of the step and my refactoring just cleaned the code, but besides that, nothing substantial was changed. Thus I do not want a rerun.

@AgitatedDove14, you wrote

Scenario 1 & 2 are essentially the same from a caching perspective (the fact that B != B' means they have different caching hashes, but in both cases they are cached).
Scenario 3 is basically removing the cache flag from those components.

I am not sure, but I think this is exactly not what I meant 😄. Scenario 3 is lenient regarding when to reuse old results. Did my explanation in this post clarify what I meant?

  				
Posted 2 years ago by VivaciousBadger56

Last point on component caching, what I suggest is actually providing users the ability to control the cache "function". Right now (a bit simplified but probably accurate), this is equivalent to hashing of the following dict:

{"code": "code here", "container": "docker image", "container args": "docker args", "hyper-parameters": "key/value"}

We could allow users to add a function that gets this dict and returns a new dict that will be used for hashing. This would enable removing or changing fields, like ignoring the code or some of the arguments, and adding new custom fields.

@AgitatedDove14: Is the idea here the following? You want to use inversion-of-control such that I provide a function f to a component that takes the above dict as an input. Then I can do whatever I like inside the function f and return a different dict as output. If the output dict of f changes, the component is rerun; otherwise, the old output of the component is used?

I would like to add, but maybe this is what you meant all along:

It would be great if you could search - among previously executed tasks - for a task which has the same f-output as my component's f-output and use that old task's result; then, no new task is created from the component definition. Only if you cannot find such a task is the component rerun as a new task. In other words, f is like a query for a task.

This would be an awesome and pretty streamlined feature. I like it not only because of its flexibility, but because you could get rid of other caching rules. I like it a lot when an idea/concept is more general than other concepts and, because of its generality, also makes those other concepts unnecessary.
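Here is how I picture "f as a query", as a sketch in plain Python. Everything in it is hypothetical: `TaskStore` is an in-memory stand-in for the previously executed tasks, and `fingerprint` is just one assumed way to hash the f-output.

```python
import hashlib
import json


def fingerprint(fields: dict) -> str:
    """Stable hash of the dict returned by f."""
    return hashlib.sha256(json.dumps(fields, sort_keys=True).encode("utf-8")).hexdigest()


class TaskStore:
    """In-memory stand-in for the set of previously executed tasks."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def run_or_reuse(self, description: dict, f, execute):
        key = fingerprint(f(description))
        if key not in self._results:
            # No earlier task matches the query, so the component really runs
            self.executions += 1
            self._results[key] = execute()
        return self._results[key]


def only_parameters(description: dict) -> dict:
    """Example f: only the hyper-parameters decide whether a task counts as the same."""
    return {"hyper-parameters": description["hyper-parameters"]}


store = TaskStore()
old = {"code": "v1", "hyper-parameters": {"b": 1}}
refactored = {"code": "v2", "hyper-parameters": {"b": 1}}
reconfigured = {"code": "v2", "hyper-parameters": {"b": 2}}

first = store.run_or_reuse(old, only_parameters, lambda: "result for b=1")
second = store.run_or_reuse(refactored, only_parameters, lambda: "never computed")
third = store.run_or_reuse(reconfigured, only_parameters, lambda: "result for b=2")
```

The refactored component finds the old task through its f-output and reuses the result; only the changed configuration triggers a new execution.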

  				
Posted 2 years ago by VivaciousBadger56

@AgitatedDove14, you wrote

Components anyway need to be available when you define the pipeline controller/decorator, i.e. same codebase

No, you can specify a different code base

Is the code in this "other" repo downloaded to the agent's machine? Or is the component's code pushed to the machine on which the repository is?
If the second case is true: How is the other machine (on which the other repo lies) turned into an agent?

  				
Posted 2 years ago by VivaciousBadger56

I am writing quite a bit of documentation on the topic of pipelines. I am happy to share the article here, once my questions are answered and we can make a pull request for the official documentation out of it.

Amazing, please share once done, I will make sure we merge it into the docs!

Does this mean that within component or add_function_step I cannot use any code of my current directory's code base, only code from external packages that are imported - unless I add my code with helper_functions?

Yes, I'll try to improve the docstring there.

It is important to realize that each decorated function will end up packaged in a separate script file, and that script file will be running on the remote machine.
To the above script you can add a repo, so that the script file is running inside the repo.
But let's assume that in the first script we want more than just the decorated function. Aha! We add the additional functions in the helper_functions argument, and these functions will also be part of the standalone script file with our component. Does that make sense, @VivaciousBadger56?
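A toy illustration of the standalone-script idea, in plain Python without ClearML. The packaging shown here is an assumption for the sake of the example (sources are kept as strings and simply concatenated); `build_standalone_script` and `run_standalone` are made-up names.

```python
# Source of the "decorated" step, which calls a function defined elsewhere
STEP_SOURCE = '''
def step(x):
    return normalize(x) + 1
'''

# Source of a helper that lives next to the step in the pipeline script
HELPER_SOURCE = '''
def normalize(x):
    return x / 10
'''


def build_standalone_script(step_source, helper_sources=()):
    """Concatenate the helper sources and the step source into one script."""
    return "\n".join([*helper_sources, step_source])


def run_standalone(script, *args):
    """Execute the script in a fresh namespace, as a separate process would."""
    namespace = {}
    exec(script, namespace)
    return namespace["step"](*args)


try:
    run_standalone(build_standalone_script(STEP_SOURCE), 20)
    missing_helper = False
except NameError:
    # The step alone has no access to normalize
    missing_helper = True

with_helper = run_standalone(build_standalone_script(STEP_SOURCE, [HELPER_SOURCE]), 20)
```

Without the helper the step fails with a NameError; once the helper is packaged into the same script, the call works.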

If I do not build a package out of my local repository/project , I cannot reference anything

No need to build a package from the repo, just pass it as the repo argument.
So for example:

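from clearml import TaskTypes
from clearml.automation.controller import PipelineDecorator

# repo: URL of the repository the component should run in
@PipelineDecorator.component(return_values=['accuracy'], cache=True, task_type=TaskTypes.qc, repo="<repo URL>")
def step_four(model, X_data, Y_data):
    print("yey")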

What will happen is that the agent will pull the repo into a target folder (say ~/code), then it will create a new file called "step_four.py" and add it to the same ~/code folder.
Then it will run something like cd ~/code && PYTHONPATH=~/code python step_four.py
Make sense?
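The same sequence can be tried locally with a small sketch. It only imitates the steps described above and is not the agent's code: a temporary directory stands in for ~/code, the "repo" is a dict of files instead of a git clone, and `mylib` is a made-up package name.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step_in_repo(repo_files, step_script):
    """Write the repo files and the step script into one folder, then run the script there."""
    with tempfile.TemporaryDirectory() as code_dir:  # plays the role of ~/code
        for relative_path, content in repo_files.items():  # the "pull" of the repo
            target = Path(code_dir, relative_path)
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(content)
        Path(code_dir, "step_four.py").write_text(step_script)
        # Equivalent of: cd ~/code && PYTHONPATH=~/code python step_four.py
        env = dict(os.environ, PYTHONPATH=code_dir)
        completed = subprocess.run(
            [sys.executable, "step_four.py"],
            cwd=code_dir, env=env, capture_output=True, text=True, check=True,
        )
        return completed.stdout.strip()


repo = {
    "mylib/__init__.py": "",
    "mylib/metrics.py": "def accuracy():\n    return 0.9\n",
}
script = "from mylib.metrics import accuracy\nprint(accuracy())\n"
output = run_step_in_repo(repo, script)
```

Because the script runs from the repo root, the step can import `mylib` directly without the repo being built into a package.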

  				
Posted 2 years ago by AgitatedDove14

In terms of creating dynamic pipelines and cyclic graphs, the decorator approach seems the most powerful to me.

Yes that is correct, the decorator approach is the most powerful one, I agree.

  				
Posted 2 years ago by AgitatedDove14

@UnevenDolphin73: No, I love it ❤. Now, I just have to read everything 😄.

  				
Posted 2 years ago by VivaciousBadger56

I guess it depends on what you'd like to configure.
Since we let the user choose parents, component name, etc - we cannot use the decorators. We also infer required packages at runtime (the autodetection based on import statements fails with a non-trivial namespace) and need to set that to all components, so the decorators do not work for us.

  				
Posted 2 years ago by UnevenDolphin73

@AgitatedDove14: In general: If I do not build a package out of my local repository/project, I cannot reference anything from the local project/repository directly, right? I must make a package out of it, or I must reference it with the repo argument, or I must reference the respective functions using the helper_functions argument. Did I get this right?

  				
Posted 2 years ago by VivaciousBadger56

in the second scenario, I might have not changed the results of the step, but my refactoring changed the speed considerably and this is something I measure.
in the third scenario, I might have not changed the results of the step and my refactoring just cleaned the code, but besides that, nothing substantially was changed. Thus I do not want a rerun.

Well, I would say then that in the second scenario it’s just rerunning the pipeline, and in the third it’s not running it at all 😄
(I mean, in both the code may have changed, the only difference is that in the second case you’re interested in measuring it, and in the third, you’re not, so it sounds like a user-specific decision).

In any case, while I understand now what Martin meant, I still feel the function-based pipelines are the strongest option, because it sounds like you’re looking for a way to dynamically build your pipeline.

  				
Posted 2 years ago by UnevenDolphin73

mean? Is it not possible that I call code that is somewhere else on my local computer and/or in my code base? That makes things a bit complicated if my current repository is not somehow available to the agent.

I guess you can ignore this argument for the sake of simple discussion. If you need access to extra files/functions, just make sure you point the repo argument to their repo, and the agent will make sure your code is running from the repo root, with all the repo files under it. Make sense ?

  				
Posted 2 years ago by AgitatedDove14

Heh, my bad, the term "user" is very much ingrained in our internal way of working. You can think of it as basically any technically-inclined person in your team or company.

Indeed, the options in the WebUI are too limited for our use case, so we've developed "apps" that take a yaml configuration file and build a matching pipeline.
With that, our users do not need to code directly, and we can offer much more fine control over the pipeline.

As for the imports, what I meant is that I encountered an issue with pigar (I think that's the name of the package) - a package that ClearML uses to infer the requirements for each component automagically.
In our case, we have a monorepo with a common namespace for all the modules. Pigar fails to properly identify those, so our Pipeline Generator© captures the packages available at runtime and forces all components to have the same requirements.
This results in a bit slower startup time, of course.
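The capture step is along these lines. This is a simplified sketch using only the standard library, not our actual generator, and `installed_requirements` is a made-up name.

```python
from importlib import metadata


def installed_requirements():
    """Return pinned 'name==version' lines for every distribution installed at runtime."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs that report no name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


requirements = installed_requirements()
```

The resulting list can then be handed to every component as its requirements (for example through a packages-style argument, if the API in use offers one), instead of relying on import-based autodetection.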

  				
Posted 2 years ago by UnevenDolphin73

@UnevenDolphin73: I am not sure who you mean by "user"? I am not aware that we are building an app... 😄 Do you mean a person that reruns the entire pipeline but with different parameters from the Web UI? But there, we are not able to let the "user" configure all those things.

Is there some other way - that does not require any coding - to build pipelines (I am not aware)?

Also, when I build pipelines via tasks, the (same) imports had to be done in each task as well.

  				
Posted 2 years ago by VivaciousBadger56

@UnevenDolphin73: A big point for me is to reuse/cache those artifacts/datasets/models that need to be passed between the steps, but have been produced by colleagues' executions at some earlier point. So for example, let the pipeline be A(a) -> B(b) -> C(c), where A, B, C are the steps and their code, excluding configurations/parameters, and a, b, c are the configurations/parameters. Then I might have the situation that my colleague ran the pipeline A(a) -> B(b) -> C(c).

Scenario 1: I run A(a) -> B(b') -> C(c) and I want that A(a) is not rerun, but its result reused/cached and only B(b') -> C(c) is run.
Scenario 2: I run A(a) -> B'(b) -> C(c) and I want that A(a) is not rerun, but its result reused/cached and only B'(b) -> C(c) is run.
Scenario 3: I run A(a) -> B'(b) -> C(c) and I want that nothing is rerun. Here, I want only changes to the configuration, but not to the code, to be considered.

Which of the pipelines can be used for which scenario?
(Yes, these are not academic hypotheticals, I have those cases in real life.)

  				
Posted 2 years ago by VivaciousBadger56

Answers 31