looks promising. couple of questions:
wdym 'executed on different machines'? is there an mlclearish way of running a pipeline, ie something instead of implementing my own run method? i did my own run because i wanted to organize each pipeline into its own experiment folder and skip stages if they had already run, but it feels hacky and you folks prolly have a better way of doing this
wdym 'executed on different machines'?
The assumption is that you have machines (i.e. clearml-agents) connected to clearml, which would be running all the different components of the pipeline. Think out-of-the-box scale-up. Each component will become a standalone Job and the data will be passed (i.e. stored and loaded) automatically on the clearml-server (can be configured to be external object storage as well). This means if you have a step that needs GPU it will be launched on a GPU machine vs steps that are cpu/logic. Make sense?
is there an mlclearish way of running a pipeline, ie something instead of implementing my own run method? i
What do you mean by "i did my own run because i wanted"? Maybe a few clearml examples would help?
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
Does that help?
Thanks ContemplativePuppy11 !
How would you pass data/args from one step of the pipeline to another?
Or are you saying the pipeline class itself stores all the components?
Hi ContemplativePuppy11
This is a really interesting point.
Maybe you can provide a pseudo class abstract of your current pipeline design, this will help in trying to understand what you are trying to achieve and how to make it easier to get there
AgitatedDove14 currently we use mlflow in some custom code to log and load artifacts
sure AgitatedDove14 . boiled down my pipeline into bare bones functionality and one file
ContemplativePuppy11
yes, nice move. my question was to make sure that the steps are not run in parallel because each one builds upon the previous one
if they are "calling" one another (or passing data) then the pipeline logic will deduce they cannot run in parallel 🙂 basically it is automatic
so my takeaway is that if the funcs are class methods the decorators wont break, right?
In theory, but the idea of the decorator is that it tracks the return value so it "knows" how to pass the data between the functions (i.e. pass the reference to the data that is actually being stored as an artifact). This same mechanism allows it to know which function depends on which output of another function. This means that instantiating a class will actually be less efficient, and in practice might not work. Does that make sense?
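To make the "tracks the return value" idea concrete, here is a toy sketch in plain Python. It is not ClearML's implementation and uses no ClearML API: `component`, `StepOutput` and `DEPENDENCIES` are hypothetical names made up for illustration. Each decorated function returns a reference object instead of the raw value, and any step that receives such a reference is recorded as depending on the step that produced it.

```python
import functools

# Toy illustration only; this is NOT how ClearML is implemented internally.
DEPENDENCIES = {}  # step name -> names of the steps whose outputs it consumed


class StepOutput:
    """Stands in for a reference to a stored artifact."""

    def __init__(self, producer, value):
        self.producer = producer  # name of the step that produced the value
        self.value = value        # the actual data


def component(func):
    """Hypothetical decorator: record which steps feed the wrapped step."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        parents = set()

        def unwrap(arg):
            # An argument that came out of another step marks a dependency.
            if isinstance(arg, StepOutput):
                parents.add(arg.producer)
                return arg.value
            return arg

        real_args = [unwrap(a) for a in args]
        real_kwargs = {k: unwrap(v) for k, v in kwargs.items()}
        DEPENDENCIES[func.__name__] = parents
        # Wrap the result so downstream steps can be traced back to this one.
        return StepOutput(func.__name__, func(*real_args, **real_kwargs))

    return wrapper


@component
def load_data(path):
    return [1, 2, 3]


@component
def train(data):
    return sum(data)


@component
def report(model, data):
    return f"model={model} on {len(data)} rows"


def run_pipeline():
    data = load_data("dummy.csv")
    model = train(data)
    return report(model, data)
```

After `run_pipeline()`, `DEPENDENCIES` shows that `train` consumed the output of `load_data`, so a scheduler built on this information would know the two cannot run in parallel, while steps with no shared inputs could.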
This means if you have a step that needs GPU it will be launched on a GPU machine vs steps that are cpu/logic. Make sense?
yes, nice move. my question was to make sure that the steps are not run in parallel because each one builds upon the previous one
Maybe a few clearml examples would help?
i'd checked out that file but now with your explanation it is clear to me how to do it. so my takeaway is that if the funcs are class methods the decorators wont break, right? i had a problem once with another library and just wanted to be sure (i think it had to do with the whole class having to be serialized and not only the method)
each child of `Pipeline` is a self contained pipeline, eg `ModelPipeline`. each step of the pipeline is a method, the order being set in the attribute array `stage_handler_mapping`. in the mlflow ui each stage, i.e. each method's results, is represented as a run within a fixed experiment
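For readers without access to the file, the design described in this thread could look roughly like the sketch below. It is a guess at the shape, not the actual code: apart from `Pipeline`, `ModelPipeline` and `stage_handler_mapping`, every name (`experiment_dir`, `executed`, the stage methods, the JSON output files) is an assumption, and mlflow logging is left out entirely. It keeps each pipeline in its own experiment folder and skips stages whose output already exists.

```python
import json
from pathlib import Path


class Pipeline:
    # Ordered list of (stage name, handler method name); set by subclasses.
    stage_handler_mapping = []

    def __init__(self, experiment_dir):
        # Hypothetical layout: one folder per pipeline run / experiment.
        self.experiment_dir = Path(experiment_dir)
        self.experiment_dir.mkdir(parents=True, exist_ok=True)
        self.executed = []  # stages actually computed in this run

    def run(self, initial_input):
        data = initial_input
        for stage, handler in self.stage_handler_mapping:
            out_file = self.experiment_dir / f"{stage}.json"
            if out_file.exists():
                # Stage already ran earlier: reuse its stored output.
                data = json.loads(out_file.read_text())
                continue
            data = getattr(self, handler)(data)
            out_file.write_text(json.dumps(data))
            self.executed.append(stage)
        return data


class ModelPipeline(Pipeline):
    stage_handler_mapping = [("preprocess", "preprocess"), ("train", "train")]

    def preprocess(self, raw):
        return [x * 2 for x in raw]

    def train(self, features):
        return {"weight": sum(features) / len(features)}
```

Running `ModelPipeline(folder).run([1, 2, 3])` twice against the same folder computes both stages the first time and loads both from disk the second time. The hand-written skip logic here is the part that a built-in step cache in a pipeline framework would replace.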
That makes sense to me, what do you think about the following:
```
from clearml import PipelineDecorator

class AbstractPipeline(object):
    def __init__(self):
        pass

    @PipelineDecorator.pipeline(...)
    def run(self, run_arg):
        data = self.step1(run_arg)
        final_model = self.step2(data)
        self.upload_model(final_model)

    @PipelineDecorator.component(...)
    def step1(self, arg_a):
        # do something
        return value

    @PipelineDecorator.component(...)
    def step2(self, arg_b):
        # do something
        return value
```
This would mean steps 1/2 are executed on different machines, where the data passed between them is automatically serialized. It also allows you to build the actual logic in `def run` that drives the different components.
wdyt?
I think my question is more about design, is a ModelPipeline class a self contained pipeline? (i.e. containing all the different steps or is it a single step in a pipeline)