From my experience with the pipeline so far and the "sub-node" idea, I would say:
Keep the pipeline controller, with the possibility to define where the whole pipeline runs (same node/pod).
Every step can be pushed to be executed on a different pod.
Every step is a Task, but a step can consist of multiple functions which are "sub-nodes", and they must be executed on the same pod/node where the functional_step is defined.
As a result, if the pipeline requires sharing large files, select the pipeline to run on the same pod/node; if you want some steps to be executed on other pods, where a docker container should be used, mark them.
Showing the "sub-nodes" per Task in the functional_step of the pipeline in the UI lets you see everything transparently -> it will be used more often, everyone likes a pretty and clear UI 🙂
Hi AgitatedDove14, storage.
Step 1 of the pipeline – generate a file
Step 2 of the pipeline – read the file generated at step 1
Yes, I think it is absolutely fine. Here is the pseudocode of my understanding with ClearML syntax:
```
def complex_steps(args):
    # As far as I can see, the functions should be implemented inside the step
    # for ClearML to be able to see them
    @sub_node
    def action_1(params):
        ...
        return result

    @sub_node
    def action_2(params):
        ...
        return result

    @sub_node
    def action_3(params_1, params_2):
        ...
        return result

    act1_result = action_1(args.param1)
    act2_result = action_2(args.param2)
    return action_3(act1_result, act2_result)

# As a result, this function will appear in ClearML as:
#   complex_steps |
#   ---------------------------------- |

pipe_c = PipelineController(strategy=single_agent)
pipe_c.add_functional_step(complex_steps, strategy=pipe_c.agent, outputs=[path_to_big_dataset])
pipe_c.add_functional_step(complex_steps2, strategy=pipe_c.agent, func_kwargs={"paths": "${complex_steps.path_to_big_dataset}"})
pipe_c.add_functional_step(complex_steps3, strategy=default)
# default strategy means: use a different pod for the step
```
I hope it does make some sense 🙂
Yes, that's what I found, otherwise ClearML won't be able to see this function at execution time. I think it would be great to have such a possibility, because a step can be constructed from multiple sub-components, but not all of them need to be added to the UI graph. Some of them are just helper functions which make the code more readable.
The intuition is: I care about the step result, and I also care about what the sub-steps in the step are.
Example: the step "evaluate model" consists of dataset + model. I need the substeps
download dataset
download models
evaluate
I do not really care what will be in the substeps' metrics, but I care what is stored in the "evaluate model" step. It will make everything compact and easily accessible.
because step can be constructed with multiple sub-components but not all of them might be added to the UI graph
Just to make sure I fully understand: when we decorate with @sub_node, we want that to also appear in the UI graph (and have its own Task / metrics etc.), correct?
Yes, but I'm not sure that they need to have separate task. In my opinion, it would be better if they are visible in the UI but all the metrics/artifacts are reported to the step Task
Makes total sense!
Interesting, you are defining the sub-component inside the function, I like that, this makes the code closer to how this is executed!
Ohh, ClearML is designed so that you should not worry about that, download_dataset = StorageManager.get_local_copy()
this is cached, meaning the machine that runs that line the second time will not re-download the data.
This means step 1 is redundant, no?
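As a rough illustration of why a cached download makes a separate "download" step less necessary, here is a minimal sketch of a URL-keyed local cache. This is not ClearML's implementation; `get_cached_copy`, `fake_download`, and the bucket URL are hypothetical and only show the general technique of fetching once and reusing the local copy.

```python
import hashlib
import tempfile
from pathlib import Path

download_calls = []  # records every real "download" so reuse is visible


def fake_download(url, target):
    """Stand-in for a real remote fetch; writes a small placeholder file."""
    download_calls.append(url)
    target.write_text(f"contents of {url}")


def get_cached_copy(url, cache_dir):
    """Return a local path for `url`, fetching only if it is not cached yet."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    target = cache_dir / key
    if not target.exists():
        fake_download(url, target)
    return target


with tempfile.TemporaryDirectory() as shared_cache:
    first = get_cached_copy("s3://bucket/images.zip", shared_cache)
    second = get_cached_copy("s3://bucket/images.zip", shared_cache)
    same_path = first == second
    cached_text = second.read_text()
```

The second call finds the file already in the cache folder and skips the fetch. If that folder is shared between machines, any of them benefits from the first download.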
Usually when data is passed between components it is automatically uploaded as an artifact to the Task (stored on the files server or object storage etc.), then downloaded and passed to the next steps.
How large is the data, that you are worried about download time?
How are you spinning up the agents? Usually when there are a lot of them, there is a shared cache folder, which means that when one downloads the data, the others do not need to re-download it.
My reasoning is that pipelines can give me a good visual overview of what is going on, and I want to have a lot of small steps. My dataset is 2 GB of images, and I want to have a step where I download it with StorageManager.get_local_copy(), save it, and pass only the path to this dataset to the next steps. But every agent is a different pod, so I do not know how to properly share the folder with images.
GrotesqueDog77 when you say "the second issue", do you mean the fact that both step 1 and step 2 should have access to the same filesystem?
"sub nodes" inside pipeline, in my opinion, makes them much more useful, in sense that all the steps are visible.
Yeah I really like this idea... continuing this thread, would it also make sense to have a Task object per "sub-node" and run the sub-nodes as subprocesses of the parent Node? I'm thinking this sounds like a combination of both local pipeline execution and remote pipeline execution.
wdyt?
I agree, a lot of packages should be installed before I can execute any command, but having something like "sub nodes" inside pipeline, in my opinion, makes them much more useful, in sense that all the steps are visible. I haven't used pipelines before, and when I saw this UI I was thinking it would be very cool to highlight the execution steps.
GrotesqueDog77 one issue with this design, in order to run a sub-component, the call must be done from the parent component, does that make sense?
```
def step_one(data):
    return data

def step_two(path):
    model = ...  # placeholder: build or load the model from `path`
    return model

def both_steps():
    path = step_one("stuff")
    return step_two(path)

def pipeline():
    both_steps()
```
Which would make `both_steps` a component and `step_one` and `step_two` sub-components
wdyt?
But every agent is a different pod so I do not know how properly share the folder with images.
Can I conclude that Kubernetes is running the agents?
Sounds great! I really like that approach, thanks GrotesqueDog77 !
AgitatedDove14 thanks for the link, but I need a different thing.
Step 1 of the pipeline: I download images from s3 (many of them) and want to return paths
Step 2 of the pipeline: read images from those paths
Here is a pseudocode
```
def step_one():
    download_dataset = StorageManager.get_local_copy()
    paths = collect_pathes_as_strings()
    return paths

def step_two(paths):
    image_1 = read_image(paths[0])
```
GrotesqueDog77 this should just work, decorate the functions with @PipelineDecorator.component
and call the functions one after the other: `paths = step_one()` then `step_two(paths)`
ClearML will make sure it serializes the strings and passes them to step two (of course step two should actually run on a machine with access to the same folder, but this is another issue 🙂 )
If you take a look here, the returned objects are automatically serialized and stored on the files server or object storage, and also deserialized when passed to the next step.
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
You can of course do the same manually
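Doing it manually could look roughly like the sketch below. It is plain Python, not ClearML API: the JSON file stands in for an uploaded artifact, and `artifact_dir`, the file name, and the image paths are all made up for the example.

```python
import json
import tempfile
from pathlib import Path


def step_one(artifact_dir):
    """Produce a list of path strings and store it as a JSON 'artifact'."""
    paths = ["/data/images/0001.png", "/data/images/0002.png"]  # made-up paths
    artifact = Path(artifact_dir) / "step_one_paths.json"
    artifact.write_text(json.dumps(paths))
    return str(artifact)


def step_two(artifact_path):
    """Load the artifact written by step one and use the first path."""
    paths = json.loads(Path(artifact_path).read_text())
    return paths[0]


with tempfile.TemporaryDirectory() as store:
    artifact_path = step_one(store)
    first_image = step_two(artifact_path)
```

Note that only the strings travel between the steps; the image files themselves still have to be reachable from wherever step two runs.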
AgitatedDove14 maybe you have an idea how to deal with the second issue? Because this is exactly what I want to get 🙂
Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?
AgitatedDove14 Yeah, you are right, since a sub-component is not a Task, the caching won't work for it. But it is the step result that's important, so if the step cache is available, I think it should cover the majority of pipeline use cases.
Yes, but I'm not sure that they need to have separate task
Hmm okay I need to check if this can be easily done
(BTW, the downside of that, you can only cache a component, not a sub-component)
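A small sketch of what "cache the component, not the sub-component" means in practice. This is an in-memory toy, not how ClearML stores cached steps; `cached_step`, `STEP_CACHE`, and the `preprocess` / `normalize` functions are hypothetical.

```python
import functools
import hashlib
import json

STEP_CACHE = {}  # hypothetical in-memory store: cache key -> step result
sub_calls = []   # records every execution of the helper (the "sub-component")


def cached_step(func):
    """Cache a whole step, keyed by its name and keyword arguments."""
    @functools.wraps(func)
    def wrapper(**kwargs):
        payload = json.dumps({"step": func.__name__, "kwargs": kwargs}, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in STEP_CACHE:
            STEP_CACHE[key] = func(**kwargs)
        return STEP_CACHE[key]
    return wrapper


def normalize(value):
    """Sub-component: runs inside the step and has no cache of its own."""
    sub_calls.append(("normalize", value))
    return value / 10


@cached_step
def preprocess(a, b):
    return normalize(a) + normalize(b)


first = preprocess(a=10, b=20)
second = preprocess(a=10, b=20)  # same arguments: served from the step cache
third = preprocess(a=10, b=30)   # one argument changed: the whole step reruns
```

In the third call `normalize(10)` runs again even though its input did not change, because only the step as a whole is cached.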
Sounds good to me, adding it to the to do list, probably should not be very complicated to add 🙂
🤔 maybe we should have "sub nodes" as just visual functions running inside the same actual pipeline component ?
If this is the case, then you have to set a shared PV for the pods, this way they can actually have a persistent cache, which would also be shared.
BTW: a single function call might not be a perfect match for a pipeline component, the overhead of starting a node might not be negligible, as it needs to install the required Python packages, bring the code, etc.