Hi All! I Have A Question About Pipelines. My Pipeline Consists Of Several Steps:

Hi all! I have a question about pipelines. My pipeline consists of several steps:
1. Run some computations and generate an image of a confusion matrix.
2. Pass the image path to the next step, where I read this image from the filesystem and push it to S3.
Has anyone tried to share resources between pipeline functional_steps?

  
  
Posted 2 years ago
28 Answers


Yes, that's what I found; otherwise ClearML won't be able to see the function at execution time. I think it would be great to have such a possibility, because a step can be constructed from multiple sub-components, but not all of them might be added to the UI graph. Some of them are just helper functions that make the code more readable.

  
  
Posted 2 years ago

Ohh, ClearML is designed so that you should not worry about that: download_dataset = StorageManager.get_local_copy() is cached, meaning a machine that runs that line a second time will not re-download the data.
This means step 1 is redundant, no?
Usually when data is passed between components it is automatically uploaded as an artifact to the Task (stored on the files server, object storage, etc.), then downloaded and passed to the next steps.
How large is the data that you are worried about in terms of download time?
How are you spinning up the agents? Usually when there are a lot of them, there is a shared cache folder, which means that when one downloads the data the others do not need to re-download it.
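(For reference, a minimal sketch of the caching behavior described above, using the public StorageManager.get_local_copy API; the S3 URL is a placeholder.)

```
from clearml import StorageManager

# First call downloads the object into the local ClearML cache folder;
# running the same line again on the same machine (or on a machine sharing
# that cache folder) returns the cached copy without re-downloading.
local_path = StorageManager.get_local_copy(
    remote_url="s3://my-bucket/datasets/images.zip"  # placeholder URL
)
print(local_path)
```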

  
  
Posted 2 years ago

GrotesqueDog77 when you say "the second issue", do you mean the fact that both step 1 and step 2 should have access to the same filesystem?

  
  
Posted 2 years ago

Sounds great! I really like that approach, thanks GrotesqueDog77 !

  
  
Posted 2 years ago

Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?

  
  
Posted 2 years ago

AgitatedDove14 thanks for the link, but I need a different thing.
In step 1 of the pipeline I download images from S3 (many of them) and want to return the paths. In step 2 of the pipeline I read the images from those paths. Here is pseudocode:

```
def step_one():
    download_dataset = StorageManager.get_local_copy()
    paths = collect_paths_as_strings()
    return paths

def step_two(paths):
    image_1 = read_image(paths[0])
```

  
  
Posted 2 years ago

Yes, but I'm not sure that they need to have a separate Task. In my opinion, it would be better if they are visible in the UI but all the metrics/artifacts are reported to the step's Task.

  
  
Posted 2 years ago

Hi AgitatedDove14, storage.
Step 1 of the pipeline generates a file; step 2 of the pipeline reads the file generated in step 1.

  
  
Posted 2 years ago

I agree, a lot of packages have to be installed before I can execute any command, but having something like "sub-nodes" inside a pipeline, in my opinion, makes pipelines much more useful, in the sense that all the steps are visible. I haven't used pipelines before, and when I saw this UI I thought it would be very cool to highlight the execution steps.

  
  
Posted 2 years ago

If this is the case, then you have to set up a shared PV for the pods; this way they can actually have a persistent cache, which would also be shared.
BTW: a single function call might not be a perfect match for a pipeline component; the overhead of starting a node might not be negligible, as it needs to install the required Python packages, bring over the code, etc.

  
  
Posted 2 years ago

From my experience with pipelines so far and the "sub-node" idea, I would say:
1. Keep the pipeline controller with the possibility to define where to run the whole pipeline (same node/pod).
2. Every step can be pushed to be executed on a different pod.
3. Every step is a Task, but a step can consist of multiple functions which are "sub-nodes", and they must be executed on the same pod/node where the functional_step is defined.
As a result, if the pipeline requires sharing large files, set the pipeline to run on the same pod/node; if you want some steps to be executed on other pods where a Docker container should be used, mark them.
Showing "sub-nodes" per Task in the functional_step of the pipeline UI lets you transparently see everything -> it will be used more often; everyone likes a pretty and clear UI 🙂

  
  
Posted 2 years ago

yes

  
  
Posted 2 years ago

The intuition is: I care about the step result, and I also care what the sub-steps inside the step are.

Example: the step is evaluate model, which consists of a dataset + a model. I need the sub-steps
download dataset, download models, evaluate.
I do not really care what the sub-steps' metrics will contain, but I care what is stored in the evaluate model step. It will make everything compact and easily accessible.

  
  
Posted 2 years ago

If you take a look here, the returned objects are automatically serialized and stored on the files server or object storage, and also deserialized when passed to the next step.
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py

You can of course do the same manually

  
  
Posted 2 years ago

"sub nodes" inside pipeline, in my opinion, makes them much more useful, in sense that all the steps are visible.

Yeah, I really like this idea... Continuing this thread, would it also make sense to have a Task object per "sub-node" and run the sub-nodes as subprocesses of the parent node? I'm thinking this sounds like a combination of both local pipeline execution and remote pipeline execution.
wdyt?

  
  
Posted 2 years ago

GrotesqueDog77 one issue with this design: in order to run a sub-component, the call must be done from the parent component, does that make sense?

```
def step_one(data):
    return data

def step_two(path):
    return model  # pseudocode: some model produced from 'path'

def both_steps():
    path = step_one("stuff")
    return step_two(path)

def pipeline():
    both_steps()
```
This would make both_steps a component, and step_one and step_two sub-components.
wdyt?
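(A hedged sketch of what is possible with today's decorator API, where both_steps is the only real component and the helpers stay as plain nested functions; separate UI nodes for the sub-components are the hypothetical part of this thread. Names and values below are placeholders.)

```
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["model"])
def both_steps(data):
    # Helpers are defined inside the component so the agent can see them at
    # execution time; they are not separate Tasks and do not appear as nodes.
    def step_one(data):
        return data

    def step_two(path):
        return {"weights": str(path)}  # placeholder "model"

    path = step_one(data)
    return step_two(path)

@PipelineDecorator.pipeline(name="both steps example", project="examples", version="0.0.1")
def run_pipeline():
    print(both_steps("stuff"))

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # run in this process for a quick test
    run_pipeline()
```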

  
  
Posted 2 years ago

My reasoning is that pipelines can give me a good visual overview of what is going on, and I want to have a lot of small steps. My dataset is 2 GB of images, and I want to have a step where I download it with StorageManager.get_local_copy(), save it, and pass only the path to this dataset to the next steps. But every agent is a different pod, so I do not know how to properly share the folder with images.

  
  
Posted 2 years ago

Sounds good to me; adding it to the to-do list. It probably should not be very complicated to add 🙂

  
  
Posted 2 years ago

yes

  
  
Posted 2 years ago

🤔 Maybe we should have "sub-nodes" as just visual functions running inside the same actual pipeline component?

  
  
Posted 2 years ago

AgitatedDove14 maybe you have an idea how to deal with the second issue? Because this is exactly what I want to get 🙂

  
  
Posted 2 years ago

AgitatedDove14 Yeah, you are right: since a sub-component is not a Task, the caching won't work. But it is the step result that's important, so if the step cache is available I think it should cover the majority of pipeline use cases.

  
  
Posted 2 years ago

But every agent is a different pod, so I do not know how to properly share the folder with images.

Can I conclude that Kubernetes is running the agents?

  
  
Posted 2 years ago

GrotesqueDog77 this should just work: decorate the functions with @PipelineDecorator.component and call them one after the other:
paths = step_one()
step_two(paths)
ClearML will make sure it serializes the strings and passes them to step two (of course step two should actually run on a machine with access to the same folder, but that is another issue 🙂).
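(A minimal runnable sketch of this suggestion, assuming the decorator-based pipeline API; the bucket URL, project name, and glob pattern are placeholders.)

```
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["paths"], cache=True)
def step_one():
    # Download the dataset once (cached afterwards) and return the file paths;
    # the returned list is stored as an artifact of this step's Task.
    import glob
    import os
    from clearml import StorageManager
    local_copy = StorageManager.get_local_copy(remote_url="s3://my-bucket/images.zip")  # placeholder URL
    return sorted(glob.glob(os.path.join(local_copy, "*.png")))

@PipelineDecorator.component(return_values=["n_images"])
def step_two(paths):
    # 'paths' arrives deserialized from step_one's artifact; actually opening the
    # files still requires access to the same filesystem or a shared cache.
    return len(paths)

@PipelineDecorator.pipeline(name="images pipeline", project="examples", version="0.0.1")
def run_pipeline():
    paths = step_one()
    print(step_two(paths))

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # run everything in this process for a quick test
    run_pipeline()
```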

  
  
Posted 2 years ago

Yes, I think it's absolutely fine. Here is the pseudocode of my understanding, in ClearML-like syntax:

```
def complex_steps(args):
    # As far as I can see, the functions should be implemented inside the step
    # for ClearML to be able to see them

    @sub_node
    def action_1(params):
        ...
        return result

    @sub_node
    def action_2(params):
        ...
        return result

    @sub_node
    def action_3(params_1, params_2):
        ...
        return result

    act1_result = action_1(args.param1)
    act2_result = action_2(args.param2)
    return action_3(act1_result, act2_result)

# As a result, this function will appear in ClearML as:
#   complex_steps
# ----------------------------------

pipe_c = PipelineController(strategy=single_agent)
pipe_c.add_functional_step(complex_steps, strategy=pipe_c.agent, outputs=[path_to_big_dataset])
pipe_c.add_functional_step(complex_steps2, strategy=pipe_c.agent, func_kwargs={paths: "${complex_steps.path_to_big_dataset}"})
pipe_c.add_functional_step(complex_steps3, strategy=default)

# "default" strategy means: use a different pod for the step
```
I hope it makes some sense 🙂
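(For comparison, a rough sketch of roughly equivalent wiring with the existing PipelineController.add_function_step API; the single-agent "strategy" and sub-node behavior proposed above is hypothetical and not part of this API, and the function bodies, names, and values below are placeholders.)

```
from clearml import PipelineController

def complex_steps(param1, param2):
    # Plain helper functions can live inside this function body; they are not
    # separate Tasks and will not appear as nodes in the pipeline graph.
    return "/data/big_dataset"  # placeholder path produced by the step

def complex_steps2(paths):
    return len(str(paths))  # placeholder

pipe = PipelineController(name="example pipeline", project="examples", version="0.0.1")
pipe.add_function_step(
    name="complex_steps",
    function=complex_steps,
    function_kwargs=dict(param1=1, param2=2),
    function_return=["path_to_big_dataset"],
    cache_executed_step=True,
)
pipe.add_function_step(
    name="complex_steps2",
    function=complex_steps2,
    # reference the previous step's returned value by name
    function_kwargs=dict(paths="${complex_steps.path_to_big_dataset}"),
)
pipe.start_locally(run_pipeline_steps_locally=True)  # or pipe.start() to enqueue on agents
```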

  
  
Posted 2 years ago

because a step can be constructed with multiple sub-components, but not all of them might be added to the UI graph

Just to make sure I fully understand: when we decorate with @sub_node, we want that to also appear in the UI graph (and have its own Task / metrics etc.),
correct?

  
  
Posted 2 years ago

Yes, but I'm not sure that they need to have a separate task

Hmm, okay, I need to check if this can be easily done.
(BTW, the downside of that is that you can only cache a component, not a sub-component.)

  
  
Posted 2 years ago

Makes total sense!
Interesting, you are defining the sub-component inside the function. I like that; it makes the code closer to how it is executed!

  
  
Posted 2 years ago