Answered

Hi all! I have a question about pipelines. My pipeline consists of several steps:
Run some computations and generate an image of the confusion matrix.
Pass the image path to the next step, where I read this image from the filesystem and push it to S3.
Has anyone tried to share resources between pipeline functional_steps?

  				
Posted 2 years ago by GrotesqueDog77

Answers 28

I agree, a lot of packages have to be installed before I can execute any command. But having something like "sub nodes" inside a pipeline would, in my opinion, make pipelines much more useful, in the sense that all the steps are visible. I haven't used pipelines before, and when I saw this UI I thought it would be very cool to highlight the execution steps.

  				
Posted 2 years ago by GrotesqueDog77

But every agent is a different pod so I do not know how properly share the folder with images.

Can I conclude that Kubernetes is running the agents?

  				
Posted 2 years ago by AgitatedDove14

Sounds good to me, adding it to the to do list, probably should not be very complicated to add 🙂

  				
Posted 2 years ago by AgitatedDove14

yes

  				
Posted 2 years ago by GrotesqueDog77

From my experience with pipelines so far and the "sub-node" idea, I would say:
Keep the pipeline controller, with the possibility to define where to run the whole pipeline (same node/pod).
Every step can be pushed to be executed on a different pod.
Every step is a Task, but a step can consist of multiple functions, which are "sub-nodes", and they must be executed on the same pod/node where the functional_step is defined.
As a result, if the pipeline requires sharing large files, select the pipeline to run on the same pod/node; if you want some steps to be executed on other pods, where a docker container should be used, mark them.
Showing the "sub-nodes" per Task in the functional_step of the pipeline in the UI lets you see everything transparently, so it will be used more often. Everyone likes a pretty and clear UI 🙂

  				
Posted 2 years ago by GrotesqueDog77

Ohh, ClearML is designed so that you should not have to worry about that. download_dataset = StorageManager.get_local_copy() is cached, meaning a machine that runs that line a second time will not re-download the data.
This means step 1 is redundant, no?
Usually, when data is passed between components, it is automatically uploaded as an artifact to the Task (stored on the files server, object storage, etc.), then downloaded and passed to the next steps.
How large is the data, that you are worried about the download time?
How are you spinning up the agents? Usually, when there are a lot of them, there is a shared cache folder, which means that when one downloads the data the others do not need to re-download it.
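
To illustrate the caching behavior described above, here is a minimal sketch in plain Python. It is not ClearML's implementation: `get_local_copy_cached`, `fake_download`, the placeholder URL, and the cache layout are all hypothetical names made up for illustration. The only point is that a local copy is keyed by the remote URL, so a second call on the same machine (or on any machine sharing the cache folder) skips the download.

```python
import hashlib
import tempfile
from pathlib import Path

download_count = 0  # counts how many simulated downloads actually happened


def fake_download(remote_url: str, target: Path) -> None:
    """Stand-in for a real S3 download (hypothetical helper)."""
    global download_count
    download_count += 1
    target.write_text(f"contents of {remote_url}")


def get_local_copy_cached(remote_url: str, cache_dir: Path) -> Path:
    """Return a local path for remote_url, downloading only on a cache miss."""
    key = hashlib.sha256(remote_url.encode()).hexdigest()
    local_path = cache_dir / key
    if not local_path.exists():
        fake_download(remote_url, local_path)
    return local_path


# A temporary directory plays the role of the (possibly shared) cache folder.
cache_dir = Path(tempfile.mkdtemp())
first = get_local_copy_cached("s3://bucket/images.zip", cache_dir)
second = get_local_copy_cached("s3://bucket/images.zip", cache_dir)
```

If the cache folder lives on storage that every agent can see, the second caller can be a different machine and the same logic would apply.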

  				
Posted 2 years ago by AgitatedDove14

If this is the case, then you have to set up a shared PV for the pods. This way they can actually have a persistent cache, which would also be shared.
BTW: a single function call might not be a perfect match for a pipeline component. The overhead of starting a node might not be negligible, as it needs to install the required Python packages, bring the code, etc.

  				
Posted 2 years ago by AgitatedDove14

GrotesqueDog77 one issue with this design: in order to run a sub-component, the call must be made from the parent component. Does that make sense?

```
def step_one(data):
    return data

def step_two(path):
    model = ...  # placeholder: train or load a model from `path`
    return model

def both_steps():
    path = step_one("stuff")
    return step_two(path)

def pipeline():
    both_steps()
```
This would make both_steps a component, and step_one and step_two sub-components.
wdyt?

  				
Posted 2 years ago by AgitatedDove14

because step can be constructed with multiple sub-components but not all of them might be added to the UI graph

Just to make sure I fully understand: when we decorate with @sub_node, we want that to also appear in the UI graph (and have its own Task / metrics etc.), correct?
  				
Posted 2 years ago by AgitatedDove14

🤔 maybe we should have "sub nodes" as just visual functions running inside the same actual pipeline component ?

  				
Posted 2 years ago by AgitatedDove14

Sounds great! I really like that approach, thanks GrotesqueDog77 !

  				
Posted 2 years ago by AgitatedDove14

Yes, that's what I found; otherwise ClearML won't be able to see this function at execution time. I think it would be great to have such a possibility, because a step can be constructed from multiple sub-components, but not all of them might be added to the UI graph. Some of them are just helper functions that make the code more readable.

  				
Posted 2 years ago by GrotesqueDog77

AgitatedDove14 Yeah, you are right: since a sub-component is not a Task, the caching won't work. But it is the step result that's important, so if the step cache is available, I think it should cover the majority of pipeline use cases.

  				
Posted 2 years ago by GrotesqueDog77

Yes, but I'm not sure that they need to have a separate Task. In my opinion, it would be better if they are visible in the UI but all the metrics/artifacts are reported to the step Task.

  				
Posted 2 years ago by GrotesqueDog77

My reasoning is that pipelines can give me a good visual overview of what is going on, and I want to have a lot of small steps. My dataset is 2 GB of images, and I want to have a step where I download it with StorageManager.get_local_copy(), save it, and pass only the path to this dataset to the next steps. But every agent is a different pod, so I do not know how to properly share the folder with images.

  				
Posted 2 years ago by GrotesqueDog77

AgitatedDove14 thanks for the link, but I need a different thing.
Step 1 of the pipeline: I download images from S3 (many of them) and want to return the paths.
Step 2 of the pipeline: read the images from those paths.
Here is some pseudocode:

```
from clearml import StorageManager

def step_one():
    # pseudocode: remote_url must point at the dataset in S3
    download_dataset = StorageManager.get_local_copy(remote_url=...)
    paths = collect_paths_as_strings()  # helper not shown
    return paths

def step_two(paths):
    image_1 = read_image(paths[0])  # helper not shown
```

  				
Posted 2 years ago by GrotesqueDog77

GrotesqueDog77 this should just work: decorate the functions with @PipelineDecorator.component and call them one after the other:
`paths = step_one()`
`step_two(paths)`
ClearML will make sure it serializes the strings and passes them to step two (of course step two should actually run on a machine with access to the same folder, but this is another issue 🙂 )

  				
Posted 2 years ago by AgitatedDove14

If you take a look here, the returned objects are automatically serialized and stored on the files server or object storage, and also deserialized when passed to the next step.
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py

You can of course do the same manually
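
As a rough illustration of doing this manually, the sketch below serializes a step's return value to a shared location and loads it again in the next step. It uses only the standard library (`pickle`, plus a temporary directory standing in for the files server or object storage). The names `artifact_store`, `store_result`, `load_result`, and the example paths are assumptions for illustration, not ClearML APIs.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the files server / object storage (assumption for this sketch).
artifact_store = Path(tempfile.mkdtemp())


def store_result(step_name: str, value) -> Path:
    """Serialize a step's return value so another process can load it."""
    artifact_path = artifact_store / f"{step_name}.pkl"
    with artifact_path.open("wb") as f:
        pickle.dump(value, f)
    return artifact_path


def load_result(step_name: str):
    """Deserialize the stored return value of a previous step."""
    with (artifact_store / f"{step_name}.pkl").open("rb") as f:
        return pickle.load(f)


def step_one():
    # Only the path strings are passed on, not the image files themselves.
    return ["/data/images/0001.png", "/data/images/0002.png"]


def step_two(paths):
    # Placeholder work: a real step would open and process each image.
    return len(paths)


store_result("step_one", step_one())
paths = load_result("step_one")
count = step_two(paths)
```

Note that only the strings travel between the steps here. The files they point to still have to be reachable from wherever step two runs, for example through a shared volume.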

  				
Posted 2 years ago by AgitatedDove14

yes

  				
Posted 2 years ago by GrotesqueDog77

Yes, I think it is absolutely fine. Here is the pseudocode of my understanding, in ClearML-like syntax:
```
def complex_steps(args):
    # As far as I can see, the functions should be implemented inside the step
    # for ClearML to be able to see them

    @sub_node
    def action_1(params):
        ...
        return result

    @sub_node
    def action_2(params):
        ...
        return result

    @sub_node
    def action_3(params_1, params_2):
        ...
        return result

    act1_result = action_1(args.param1)
    act2_result = action_2(args.param2)
    return action_3(act1_result, act2_result)

# As a result, this function will appear in ClearML as:
# complex_steps

# ----------------------------------

pipe_c = PipelineController(strategy=single_agent)
pipe_c.add_functional_step(complex_steps, strategy=pipe_c.agent, outputs=[path_to_big_dataset])
pipe_c.add_functional_step(complex_steps2, strategy=pipe_c.agent, func_kwargs={"paths": "${complex_steps.path_to_big_dataset}"})
pipe_c.add_functional_step(complex_steps3, strategy=default)

# default strategy means: use a different pod for the step
```
I hope it makes some sense 🙂
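
The `@sub_node` decorator in the pseudocode above is a proposal, not an existing ClearML feature. Purely as a sketch of the idea (sub-nodes run in the same process as the step and are only recorded for display), it could look something like this in plain Python. `sub_node`, `executed_sub_nodes`, and the toy action functions are all hypothetical.

```python
import functools

# Ordered record of the sub-nodes that ran inside the current step.
executed_sub_nodes = []


def sub_node(func):
    """Hypothetical decorator: run func in-process and record the call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        executed_sub_nodes.append(func.__name__)
        return result
    return wrapper


def complex_steps(param1, param2):
    # Sub-nodes are defined inside the step, as in the proposal.
    @sub_node
    def action_1(x):
        return x + 1

    @sub_node
    def action_2(x):
        return x * 2

    @sub_node
    def action_3(a, b):
        return a + b

    act1_result = action_1(param1)
    act2_result = action_2(param2)
    return action_3(act1_result, act2_result)


result = complex_steps(1, 2)
```

In this sketch the recorded names could feed a UI graph, while any metrics or artifacts would still belong to the single enclosing step, which matches the "visible but not a separate Task" idea discussed in the thread.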

  				
Posted 2 years ago by GrotesqueDog77

Yes, but I'm not sure that they need to have separate task

Hmm okay I need to check if this can be easily done
(BTW, the downside of that, you can only cache a component, not a sub-component)

  				
Posted 2 years ago by AgitatedDove14

Makes total sense!
Interesting, you are defining the sub-component inside the function, I like that, this makes the code closer to how this is executed!

  				
Posted 2 years ago by AgitatedDove14

"sub nodes" inside pipeline, in my opinion, makes them much more useful, in sense that all the steps are visible.

Yeah I really like this idea... continuing this thread, would it also make sense to have a Task object per "sub-node" and run the sub-nodes as subprocess of the parent Node? I'm thinking this sounds like a combination of both local pipeline execution and remote pipeline execution.
wdyt?

  				
Posted 2 years ago by AgitatedDove14

AgitatedDove14 maybe you have an idea how to deal with the second issue? Because this is exactly what I want to get 🙂

  				
Posted 2 years ago by GrotesqueDog77

Hi AgitatedDove14, storage.
Step 1 of the pipeline: generate a file.
Step 2 of the pipeline: read the file generated at step 1.

  				
Posted 2 years ago by GrotesqueDog77

Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?

  				
Posted 2 years ago by AgitatedDove14

The intuition is: I care about the step result, and I also care what the sub-steps in the step are.

Example: the step "evaluate model" consists of dataset + model. I need these sub-steps:
download dataset
download models
evaluate
I do not really care what will be in the sub-steps' metrics, but I care what is stored in the evaluate model step. It will make everything compact and easily accessible.

  				
Posted 2 years ago by GrotesqueDog77

GrotesqueDog77 when you say "the second issue" , do you mean the fact that both step 1 and step 2 should have access to the same filesystem?

  				
Posted 2 years ago by AgitatedDove14
