I Am Struggling A Bit To Understand The Use Case Of A Pipeline: Let's Say You Have Step1 -> Step2 -> Step3. What Is The Point Of Using The Pipeline Feature Versus Having A Single Task That Does Those Steps One After Another?

Answered

I am struggling a bit to understand the use case of a pipeline.
Let's say you have step1 -> step2 -> step3.
What is the point of using the pipeline feature versus having a single task that does those steps one after another?

  				
Posted 2 years ago by ManiacalLizard2

Answers 9

Caching can be a reason. Say you do some heavy data loading / processing in step 1. Now you're developing step 2.

It'd be nice not to have to re-run step 1 every time you want to test a change to step 2.

You could find a way to simply write the output of step 1 to disk and do everything in one step, or you could let ClearML handle that caching for you, with the added benefit that others collaborating remotely can also use the outputs of steps you've cached with ClearML.
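
A minimal sketch of what that could look like with ClearML's `PipelineController`, assuming function-based steps. The pipeline and project names, the two step functions, and the toy data are all hypothetical; `cache_executed_step=True` is the flag that asks ClearML to reuse an earlier run of the step. Building the pipeline needs a configured ClearML setup, so that part is wrapped in a function and the import is done lazily.

```python
def load_data(n_rows):
    # Stand-in for the heavy loading / processing done in step 1.
    return [i * i for i in range(n_rows)]


def summarize(data):
    # Step 2: the part under development, re-run after every change.
    return {"count": len(data), "total": sum(data)}


def build_pipeline():
    # Imported lazily so the step functions above can be tried without ClearML.
    from clearml import PipelineController

    pipe = PipelineController(
        name="caching-demo",  # hypothetical pipeline name
        project="examples",   # hypothetical project name
        version="0.0.1",
    )
    pipe.add_function_step(
        name="load_data",
        function=load_data,
        function_kwargs=dict(n_rows=1000),
        function_return=["data"],
        cache_executed_step=True,  # reuse an earlier identical run if one exists
    )
    pipe.add_function_step(
        name="summarize",
        function=summarize,
        # "${load_data.data}" refers to the "data" output of the step above.
        function_kwargs=dict(data="${load_data.data}"),
        function_return=["summary"],
        parents=["load_data"],
    )
    return pipe
```

With ClearML configured, `build_pipeline().start_locally(run_pipeline_steps_locally=True)` should run the controller and both steps on the local machine. While iterating on `summarize`, the cached `load_data` step would be expected to be skipped as long as its code and inputs stay the same.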

  				
Posted 2 years ago by BattyCrocodile47

Clear. Thanks @CostlyOstrich36!

  				
Posted 2 years ago by ManiacalLizard2

Yep

  				
Posted 2 years ago by CostlyOstrich36

The cache is invalidated if there is a change in code (not just in the script itself, but also a different commit or different uncommitted changes in the repo). Makes sense?

  				
Posted 2 years ago by CostlyOstrich36

About the caching: how does it work? Does ClearML maintain its own cache and monitor whether any of your code changes? Even code that changes inside an import?

  				
Posted 2 years ago by ManiacalLizard2

Oh, there's parallelization as well. You could have step 1 gather the data, and then fan out to N parallel steps that all do different things with the data, for example hyperparameter tuning.
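
The shape of that fan-out, sketched in plain Python rather than with ClearML itself. The function names and the toy objective are hypothetical, and the thread pool is only a stand-in: in a real pipeline each branch would be its own step and could run on a different machine.

```python
from concurrent.futures import ThreadPoolExecutor


def gather_data():
    # Step 1: runs once; its output is shared by every branch.
    return [1.0, 2.0, 3.0, 4.0]


def evaluate(data, scale):
    # One branch of the fan-out: score the data for a single hyperparameter.
    # Hypothetical objective: squared distance of the scaled mean from 5.0.
    mean = sum(data) / len(data)
    return (scale * mean - 5.0) ** 2


def fan_out(scales):
    data = gather_data()
    with ThreadPoolExecutor(max_workers=len(scales)) as pool:
        # Each branch depends only on step 1, so they can run side by side.
        scores = list(pool.map(lambda s: evaluate(data, s), scales))
    # Fan back in: keep the hyperparameter with the lowest score.
    best_score, best_scale = min(zip(scores, scales))
    return best_scale, best_score
```

Calling `fan_out([1.0, 2.0, 3.0])` evaluates three branches against the same gathered data and returns the best `(scale, score)` pair. A single task doing the same work would have to evaluate the branches one after another.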

  				
Posted 2 years ago by BattyCrocodile47

OK, so if the git commit or the uncommitted changes differ from the previous run, then the cache is "invalidated" and the step will be run again?

  				
Posted 2 years ago by ManiacalLizard2

@ManiacalLizard2, the rules for caching steps are as follows. First, you need to enable it. Then, assuming there is no change in input from the previous run AND there is no code change, the output from the previous pipeline run is used. Code from imports shouldn't change, since requirements are logged from previous runs and used in subsequent runs.
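
Those rules can be illustrated with a toy cache in plain Python. This is not ClearML's actual implementation, only a sketch of the decision described above; `cache_key`, `StepCache`, and the way code identity is represented (a commit ID plus an uncommitted diff string) are assumptions made for the example.

```python
import hashlib
import json


def cache_key(commit, uncommitted_diff, inputs):
    # Combine everything that should invalidate the cache into one digest.
    payload = json.dumps(
        {"commit": commit, "diff": uncommitted_diff, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class StepCache:
    def __init__(self):
        self._outputs = {}
        self.executions = 0  # how many times a step was actually run

    def run(self, step, commit, uncommitted_diff, inputs, enabled=True):
        key = cache_key(commit, uncommitted_diff, inputs)
        if enabled and key in self._outputs:
            # Same code and same inputs: reuse the previous output.
            return self._outputs[key]
        # Caching disabled, or the code or inputs changed: run the step.
        self.executions += 1
        output = step(**inputs)
        self._outputs[key] = output
        return output
```

Running the same step twice with an identical commit, diff, and inputs executes it once; changing any of the three, or passing `enabled=False`, executes it again.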

  				
Posted 2 years ago by CostlyOstrich36

I mean, what happens if I import and use a function from another .py file, and that function's code changes?
Or are you expecting that the code should be frozen and only the parameters change between runs?

  				
Posted 2 years ago by ManiacalLizard2

2K Views
