basically the git hash of the executed experiment plus a hash of the inputs to the task.
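conceptually something like this (a mental-model sketch only, not the actual clearml internals; the function and field names are mine):

```python
import hashlib
import json

# conceptual sketch, NOT clearml internals: a step's cache key is
# roughly "what code ran" plus "what inputs it got"
def cache_key(commit_hash: str, uncommitted_diff: str, inputs: dict) -> str:
    payload = json.dumps(
        {"commit": commit_hash, "diff": uncommitted_diff, "inputs": inputs},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```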
how does it work if I create my pipeline from code? Will the task grab the git repo state on first run and use the commit hash plus uncommitted changes as its "signature"?
yes, the pipeline is there to orchestrate tasks into more complex functionality and to take advantage of caching.
here I run backtesting (how well did I predict the future), and can control the frequency: every week, every month, etc.
so if I increase the frequency, I don't need to rerun certain branches of the pipeline; they come from the cache. another example: if I change something that impacts layer 3 but not layers 1-2, then about half my tasks are cached.
the pictured pipeline is: "create data, then optionally filter it, then train/eval, then summarize average performance"
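in code it's roughly this (an untested sketch; the project/task names are made up, but PipelineController and add_step are the real clearml calls):

```python
from clearml.automation import PipelineController

PROJ = "backtesting"  # hypothetical project name

pipe = PipelineController(name="backtest-pipeline", project=PROJ, version="1.0")

# each step clones an existing task "definition" from the project
pipe.add_step(
    name="create_data",
    base_task_project=PROJ,
    base_task_name="create_data",
    cache_executed_step=True,
)
pipe.add_step(
    name="filter_data",
    parents=["create_data"],
    base_task_project=PROJ,
    base_task_name="filter_data",
    cache_executed_step=True,
)
pipe.add_step(
    name="train_eval",
    parents=["filter_data"],
    base_task_project=PROJ,
    base_task_name="train_eval",
    cache_executed_step=True,
)
pipe.add_step(
    name="summarize",
    parents=["train_eval"],
    base_task_project=PROJ,
    base_task_name="summarize",
)

pipe.start()
```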
following your example, if the seeds are hard-coded, then the git hash will detect whether a change happened and whether the step needs to be rerun or not
yup, but you can modify them after task creation in the UI (if it's in draft state)
it's upon runtime instantiation of the PipelineController class.
yeah. it's using what you see in the UI here.
so if you made a change to a task used in a pipeline (my pipelines are built from tasks, not functions... can't speak to the function flavor, but I think it just generates a hidden task under the hood), either point the (draft) task to that commit (assuming it's pushed) or re-run the task. the pipeline picks up the tasks the API is aware of (by id, or by name, in which case it uses the latest updated one) under the specified project, not your local code.
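e.g. to pin a step's task to a commit, something like this (a sketch; the project/task names and the commit are hypothetical, and I believe set_script is the right call, double-check the SDK docs):

```python
from clearml import Task

# looking up by name resolves the latest updated task in the project
task = Task.get_task(project_name="backtesting", task_name="train_eval")

# pin the draft task to a specific pushed commit so the pipeline step
# has a fixed code reference instead of "latest on branch"
task.set_script(commit="abc123")  # the commit must exist on the remote
```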
that part was confusing to understand at first, but now I have a mental model for how they work.
Pipeline step caching matches on inputs and task status. If your task points to the latest commit, clearml can't know what that is until runtime and can't cache. On a fixed tag or commit, it sees that no code has changed, so if the inputs match (i.e. they're hashable and all parameters are serializable), it caches.
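concretely the knob is per step; continuing the sketch from before (same pipe object, hypothetical names):

```python
# with cache_executed_step=True, clearml reuses a previously completed
# run of this step when the code reference and resolved inputs match
pipe.add_step(
    name="train_eval",
    parents=["filter_data"],
    base_task_project="backtesting",
    base_task_name="train_eval",
    parameter_override={"General/dataset_id": "${filter_data.id}"},
    cache_executed_step=True,
)
```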
and yes, you're correct. I'd say this is exactly what clearml pipelines offer.
the smartness is simple enough: same inputs are assumed to create the same outputs (it's up to YOU to ensure your tasks satisfy this determinism... e.g. seeds are either hard-coded or passed as inputs to the task)
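e.g. make the seed an explicit parameter instead of a hidden global, so it's part of the input signature (sketch with made-up names):

```python
import random

from clearml import Task

task = Task.init(project_name="backtesting", task_name="train_eval")

# connecting the seed makes it a visible, overridable task parameter,
# and therefore part of the inputs the cache matches on
params = task.connect({"seed": 42})
random.seed(params["seed"])
```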
maybe I will play around a bit and ask more specific questions... it's just that I can't find much documentation on how pipeline caching works (which is the main point of pipelines?)
I think of draft tasks as "class definitions" that the pipeline uses to create task "objects" out of.
that same pipeline with just 1 date input.
from the UI I have the flexibility to run a single experiment, a dozen, or a hundred... in parallel.
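e.g. fanning the same base task out over a list of dates (hypothetical parameter name); steps with no dependency between them run in parallel:

```python
# clone the same base task once per backtest date; the resulting steps
# share the same parent, so they execute in parallel on the queue
dates = ["2023-01-01", "2023-02-01", "2023-03-01"]
for d in dates:
    pipe.add_step(
        name=f"train_eval_{d}",
        parents=["filter_data"],
        base_task_project="backtesting",
        base_task_name="train_eval",
        parameter_override={"General/as_of_date": d},
        cache_executed_step=True,
    )
```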
pipelines are amazing 😃
To me the whole point of having pipelines is to have a system that "knows" previous state and makes "smart" decisions about what should run and what not. If it's just if/then/else, plain code already handles all that.
And what I struggle with a bit is finding docs on how it determines the existing state and decides what to run. hence the initial question
thanks for all the pointers! I will have a good play around