Question About Pipeline: My Setup Is As Follows

Answered

Question about pipelines. My setup is as follows:

Step 1: train_foo.py (which imports its config from another .py file) => generates model.pt
Step 2: gen_bar.py <data> + model.pt (from step 1) => generates a DataFrame

I understand that I can wrap this in a pipeline, and each time I run it, step 1 and then step 2 are executed.
Now, is it possible to change only <data> in step 2, run the pipeline, and have it be smart enough to know that step 1 doesn't need to re-run?
How does it know that step 1 "did not change"? Is it purely from the input parameters? An md5sum of the file content if a parameter is a path to a file?

  				
Posted one year ago by ManiacalLizard2


Answers 13

Yup, but you can modify them in the UI after task creation (if the task is in draft state).
It happens at runtime, when the PipelineController class is instantiated.

  				
Posted one year ago by SmallTurkey79

Following your example, if the seeds are hard-coded in the code, then the git hash will detect whether a change happened and whether the step needs to run or not.

  				
Posted one year ago by ManiacalLizard2

I think of draft tasks as "class definitions" that the pipeline uses to create task "objects" out of.

  				
Posted one year ago by SmallTurkey79

Thanks for all the pointers! I will try to have a good play around.

  				
Posted one year ago by ManiacalLizard2

Basically, the git hash of the executed experiment plus a hash of the inputs to the task.
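
A minimal sketch of that idea in plain Python, not ClearML's actual implementation: the function name `step_cache_key` and the choice of SHA-256 over a sorted JSON dump are assumptions made for illustration. It only shows that a key built from a pinned commit plus the step's inputs stays stable when nothing changes and differs as soon as one input does.

```python
import hashlib
import json


def step_cache_key(commit: str, params: dict) -> str:
    """Combine a pinned commit hash with the step's inputs into one key."""
    # sort_keys makes the key independent of dict insertion order.
    payload = json.dumps({"commit": commit, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = step_cache_key("abc123", {"lr": 0.01, "epochs": 5})
key_b = step_cache_key("abc123", {"epochs": 5, "lr": 0.01})  # same inputs, reordered
key_c = step_cache_key("abc123", {"lr": 0.02, "epochs": 5})  # one input changed
print(key_a == key_b, key_a == key_c)
```

Note that a key like this hashes parameter values only. If a parameter is a file path, the path string is what gets compared, so file contents changing under the same path would go unnoticed in this sketch.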

  				
Posted one year ago by SmallTurkey79

The pipeline is there to orchestrate tasks into more complex functionality, and to take advantage of caching, yes.

Here I run backtesting (how well did I predict the future), and can control the frequency: "every week", "every month", etc.
So if I increase the frequency, I don't need to rerun certain branches of the pipeline, and therefore they are cached. Another example: if I change something that impacts layer 3 but not layers 1-2, then about half my tasks are cached.

The pictured pipeline is: "create data, then optionally filter it, then train/eval, then summarize average performance".

  				
Posted one year ago by SmallTurkey79

Maybe I will play around a bit and ask more specific questions... It's just that I cannot find many docs on how pipeline caching works (which is the main point of a pipeline?)

  				
Posted one year ago by ManiacalLizard2

To me, the whole point of having a pipeline is to have a system that "knows" the previous state and makes "smart" decisions about what should run and what should not. If it's just about if/then/else, then code already handles all that.
What I struggle with a bit is finding docs on how it determines the existing state and how it decides what to run, thus the initial question.

  				
Posted one year ago by ManiacalLizard2

How does it work if I create my pipeline from code? Will the task capture the git repo state when it is first run, and use the commit hash and uncommitted changes as a "signature"?

  				
Posted one year ago by ManiacalLizard2

That same pipeline with just 1 date input.
I have the flexibility from the UI to run a single experiment, a dozen, or a hundred... in parallel.

Pipelines are amazing 😃

  				
Posted one year ago by SmallTurkey79

Pipeline step caching matches on inputs and task status. If your task points to the latest commit, ClearML can't know what that is until runtime and can't cache. On a fixed tag or commit, it sees that no code has changed, and so if the inputs match (hashable, all parameters are serializable), then it caches.
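
To make that concrete for the two-step setup in the question, here is a toy in-memory simulation. It is a sketch under stated assumptions, not how ClearML stores or looks up cached steps: `run_step`, `_cache`, `executed`, and the `train` / `generate` stand-ins are all hypothetical names. A step is reused only when its name, pinned commit, and inputs match a run that already completed, and a step with no pinned commit always re-runs.

```python
import hashlib
import json

_cache = {}    # cache key -> stored output of a completed step
executed = []  # names of steps that actually ran, for inspection


def run_step(name, commit, inputs, fn):
    """Run fn(**inputs) unless an identical completed run is cached."""
    if commit is None:
        # Unpinned code ("latest commit"): nothing stable to match on.
        executed.append(name)
        return fn(**inputs)
    blob = json.dumps([name, commit, inputs], sort_keys=True)
    key = hashlib.sha256(blob.encode("utf-8")).hexdigest()
    if key not in _cache:
        executed.append(name)
        # Stored only after fn returns, so a failed run is never reused.
        _cache[key] = fn(**inputs)
    return _cache[key]


def train(config):
    return f"model({config})"


def generate(model, data):
    return f"frame({model}, {data})"


def run_pipeline(data, commit="abc123"):
    model = run_step("step1", commit, {"config": "foo"}, train)
    return run_step("step2", commit, {"model": model, "data": data}, generate)


run_pipeline("data_v1")
run_pipeline("data_v2")  # only step2's input changed
print(executed)  # ['step1', 'step2', 'step2']
```

The second call reuses step 1 because nothing it depends on changed, and re-runs only step 2, which is the behaviour the original question asks about.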

  				
Posted one year ago by SmallTurkey79

And yes, you're correct. I'd say this is exactly what ClearML pipelines offer.
The smartness is simple enough: the same inputs are assumed to create the same outputs (it's up to YOU to ensure your tasks satisfy this determinism, e.g. seeds are either hard-coded or inputs to a task).
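
A small illustration of the determinism point, using only the standard library. The function `noisy_train` is a hypothetical stand-in for a training step; the idea is that once the seed is an explicit input, equal inputs give equal outputs, which is the property caching relies on.

```python
import random


def noisy_train(data, seed):
    """Toy 'training' whose randomness is fully controlled by `seed`."""
    rng = random.Random(seed)  # local generator, no hidden global state
    return [x + rng.random() for x in data]


same = noisy_train([1, 2, 3], seed=0) == noisy_train([1, 2, 3], seed=0)
different = noisy_train([1, 2, 3], seed=0) == noisy_train([1, 2, 3], seed=1)
print(same, different)  # True False
```

If the seed were drawn inside the step from the clock or from global state, two runs with identical visible inputs could differ, and a cached result would silently stand in for an output that is no longer reproducible.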

  				
Posted one year ago by SmallTurkey79

Yeah, it's using what you see in the UI here.
So if you made a change to a task used in a pipeline (my pipelines are built from tasks, not functions; I can't speak to that, but I think it just generates a hidden task under the hood), point the (draft) task to that commit (assuming it's pushed), or re-run the task. The pipeline picks up the tasks the API is aware of (by ID, or by name, in which case it uses the latest updated one) under the specified project, not your local code.

That part was confusing for me to understand at first, but now I have the mental model for how they work.
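
For a pipeline built from existing tasks, a sketch along these lines may help. It assumes the `clearml` package is installed and a server is configured, and that tasks named `train_foo` and `gen_bar` already exist in a project called `examples`. Those names, the pipeline name, and the parameter names `General/data` and `General/train_task_id` are hypothetical; the right section prefix depends on how each task registers its parameters.

```python
def build_pipeline(data_path):
    """Sketch: two steps cloned from existing ClearML tasks, caching enabled."""
    # Imported lazily so this file can be loaded without clearml installed.
    from clearml import PipelineController

    pipe = PipelineController(
        name="foo-bar-pipeline",  # hypothetical pipeline name
        project="examples",       # hypothetical project
        version="1.0.0",
    )
    # Pipeline-level input; changing it only affects steps that reference it.
    pipe.add_parameter(name="data", default=data_path)

    pipe.add_step(
        name="train",
        base_task_project="examples",
        base_task_name="train_foo",  # looked up by name under the project
        cache_executed_step=True,    # reuse a matching earlier run if one exists
    )
    pipe.add_step(
        name="generate",
        parents=["train"],
        base_task_project="examples",
        base_task_name="gen_bar",
        parameter_override={
            "General/data": "${pipeline.data}",
            "General/train_task_id": "${train.id}",  # lets step 2 find model.pt
        },
        cache_executed_step=True,
    )
    return pipe
```

Calling `build_pipeline("data_v2.csv").start_locally()` would run the controller logic on the local machine while steps go to the execution queue. With `cache_executed_step=True` on both steps and the base tasks pinned to a fixed commit, changing only `data` should leave the `train` step eligible for reuse.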

  				
Posted one year ago by SmallTurkey79
