Answered
Question About Pipeline: My Setup Is As Follows


  • Step 1: train_foo.py (which imports config from another .py) => generates a model.pt
  • Step 2: gen_bar.py <data> + model.pt (from Step 1) => generates a DataFrame

I understand that I can wrap this in a pipeline, and each time I run it, Step 1 then Step 2 is executed.
Now, is it possible to just change <data> in Step 2, run the pipeline, and have it be smart enough to know that Step 1 doesn't need to re-run?
How does it know that Step 1 "did not change"? Is it purely from the input parameters? An md5sum of the input parameter's content if it's a path to a file?
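For concreteness, here is a minimal sketch of how this two-step setup could be wrapped with ClearML's PipelineController, assuming both scripts already exist as ClearML tasks; the project name, task names, and parameter names below are placeholders:

```python
from clearml import PipelineController

# Build a pipeline from two pre-existing (draft) tasks. Project, task,
# and parameter names here are placeholders for your own.
pipe = PipelineController(
    name="foo-bar pipeline",
    project="examples",
    version="1.0.0",
)

# <data> becomes a pipeline-level parameter you can change per run
pipe.add_parameter(name="data", default="path/to/data.csv")

# Step 1: train_foo.py => model.pt (reused from cache if code + inputs match)
pipe.add_step(
    name="step1_train",
    base_task_project="examples",
    base_task_name="train_foo",
    cache_executed_step=True,
)

# Step 2: gen_bar.py consumes <data> plus the model produced by step 1
pipe.add_step(
    name="step2_gen",
    parents=["step1_train"],
    base_task_project="examples",
    base_task_name="gen_bar",
    parameter_override={
        "Args/data": "${pipeline.data}",            # the only thing that changes
        "Args/model_task_id": "${step1_train.id}",  # hand step 1's output to step 2
    },
    cache_executed_step=True,
)

pipe.start_locally(run_pipeline_steps_locally=True)
```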
  
  
Posted 4 months ago

Answers 13


Basically, the git hash of the executed experiment plus a hash of the inputs to the task.
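Conceptually (this is an illustration, not ClearML's actual internals), the cache key behaves something like:

```python
import hashlib
import json

def cache_key(git_commit: str, inputs: dict) -> str:
    """Same commit + same serializable inputs -> same key -> reuse cached result."""
    payload = json.dumps({"commit": git_commit, "inputs": inputs}, sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()

# Unchanged code and inputs map to the same key, so the step can be skipped...
assert cache_key("3f2a1bc", {"data": "a.csv"}) == cache_key("3f2a1bc", {"data": "a.csv"})
# ...while changing either the commit or an input invalidates the cache.
assert cache_key("3f2a1bc", {"data": "b.csv"}) != cache_key("3f2a1bc", {"data": "a.csv"})
```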

  
  
Posted 4 months ago

How does it work if I create my pipeline from code? Will the task get the git repo state when first run, and use the commit hash plus uncommitted changes as its "signature"?

  
  
Posted 4 months ago

The pipeline is there to orchestrate tasks into more complex functionality and to take advantage of caching, yes.

Here I run backtesting (how well did I predict the future) and can control the frequency: "every week", "every month", etc.
So if I increase the frequency, I don't need to re-run certain branches of the pipeline, and those are therefore cached. Another example: if I change something that impacts layer 3 but not layers 1-2, then about half my tasks are cached.

the pictured pipeline is: "create data, then optionally filter it, then train/eval, then summarize average performance"

  
  
Posted 4 months ago

Following your example: if the seeds are hard-coded, then the git hash will detect whether a change happened, and hence whether the step needs to be re-run or not.

  
  
Posted 4 months ago

Yup, but you can modify them after task creation in the UI (if it's in a draft state).
It happens upon runtime instantiation of the PipelineController class.

  
  
Posted 4 months ago

Yeah, it's using what you see in the UI here.
So if you made a change to a task used in a pipeline (my pipelines are built from tasks, not functions; I can't speak to the function approach, but I think it just generates a hidden task under the hood), point the (draft) task to that commit (assuming it's pushed), or re-run the task. The pipeline picks up the tasks the API is aware of (by ID, or by name, in which case it uses the latest updated one) under the specified project, not from local code.

That part was confusing for me to understand at first, but now I have a mental model for how they work.

  
  
Posted 4 months ago

Pipeline step caching matches on inputs and task status. If your task points to the latest commit, ClearML can't know what that is until runtime, so it can't cache. With a fixed tag or commit, it sees that no code has changed, and then if the inputs match (hashable, i.e. all parameters are serializable), it caches.
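A minimal sketch of pinning a base task to a fixed commit when creating it from code (the repo URL, commit hash, and names below are placeholders):

```python
from clearml import Task

# Create the base (draft) task pinned to an explicit commit, so the code
# side of the cache check is stable across pipeline runs. Repo URL,
# commit hash, and names are placeholders.
base_task = Task.create(
    project_name="examples",
    task_name="gen_bar",
    repo="https://github.com/me/myrepo.git",
    commit="3f2a1bc9",
    script="gen_bar.py",
)
```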

  
  
Posted 4 months ago

And yes, you're correct. I'd say this is exactly what ClearML pipelines offer.
The smartness is simple enough: the same inputs are assumed to create the same outputs (it's up to YOU to ensure your tasks satisfy this determinism, e.g. seeds are either hard-coded or passed as inputs to a task).
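As an illustration of the "seed as an input" option, a hedged sketch of what that could look like inside train_foo.py (parameter names and values are placeholders):

```python
from clearml import Task
import random

# Inside train_foo.py: make the seed an explicit task parameter instead of
# a hidden global, so identical inputs really do mean identical outputs.
task = Task.init(project_name="examples", task_name="train_foo")

params = {"seed": 42, "epochs": 10}  # placeholder values
task.connect(params)  # connected parameters count toward the step's cached inputs

random.seed(params["seed"])
```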

  
  
Posted 4 months ago

Maybe I will play around a bit and ask more specific questions... it's just that I can't find much documentation on how the pipeline caching works (which is the main point of pipelines?).

  
  
Posted 4 months ago

I think of draft tasks as "class definitions" that the pipeline creates task "objects" from.
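In code, that mental model roughly corresponds to cloning (a sketch; the project, task, parameter, and queue names are placeholders):

```python
from clearml import Task

# The draft task is the "class definition"...
template = Task.get_task(project_name="examples", task_name="gen_bar")

# ...and each run is a clone: a concrete "object" with its own parameters.
instance = Task.clone(source_task=template, name="gen_bar (run 1)")
instance.set_parameter("Args/data", "new_data.csv")
Task.enqueue(instance, queue_name="default")
```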

  
  
Posted 4 months ago

That same pipeline with just one date input.
I have the flexibility from the UI to run a single experiment, a dozen, or a hundred... in parallel.

Pipelines are amazing 😃

  
  
Posted 4 months ago

To me, the whole point of having a pipeline is to have a system that "knows" the previous state and makes "smart" decisions on what should run and what shouldn't. If it were just about if-then-else, code would already handle all that.
And what I struggle with a bit is finding docs on how it determines the existing state and how it decides what to run, hence the initial question.

  
  
Posted 4 months ago

Thanks for all the pointers! I will try to have a good play around.

  
  
Posted 4 months ago