that same pipeline with just 1 date input.
i have the flexibility from the UI to either run a single, a dozen, or a hundred experiments... in parallel.
pipelines are amazing 😃
I think of draft tasks as "class definitions" that the pipeline uses to create task "objects" out of.
following your example, if the seeds are hard-coded in the code, then the git hash will detect whether a change happened and whether the step needs to be run or not
yeah. it's using what you see in the UI here.
so if you made a change to a task used in a pipeline (my pipelines are from tasks, not functions... can't speak to that but i think it just generates a hidden task under the hood), point the (draft) task to that commit (assuming it's pushed), or re-run the task. the pipeline picks up from the tasks the API is aware of (by id or by name, in which case it uses latest updated) under the specified project, not from local code.
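a rough sketch of what "pipeline from existing tasks" looks like in code, in case it helps. the project name, task names, and the `frequency` parameter are made up for illustration; `PipelineController`, `add_parameter`, `add_step`, and `cache_executed_step` are the ClearML calls I'd expect to use here, but check them against the docs for your version. the steps are looked up by project + name on the server, not from local code:

```python
def step_kwargs(name, project, base_task_name, overrides, parents=None):
    """Collect the add_step() arguments for one step built from an existing task."""
    return dict(
        name=name,
        parents=parents or [],
        base_task_project=project,      # hypothetical project name
        base_task_name=base_task_name,  # resolved on the server, latest updated wins
        parameter_override=overrides,
        cache_executed_step=True,       # reuse a previous run if nothing changed
    )


def build_pipeline(project="backtesting"):
    # Imported here so the helper above works without clearml installed.
    from clearml import PipelineController

    pipe = PipelineController(name="backtest", project=project, version="1.0.0")
    pipe.add_parameter(name="frequency", default="monthly")
    pipe.add_step(**step_kwargs(
        "create_data", project, "create_data",
        {"General/frequency": "${pipeline.frequency}"},
    ))
    pipe.add_step(**step_kwargs(
        "train_eval", project, "train_eval", {}, parents=["create_data"],
    ))
    return pipe  # call pipe.start() or pipe.start_locally() to run it
```

nothing talks to the server until `build_pipeline()` is called, so the draft tasks it names need to exist there first.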
that part was confusing for me to understand at first, but now I have the mental model for how they work.
how does it work if I create my pipeline from code? Does the task get the git repo state when it is first run and use the commit hash and uncommitted changes as a "signature"?
yup, but you can modify them after task creation in the UI (if it's in draft state)
it's upon runtime instantiation of the PipelineController class.
and yes, you're correct. I'd say this is exactly what clearml pipelines offer.
the smartness is simple enough: same inputs are assumed to create the same outputs (it's up to YOU to ensure your tasks satisfy this determinism... e.g. seeds are either hard-coded or inputs to a task)
Pipeline step caching matches on inputs and task status. If your task points to the latest commit, clearml can't know what that is until runtime and can't cache. On a fixed tag or commit, it sees no code has changed, and so if inputs match (hashable, all parameters are serializable), then it caches.
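the rule above as a toy function, just to pin down the mental model. this is not ClearML's code: `StepRun`, `can_reuse`, and the `"completed"` status string are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class StepRun:
    commit: Optional[str]  # None means "latest commit", unknown until runtime
    params: Dict[str, Any] = field(default_factory=dict)
    status: str = "draft"


def can_reuse(requested: StepRun, previous: StepRun) -> bool:
    """Return True if a previous run could stand in for the requested step."""
    if requested.commit is None:
        # Code version is not pinned, so there is nothing stable to match on.
        return False
    return (
        previous.status == "completed"
        and previous.commit == requested.commit
        and previous.params == requested.params
    )


previous = StepRun(commit="abc123", params={"seed": 1}, status="completed")
print(can_reuse(StepRun(commit="abc123", params={"seed": 1}), previous))
print(can_reuse(StepRun(commit=None, params={"seed": 1}), previous))
```

the first call matches on a pinned commit with identical inputs; the second one doesn't because the commit is left floating.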
thanks for all the pointers! I will try to have a good play around
maybe I will play around a bit and ask more specific questions... It's just that I cannot find many docs on how the pipeline caching works (which is the main point of a pipeline?)
the pipeline is to orchestrate tasks to create more complex functionality, and take advantage of caching, yes.
here I run backtesting (how well did i predict the future), and can control frequency "every week, every month" etc.
so if I increase frequency, I don't need to rerun certain branches of the pipeline and therefore they are cached. another example: if I change something that impacts layer 3 but not layers 1-2, then about half my tasks are cached.
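here's a toy model of the "layer 3 changed, layers 1-2 are cached" case. the step names follow the pipeline I described (create data, filter, train/eval, summarize); the parameters and the idea of folding parent signatures into each step's hash are assumptions to make the example work, not a description of ClearML internals.

```python
import hashlib
import json


def signatures(steps):
    """Map each step name to a hash of its params plus its parents' hashes."""
    sigs = {}

    def sig(name):
        if name not in sigs:
            parents, params = steps[name]
            payload = json.dumps(
                {"params": params, "parents": [sig(p) for p in parents]},
                sort_keys=True,
            )
            sigs[name] = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        return sigs[name]

    for name in steps:
        sig(name)
    return sigs


def cached_steps(old, new):
    """Steps in `new` whose signature is unchanged compared to `old`."""
    old_sigs, new_sigs = signatures(old), signatures(new)
    return sorted(n for n in new_sigs if old_sigs.get(n) == new_sigs[n])


# name -> (parents, params); all values here are hypothetical
base = {
    "create_data": ([], {"frequency": "monthly"}),
    "filter": (["create_data"], {"min_rows": 100}),
    "train_eval": (["filter"], {"lr": 0.1}),
    "summarize": (["train_eval"], {}),
}
changed = dict(base, train_eval=(["filter"], {"lr": 0.01}))

print(cached_steps(base, changed))  # ['create_data', 'filter']
```

changing only the train/eval input invalidates that step and everything downstream of it, while the two upstream steps keep their old signatures.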
the pictured pipeline is: "create data, then optionally filter it, then train/eval, then summarize average performance"
basically the git hash of the executed experiment + a hash on the inputs to the task.
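if you want to picture that "signature", something like this works as a minimal sketch. `step_signature` is a made-up helper and the exact fields ClearML hashes may differ; the point is just that the same commit + the same inputs give the same key.

```python
import hashlib
import json


def step_signature(commit_hash, inputs):
    """Hash the code version together with the (serializable) task inputs."""
    payload = json.dumps(
        {"commit": commit_hash, "inputs": inputs},
        sort_keys=True,  # key order must not affect the signature
        default=str,     # fall back to str() for non-JSON values such as dates
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = step_signature("abc123", {"seed": 42, "date": "2023-01-01"})
b = step_signature("abc123", {"date": "2023-01-01", "seed": 42})
c = step_signature("abc123", {"seed": 7, "date": "2023-01-01"})
print(a == b)  # True: same commit, same inputs
print(a == c)  # False: the seed is an input, so changing it changes the key
```

this is also why seeds should be hard-coded or passed as inputs: a seed drawn at runtime never shows up in the key, so two "identical" runs could produce different outputs.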
To me the whole point of having a pipeline is to have a system that "knows" the previous state and makes "smart" decisions on what should run and what should not. If it's just about if/then/else, then code already handles all that.
And what I struggle with a bit is finding docs on how it determines the existing state and how it decides what to run. thus the initial question