Hey, I’m thinking of using a ClearML Pipeline to compile a dataset more efficiently.
My hope is that I won't have to run every step for every data point every time, as the dataset is large and some of the steps are compute-intensive.
I am at a stage where I will be switching out models and algorithms rapidly to try to find the best combinations, and adding/removing Tasks (e.g. to create new Features), so it's important to me that the process of compiling the dataset is as quick and traceable as possible.
How would I set up a ClearML Pipeline/Tasks (Pipeline components) such that:
1. If a Task has been run before with the same code, model, and input data, the Task is not run again; instead, its cached outputs (e.g. features) are passed on to the next Task(s) in the Pipeline.
2. If the code or model for a Task has been updated, all input data are reprocessed, with the results passed on to downstream Task(s).
3. If the code or model for a Task has not changed but some input data has changed, the Task runs only on the new input data, and the newly processed outputs are combined with the (correct) previously computed and cached outputs.
4. If new Tasks are added to the Pipeline (e.g. the Tasks required to create a new Feature in the final CSV), the existing Tasks still behave as in 1, 2 and 3.
Is there a good way to do this?
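For reference, this is roughly what I had in mind. It's only a sketch: the project/function names, dataset-version IDs and the chunking scheme are placeholders, and I'm assuming that `cache=True` on a decorator-based Pipeline component reuses stored outputs when the component code and its arguments are unchanged (covering 1 and 2), that splitting the data into per-chunk ClearML Dataset versions and calling the component once per chunk would give the incremental behaviour in 3, and that adding new component calls later wouldn't invalidate the cache of the existing ones (4).

```python
from clearml import PipelineDecorator


@PipelineDecorator.component(return_values=["features_df"], cache=True)
def extract_features(chunk_dataset_id: str, model_name: str):
    # Each call becomes its own Task. With cache=True, I'm assuming ClearML reuses the
    # stored output when this exact code has already run with the same arguments
    # (same chunk dataset version, same model name) instead of re-executing.
    from pathlib import Path
    import pandas as pd
    from clearml import Dataset

    local_dir = Dataset.get(dataset_id=chunk_dataset_id).get_local_copy()
    df = pd.concat(pd.read_csv(p) for p in sorted(Path(local_dir).glob("*.csv")))
    # ... the heavy feature computation using `model_name` would go here ...
    df["model_used"] = model_name  # placeholder for the real feature columns
    return df


@PipelineDecorator.component(return_values=["final_df"], cache=True)
def merge_features(df_a, df_b):
    # Combines freshly computed and cached chunk outputs into one table (point 3).
    import pandas as pd

    merged = pd.concat([df_a, df_b], ignore_index=True)
    merged.to_csv("final.csv", index=False)  # written locally; could be versioned as a ClearML Dataset
    return merged


@PipelineDecorator.pipeline(name="compile-dataset", project="dataset-pipelines", version="0.1.0")
def compile_dataset(chunk_a_id: str, chunk_b_id: str, model_name: str):
    # One component call per data chunk, so only chunks whose dataset version, code or
    # model changed should actually re-run; untouched chunks should come back from cache.
    feats_a = extract_features(chunk_dataset_id=chunk_a_id, model_name=model_name)
    feats_b = extract_features(chunk_dataset_id=chunk_b_id, model_name=model_name)
    return merge_features(df_a=feats_a, df_b=feats_b)


if __name__ == "__main__":
    PipelineDecorator.run_locally()  # run everything in-process while iterating
    compile_dataset(
        chunk_a_id="<dataset-version-id-1>",
        chunk_b_id="<dataset-version-id-2>",
        model_name="model-v1",
    )
```

Is this the right general shape, or is there a better pattern for the "only reprocess the changed data" part?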