Caching can be a reason. Say you do some heavy data loading / processing in step 1. Now you're developing step 2.
It'd be nice not to have to re-run Step 1 every time you want to test a change to step 2.
You could find a way to simply write your output of step1 to disk and do everything in one step, or you could let ClearML handle that caching for you--with the added benefit that others collaborating remotely can also use the outputs of steps you've cached with ClearML
Oh there's parallelization as well. You could have step 1 gather the data, and then fan out to N parallel steps that all do different things with the data, for example hyper parameter tuning
About the caching: how does it work ? ClearML maintain it own cache and monitor if any of you code changes? Even code that get change inside an import ?
@<1576381444509405184:profile|ManiacalLizard2> , the rules for caching steps is as follows - First you need to enable it. Then assuming that there is no change of input from the previous time run AND there is no code change THEN use output from previous pipeline run. Code from imports shouldn't change since requirements are logged from previous runs and used in subsequent runs
I mean, what happen if I import and use function from another py file ? And that function code changes ?
Or you are expecting code should be frozen and only parameters changes between runs ?
If there is a change in code (Not just the script itself but a different commit / different uncommitted changes in the repo). Makes sense?
ok, so if git commit or uncommit changes differ from previous run, then the cache is "invalidated" and the step will be run again ?
Clear. Thanks @<1523701070390366208:profile|CostlyOstrich36> !