Could I Get Some Feedback From People With Experience Using Clearml Pipelines On The Best Way To Handle Caching? My Team Is Working On Configuring Clearml Pipelines For Our Data Processing Workflow. We Currently Have An Experimental Pipeline Configured F


Could I get some feedback from people with experience using ClearML pipelines on the best way to handle caching? My team is working on configuring ClearML Pipelines for our data processing workflow.

We currently have an experimental pipeline configured for batch data processing. It runs a basic algorithm on each item provided as input, essentially just mapping each input piece of data to a new, processed output. However the algorithm we run is somewhat expensive, and we want to be able to cache as much computation as possible. If we run the pipeline with 1000 items from our ClearML Data dataset, and then add another item, when we re-run the pipeline with those 1001 items as input, we want to be able to cache all the previous computation and only have to process the single new item.

As far as I can tell, the built-in ClearML pipeline cache features will re-run the entire pipeline step if the input changes at all, so when the new item is added the entire batch pipeline step will re-run with all 1001 items.

What’re the best practices for handling this with ClearML? I’d really appreciate any information anyone can share about their experiences with this. Thank you :)

  				
Posted 9 months ago by TartFox93


Answers 2

Thank you, that’s super helpful! I’ll work on my own caching logic for tasks then. I appreciate all the information.

  				
Posted 9 months ago by TartFox93

It sounds like you understand the limitations correctly.

As far as I know, it'd be up to you to write your own code that computes the delta between the old and new inputs and only re-processes the new entries.

The API would let you search through prior experimental results.

So you could load the prior task, check which ids appear in its output (maybe save these as a separate artifact for faster load times), and only evaluate the new inputs. Perhaps copy the old outputs over to the new task for completeness.

That's how I'd approach it: use "data-creation" tasks and artifacts to roll your own logic for "caching" (skipping evaluation) within the task itself.
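A minimal sketch of that idea, under some assumptions: items are identified by string ids, the prior task stored its results as a dict artifact keyed by id, and the artifact name `outputs` is made up for illustration. `process_incrementally`, `load_cached_outputs` and `save_outputs` are hypothetical helpers, not part of ClearML; the only ClearML calls used are `Task.get_task`, `Task.current_task`, `task.artifacts[...].get()` and `task.upload_artifact`.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def process_incrementally(
    item_ids: Iterable[str],
    cached_outputs: Dict[str, object],
    process_fn: Callable[[str], object],
) -> Tuple[Dict[str, object], List[str]]:
    """Return outputs for every id, calling process_fn only for uncached ids."""
    outputs: Dict[str, object] = {}
    newly_processed: List[str] = []
    for item_id in item_ids:
        if item_id in cached_outputs:
            # Copy the old result over so the new task holds the complete output.
            outputs[item_id] = cached_outputs[item_id]
        else:
            outputs[item_id] = process_fn(item_id)
            newly_processed.append(item_id)
    return outputs, newly_processed


def load_cached_outputs(prior_task_id: str, artifact_name: str = "outputs") -> Dict[str, object]:
    """Load the id -> output dict saved by an earlier task (assumed artifact layout)."""
    from clearml import Task  # imported lazily so the pure logic above has no dependency

    prior_task = Task.get_task(task_id=prior_task_id)
    artifact = prior_task.artifacts.get(artifact_name)
    return artifact.get() if artifact is not None else {}


def save_outputs(outputs: Dict[str, object], artifact_name: str = "outputs") -> None:
    """Attach the merged outputs to the currently running task for the next run to reuse."""
    from clearml import Task

    # Assumes this runs inside a ClearML task; current_task() is None otherwise.
    Task.current_task().upload_artifact(name=artifact_name, artifact_object=outputs)
```

Inside the pipeline step you would then call `load_cached_outputs` with the id of the previous run, pass the result to `process_incrementally` together with the current item ids and your algorithm, and finish with `save_outputs`. With 1000 cached items and one new one, only the new item reaches `process_fn`. How you find the previous run's id (for example by searching tasks by project and name) is left open here.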

In the open source version, you don't get a whole lot (in my opinion) from using datasets over basic artifacts in tasks (scoped to just create a dataset). The real "power" in the datasets feature, I believe, comes with some of the pro features.

  				
Posted 9 months ago by SmallTurkey79
