Could I get some feedback from people with experience using ClearML pipelines on the best way to handle caching?

Could I get some feedback from people with experience using ClearML pipelines on the best way to handle caching? My team is working on configuring ClearML Pipelines for our data processing workflow.

We currently have an experimental pipeline configured for batch data processing. It runs a basic algorithm on each item provided as input, essentially just mapping each input piece of data to a new, processed output. However, the algorithm we run is somewhat expensive, and we want to cache as much computation as possible. If we run the pipeline with 1000 items from our ClearML Data dataset and then add another item, when we re-run the pipeline with those 1001 items as input we want to reuse all the previous computation and only have to process the single new item.

As far as I can tell, the built-in ClearML pipeline cache features will re-run the entire pipeline step if the input changes at all, so when the new item is added, the entire batch pipeline step will re-run with all 1001 items.
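(For reference, by the built-in cache I mean the step-level `cache_executed_step` flag, which reuses a previous execution of a step only when its inputs and code are identical; anything else re-runs the whole step. The step and function names below are just illustrative.)

```python
from clearml import PipelineController


def process_batch(dataset_id):
    # Placeholder for the expensive batch-processing step.
    ...


pipe = PipelineController(name="data-processing", project="examples", version="1.0")

# cache_executed_step=True reuses a prior execution of this step only when the
# step's inputs (and code) match exactly; a changed dataset re-runs everything.
pipe.add_function_step(
    name="process_batch",
    function=process_batch,
    function_kwargs=dict(dataset_id="<dataset id>"),
    function_return=["outputs"],
    cache_executed_step=True,
)
pipe.start_locally(run_pipeline_steps_locally=True)
```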

What are the best practices for handling this with ClearML? I’d really appreciate any information anyone can share about their experiences with this. Thank you :)

Posted 3 months ago

2 Answers


It sounds like you understand the limitations correctly.

As far as I know, it'd be up to you to write your own code that computes the delta between the old and new inputs and re-processes only the new entries.

The API would let you search through prior experimental results.

So you could load up the prior task, check the IDs that showed up in its output (maybe save these as a separate artifact for faster load times), and only evaluate the new inputs. You could also copy the old outputs over to the new task for completeness.

That's how I'd approach it: use "data-creation" tasks and artifacts to roll your own logic for "caching" (i.e. skipping evaluation) within the task itself.
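For illustration, here's a rough sketch of that pattern. It assumes each batch step runs as its own ClearML Task and that the previous run uploaded two artifacts, "processed_ids" and "outputs"; `expensive_process` and the project/task names are placeholders for your own code and naming:

```python
from clearml import Task


# Placeholder for the real (expensive) per-item algorithm.
def expensive_process(item):
    ...


def run_step(items):
    """items: dict mapping item id -> raw input data."""
    current = Task.current_task()

    # Most recently updated run of this step (may be None on the first run).
    previous = Task.get_task(
        project_name="data-processing",   # placeholder project name
        task_name="batch-process-step",   # placeholder step/task name
        allow_archived=False,
    )

    done_ids, old_outputs = set(), {}
    if previous is not None and previous.id != current.id:
        try:
            done_ids = set(previous.artifacts["processed_ids"].get())
            old_outputs = previous.artifacts["outputs"].get() or {}
        except KeyError:
            pass  # prior run didn't upload the artifacts: process everything

    # Delta: only evaluate inputs the previous run didn't cover.
    new_outputs = {
        k: expensive_process(v) for k, v in items.items() if k not in done_ids
    }

    # Carry the old results forward so this task holds the complete output set.
    outputs = {**old_outputs, **new_outputs}
    current.upload_artifact("processed_ids", list(outputs.keys()))
    current.upload_artifact("outputs", outputs)
    return outputs
```

Keeping the processed IDs as their own small artifact keeps the delta check cheap even when the outputs themselves are large.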

In the open source version, you don't get a whole lot (in my opinion) from using datasets over basic artifacts in tasks (scoped to just creating a dataset). The real "power" of the datasets feature, I believe, comes with some of the pro features.

Posted 3 months ago

Thank you, that’s super helpful! I’ll work on my own caching logic for tasks then. I appreciate all the information.

Posted 3 months ago