feat: implement persistent preprocessor cache by Ian2327 · Pull Request #1498 · scikit-hep/coffea

Ian2327 · 2025-12-09T18:13:00Z

Problem:
The current Coffea preprocessor recomputes the same preprocessed ROOT data on every run. Because these files change infrequently, this causes unnecessary CPU use and longer overall processing time. Preprocessing can also be a large portion of the total processing time, and since the processing stage can only start after preprocessing has finished, on a cluster with N workers it is likely that many workers will sit idle while waiting for the remaining workers to finish the preprocessing step.

Proposed Solution:
Introduce a persistent, file-backed cache to the preprocessor stage.
Key features in this PR:

A pickle-backed cache (.coffea_metadata_cache.pkl) stores per-file metadata after first access
Data are automatically reused on subsequent runs if no changes to file metadata are detected
Cache is loaded on Runner class initialization and written atomically after preprocessing
Fallback to existing in-memory cache is retained if the persistent cache cannot be loaded or has not been created
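
To make the behaviour listed above concrete, here is a minimal sketch of the pattern rather than the code in this PR. The helper names (`file_fingerprint`, `load_cache`, `save_cache_atomic`, `get_metadata`) are hypothetical, and the change check assumes local files where `os.stat` works; remote files (e.g. over XRootD) would need a different check.

```python
import os
import pickle
import tempfile


def file_fingerprint(path):
    """Return (size, mtime_ns) for a local file; assumed here as the change signal."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def load_cache(cache_path):
    """Load the pickle cache; return an empty dict if it is missing or unreadable."""
    try:
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)
    except Exception:
        # Any failure means "no persistent cache": the caller keeps working
        # with a plain in-memory dict.
        return {}
    return cache if isinstance(cache, dict) else {}


def save_cache_atomic(cache, cache_path):
    """Write to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)
        # os.replace swaps the file in one step, so readers never see a partial file.
        os.replace(tmp_path, cache_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def get_metadata(path, cache, compute):
    """Reuse the cached entry if the fingerprint matches, otherwise recompute."""
    fingerprint = file_fingerprint(path)
    entry = cache.get(path)
    if entry is not None and entry["fingerprint"] == fingerprint:
        return entry["metadata"]
    metadata = compute(path)
    cache[path] = {"fingerprint": fingerprint, "metadata": metadata}
    return metadata
```

Writing the temporary file into the same directory as the cache keeps the final rename on one filesystem, which is what allows `os.replace` to be atomic.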

Impact:
The persistent cache significantly reduces preprocessing time, as shown in the graph below from a run on the UGE cluster. The left side shows a run with the original code; the right side shows the same run with a filled persistent cache, which brings the preprocessing time down to just under 6 milliseconds.

Feedback:
Before merging, any feedback on any aspect of the cache such as cache location/naming, eviction policy, configuration options, etc. would be greatly appreciated.

nsmith- · 2025-12-09T20:48:58Z

This may address some of the scope of issue #1386

You may also be interested in the coffea.compute refactor #1470

nsmith- · 2025-12-15T14:22:35Z

Hi @Ian2327 can you resolve the conflicts and get the CI to pass? Thanks!


lgray · 2025-12-17T15:55:58Z

@Ian2327 Could you please add some unit/integration testing to ensure consistent functioning? Thanks!


nsmith- · 2026-01-12T14:43:01Z

Tests failing due to the new hist release ( https://github.com/scikit-hep/hist/releases/tag/v2.9.1 )

ikrommyd · 2026-02-23T14:46:02Z

Hi @Ian2327, I think we should not enforce this and create pickle files for people without them asking. Can you make this an optional feature? It's probably best if people opt into this and manually provide a path for the pickle file.
Also this currently has extremely extensive testing. Just a single simple test is enough for a feature like this.


ikrommyd · 2026-02-23T17:48:04Z

You are overcomplicating it. There is no need to have a boolean flag, a string path, and a path object. You just need one attribute on the Runner class that is either None or a string. If None, no metadata is cached. If a string, it can be an absolute or a relative path, and that location is used to cache the metadata.
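
A minimal sketch of that shape, for illustration only. `RunnerSketch`, `metadata_cache_path`, and `save_metadata_cache` are hypothetical names, not the actual Runner API, and the save step is a plain write to keep the example short.

```python
import pickle
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunnerSketch:
    """Hypothetical stand-in for the Runner class; only the caching option is modeled."""

    # None disables persistent caching; a string is an absolute or relative path.
    metadata_cache_path: Optional[str] = None
    metadata_cache: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.metadata_cache_path is None:
            return
        try:
            with open(self.metadata_cache_path, "rb") as f:
                loaded = pickle.load(f)
        except Exception:
            # Missing or unreadable file: keep the empty in-memory cache.
            return
        if isinstance(loaded, dict):
            self.metadata_cache.update(loaded)

    def save_metadata_cache(self):
        """Persist the cache if a path was given; return whether anything was written."""
        if self.metadata_cache_path is None:
            return False
        with open(self.metadata_cache_path, "wb") as f:
            pickle.dump(self.metadata_cache, f)
        return True
```

With the default of None nothing is written to disk, so existing users see no change unless they opt in by passing a path.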


feat: implement persistent preprocessor cache

ed5c340

Ian2327 and others added 3 commits December 15, 2025 09:44

Merge branch 'master' into persistent_preprocessor_cache

4ffdfc3

[pre-commit.ci] auto fixes from pre-commit.com hooks

f0b2e21

for more information, see https://pre-commit.ci

Merge branch 'master' into persistent_preprocessor_cache

d2de541

Ian Setia and others added 2 commits December 24, 2025 02:39

Added unit & integration tests

01bfbb1

[pre-commit.ci] auto fixes from pre-commit.com hooks

a6eba6d


Ian2327 and others added 2 commits January 12, 2026 16:16

Merge branch 'master' into persistent_preprocessor_cache

5157735

Merge branch 'master' into persistent_preprocessor_cache

8f20754

Ian Setia and others added 2 commits February 23, 2026 12:18

Added bool flag to Runner class to make pickle caching optional

087154e

[pre-commit.ci] auto fixes from pre-commit.com hooks

b96cd0e


Ian Setia and others added 3 commits February 23, 2026 22:31

Simplified to single string with file name with default to None

88a3cae

merge

5cf3bc0

[pre-commit.ci] auto fixes from pre-commit.com hooks

138a190



feat: implement persistent preprocessor cache #1498
Ian2327 wants to merge 13 commits into scikit-hep:master from Ian2327:persistent_preprocessor_cache


4 participants
