
feat: implement persistent preprocessor cache #1498

Open
Ian2327 wants to merge 13 commits into scikit-hep:master from Ian2327:persistent_preprocessor_cache


Ian2327 commented Dec 9, 2025

Problem:
The current Coffea preprocessor repeatedly recomputes the same preprocessed ROOT data on every run. Because these files change infrequently, this causes unnecessary CPU use and longer overall processing time. Preprocessing can also account for a large portion of the total processing time, and because the processing stage can only start after preprocessing has finished, on a cluster with N workers many workers are likely to sit idle while waiting for the remaining workers to finish the preprocessing step.

Proposed Solution:
Introduce a persistent, file-backed cache to the preprocessor stage.
Key features in this PR:

  • A pickle-backed cache (.coffea_metadata_cache.pkl) stores per-file metadata after first access
  • Data are automatically reused on subsequent runs if no changes to file metadata are detected
  • Cache is loaded on Runner class initialization and written atomically after preprocessing
  • Fallback to existing in-memory cache is retained if the persistent cache cannot be loaded or has not been created
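
The features listed above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function names (`file_fingerprint`, `load_cache`, `save_cache_atomic`, `get_metadata`) and the cache entry layout are hypothetical, and the change check assumes local files where size and modification time are available through `os.stat`. Remote files would need a different check.

```python
import os
import pickle
import tempfile


def file_fingerprint(path):
    """Hypothetical change detector: size and modification time of a local file."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def load_cache(cache_path):
    """Return the cached dict, or an empty dict if the file is missing or unreadable."""
    try:
        with open(cache_path, "rb") as f:
            cache = pickle.load(f)
    except Exception:
        # Any load failure falls back to an empty in-memory cache.
        return {}
    return cache if isinstance(cache, dict) else {}


def save_cache_atomic(cache, cache_path):
    """Write to a temporary file in the same directory, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(cache, f, protocol=pickle.HIGHEST_PROTOCOL)
        # os.replace swaps the file in one step, so readers never see a partial write.
        os.replace(tmp_path, cache_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def get_metadata(path, cache, compute):
    """Reuse the cached entry when the fingerprint matches, otherwise recompute."""
    fingerprint = file_fingerprint(path)
    entry = cache.get(path)
    if entry is not None and entry["fingerprint"] == fingerprint:
        return entry["metadata"]
    metadata = compute(path)
    cache[path] = {"fingerprint": fingerprint, "metadata": metadata}
    return metadata
```

Writing the temporary file into the same directory as the target keeps the final rename on one filesystem, which is what allows `os.replace` to be atomic.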

Impact:
The persistent cache significantly reduces preprocessing time, as shown in the graph below from a run on the UGE cluster. The left side shows a run with the original code, and the right side shows the same run with a filled persistent cache, which brings the preprocessing time down to just under 6 milliseconds.
image

Feedback:
Before merging, any feedback on any aspect of the cache such as cache location/naming, eviction policy, configuration options, etc. would be greatly appreciated.

Member

nsmith- commented Dec 9, 2025

This may address some of the scope of issue #1386

You may also be interested in the coffea.compute refactor #1470

Member

nsmith- commented Dec 15, 2025

Hi @Ian2327, can you resolve the conflicts and get the CI to pass? Thanks!

Collaborator

lgray commented Dec 17, 2025

@Ian2327 Could you please add some unit/integration testing to ensure consistent functioning? Thanks!

Member

nsmith- commented Jan 12, 2026

Tests are failing due to the new hist release (https://github.com/scikit-hep/hist/releases/tag/v2.9.1).

@ikrommyd
Collaborator

Hi @Ian2327, I don't think we should enforce this and create pickle files for people who haven't asked for them. Can you make this an optional feature? It's probably best if people opt in and manually provide a path for the pickle file.
Also, the testing here is currently far more extensive than needed. A single simple test is enough for a feature like this.

Collaborator

ikrommyd commented Feb 23, 2026

You are overcomplicating it. There is no need to have a boolean flag, a string path, and a path object. You just need one attribute on the Runner class that is either None or a string. If it is None, no metadata is cached. If it is a string, it can be an absolute or a relative path, and that location is used to cache the metadata.
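
A minimal sketch of the single-attribute design described in this comment might look like the following. `RunnerSketch`, `metadata_cache_path`, and `save_metadata_cache` are hypothetical names chosen for illustration; they are not coffea's actual Runner API or the names used in this PR.

```python
import os
import pickle
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunnerSketch:
    """Hypothetical stand-in for a runner class; not coffea's Runner."""

    # None disables persistent caching; a string is an absolute or relative path.
    metadata_cache_path: Optional[str] = None
    metadata_cache: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.metadata_cache_path is None:
            return  # opt-in only: no file is read or created
        try:
            with open(self.metadata_cache_path, "rb") as f:
                loaded = pickle.load(f)
        except Exception:
            return  # missing or unreadable file: keep the empty in-memory cache
        if isinstance(loaded, dict):
            self.metadata_cache.update(loaded)

    def save_metadata_cache(self):
        """Persist the cache if a path was given; return whether a file was written."""
        if self.metadata_cache_path is None:
            return False
        with open(self.metadata_cache_path, "wb") as f:
            pickle.dump(self.metadata_cache, f)
        return True
```

With this shape, `RunnerSketch()` never touches the filesystem, while `RunnerSketch(metadata_cache_path="cache.pkl")` reads and writes `cache.pkl` relative to the current working directory.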

