feat: implement persistent preprocessor cache #1498
Ian2327 wants to merge 13 commits into scikit-hep:master
|
Hi @Ian2327 can you resolve the conflicts and get the CI to pass? Thanks! |
|
@Ian2327 Could you please add some unit/integration testing to ensure consistent functioning? Thanks! |
|
Tests failing due to the new hist release ( https://github.com/scikit-hep/hist/releases/tag/v2.9.1 ) |
|
Hi @Ian2327, I think we should not enforce this and create pickle files for people without them asking. Can you make this an optional feature? It's probably best if people opt into this and manually provide a path for the pickle file. |
|
You are overcomplicating it. There is no need to have a boolean flag and a string path and a path object. You just have an attribute of the Runner class that is either a None or a string. If None, no metadata is cached. If a string, it can be an absolute or a relative path and it uses that location to cache the metadata. |
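The single-attribute design suggested above could be sketched roughly as follows. This is only an illustration: `RunnerSketch`, `metadata_cache_path`, `load_metadata_cache`, and `save_metadata_cache` are hypothetical names chosen for the example, not the actual coffea `Runner` API or the code in this PR.

```python
import os
import pickle
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunnerSketch:
    """Hypothetical stand-in for the Runner class (not coffea's real API)."""

    # None: no metadata is cached.
    # str: absolute or relative path of the file used to cache the metadata.
    metadata_cache_path: Optional[str] = None

    def load_metadata_cache(self) -> dict:
        """Return cached metadata, or an empty dict if caching is off or the file is missing."""
        if self.metadata_cache_path is None:
            return {}
        path = os.path.abspath(self.metadata_cache_path)
        if not os.path.exists(path):
            return {}
        with open(path, "rb") as f:
            return pickle.load(f)

    def save_metadata_cache(self, metadata: dict) -> None:
        """Write metadata to the cache file; do nothing when caching is off."""
        if self.metadata_cache_path is None:
            return
        path = os.path.abspath(self.metadata_cache_path)
        # Create the parent directory so a relative or nested path works.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(metadata, f)
```

With this shape, opting in is a matter of passing a path, for example `RunnerSketch(metadata_cache_path="cache/metadata.pkl")`, and the default of `None` leaves the existing behavior untouched.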
Problem:
The current Coffea preprocessor repeatedly recomputes the same preprocessed ROOT data on every run. Because these files rarely change, this causes unnecessary CPU use and longer overall processing time. Preprocessing can also account for a large portion of the total processing time, and since the processing stage can only start after preprocessing has finished, on a cluster with N workers many workers are likely to sit idle while waiting for the remaining workers to finish the preprocessing step.
Proposed Solution:
Introduce a persistent, file-backed cache to the preprocessor stage.
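As a rough illustration of the idea (not the code in this PR), a file-backed cache can be sketched as a pickled dictionary keyed by file name, where only files missing from the cache are preprocessed. The function name `preprocess_with_cache` and the `compute_metadata` callback are assumptions made for this example. The sketch also assumes that a file name alone identifies its metadata, so it would not notice a file whose contents changed.

```python
import pickle
from pathlib import Path
from typing import Callable, Dict, Iterable, Union


def preprocess_with_cache(
    filenames: Iterable[str],
    compute_metadata: Callable[[str], dict],
    cache_file: Union[str, Path],
) -> Dict[str, dict]:
    """Return per-file metadata, computing it only for files not yet cached."""
    filenames = list(filenames)
    cache_path = Path(cache_file)

    # Load the existing cache from disk, if there is one.
    cache: Dict[str, dict] = {}
    if cache_path.exists():
        with cache_path.open("rb") as f:
            cache = pickle.load(f)

    # Only files absent from the cache go through the expensive step.
    missing = [name for name in filenames if name not in cache]
    for name in missing:
        cache[name] = compute_metadata(name)

    # Persist new entries. Writing to a temporary file and renaming it
    # avoids leaving a half-written cache behind if the run is interrupted.
    if missing:
        tmp_path = cache_path.parent / (cache_path.name + ".tmp")
        with tmp_path.open("wb") as f:
            pickle.dump(cache, f)
        tmp_path.replace(cache_path)

    return {name: cache[name] for name in filenames}
```

On a second run with the same file list, `compute_metadata` is never called and the result comes entirely from disk, which is the effect the cache is meant to achieve.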
Key features in this PR:
A cache file (.coffea_metadata_cache.pkl) stores per-file metadata after first access
Impact:

The persistent cache significantly reduces preprocessing time, as shown in the graph below from a run on the UGE cluster. The left side shows a run with the original code, and the right side shows the same run with a filled persistent cache, which brings the preprocessing time down to just under 6 milliseconds.
Feedback:
Before merging, feedback on any aspect of the cache (location/naming, eviction policy, configuration options, etc.) would be greatly appreciated.