feat: pydantic preprocessing by NJManganelli · Pull Request #1528 · scikit-hep/coffea

NJManganelli · 2026-03-09T21:11:00Z

This continues the series of PRs threading pydantic input models through coffea, and specifically enables preprocessing parquet files via a pydantic input path. The old preprocess is renamed preprocess_legacy, and preprocess now attempts to coerce user inputs to pydantic classes (with an escape hatch, preprocess_legacy_root=True). The pydantic preprocess is made a nonpublic function, with two user-facing variants, one for root and one for parquet, which call it appropriately.
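As a rough illustration of the dispatch described above, the following minimal sketch shows one way a single entry point could coerce dict input into typed specs and fall back to a legacy path on request. Everything in it is hypothetical: `FileSpecSketch`, `preprocess_sketch`, `preprocess_legacy_sketch`, and the `use_legacy` flag are stand-ins, plain dataclasses replace pydantic models to keep the example dependency-free, and none of it is the actual coffea code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileSpecSketch:
    """Hypothetical stand-in for a validated (pydantic-style) file entry."""

    object_path: Optional[str]  # tree name for ROOT files, None for parquet
    format: str  # "root" or "parquet"


def _coerce(fileset: dict) -> dict:
    """Turn {dataset: {"files": {path: object_path}}} into typed specs."""
    coerced = {}
    for dataset, info in fileset.items():
        coerced[dataset] = {
            path: FileSpecSketch(
                object_path=object_path,
                # Assumption: the file format is inferred from the extension.
                format="parquet" if path.endswith(".parquet") else "root",
            )
            for path, object_path in info["files"].items()
        }
    return coerced


def _preprocess_typed(specs: dict) -> dict:
    """Non-public worker: here it only reports the formats seen per dataset."""
    return {
        dataset: sorted({spec.format for spec in files.values()})
        for dataset, files in specs.items()
    }


def preprocess_legacy_sketch(fileset: dict) -> dict:
    """Stand-in for the dict-based path: no coercion is attempted."""
    return {"path": "legacy", "datasets": sorted(fileset)}


def preprocess_sketch(fileset: dict, use_legacy: bool = False) -> dict:
    """Dispatch to the legacy path on request, else coerce and go typed."""
    if use_legacy:
        return preprocess_legacy_sketch(fileset)
    return {"path": "typed", "formats": _preprocess_typed(_coerce(fileset))}
```

Calling `preprocess_sketch(fileset)` takes the typed path, while `preprocess_sketch(fileset, use_legacy=True)` leaves the dict untouched and hands it to the legacy stand-in.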

This partially pulls in updates from https://github.com/NJManganelli/coffea/tree/datafactory_parquet and some associated prototyping in another branch which did not get incorporated into #1403

This walks back changes to preprocess, to make the code friendlier and to ease a possible future deprecation of the dict-based preprocess.

One known fix for parquet nanoevents is incorporated, but the remainder of changes needed to support nanoevents is left to followup PRs to avoid too much scope creep

Copilot

Pull request overview

This PR continues the migration of dataset preprocessing toward Pydantic-based DataGroupSpec/DatasetSpec inputs, adding parquet preprocessing support and retaining a legacy dict-based ROOT preprocessing path via an escape hatch.

Changes:

Split preprocessing into legacy dict-based (preprocess_legacy) and Pydantic-based root/parquet (preprocess_root, preprocess_parquet) implementations, with preprocess orchestrating coercion and dispatch.
Extend preprocessing to support parquet inputs (form/uuid/steps extraction via parquet metadata).
Update tests and fixtures to cover legacy vs Pydantic paths and new form key behaviors.
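To make the "steps via parquet metadata" item more concrete, here is a minimal sketch of deriving entry ranges from per-row-group row counts. It assumes that step boundaries follow row-group boundaries, which this page does not confirm; `steps_from_row_groups` and `target_rows` are hypothetical names, and the row counts are passed in as a plain list rather than read from a real file.

```python
def steps_from_row_groups(row_group_sizes, target_rows=None):
    """Build [start, stop] entry ranges from parquet row-group row counts.

    With target_rows=None every non-empty row group becomes one step.
    Otherwise consecutive row groups are merged until a step holds at
    least target_rows entries; a shorter remainder forms the last step.
    """
    steps = []
    start = 0
    stop = 0
    for n_rows in row_group_sizes:
        stop += n_rows
        if target_rows is None or stop - start >= target_rows:
            if stop > start:  # skip empty row groups
                steps.append([start, stop])
            start = stop
    if stop > start:  # leftover rows that never reached target_rows
        steps.append([start, stop])
    return steps
```

A file with no rows yields an empty list of steps in this sketch, which is one possible way to represent empty files.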

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.


| File | Description |
| --- | --- |
| `src/coffea/dataset_tools/preprocess.py` | Introduces legacy vs Pydantic preprocessing split, adds parquet preprocessing, and updates form key handling. |
| `src/coffea/dataset_tools/filespec.py` | Updates DatasetSpec ingestion/promotion logic and adds equality override. |
| `src/coffea/dataset_tools/manipulations.py` | Adjusts `filter_files` to preserve/return `PreprocessedFiles` for Pydantic datasets. |
| `src/coffea/dataset_tools/__init__.py` | Exports the newly introduced preprocessing entry points. |
| `src/coffea/nanoevents/factory.py` | Adds compatibility fallback for dask opener call signatures. |
| `tests/test_dataset_tools.py` | Expands test matrix for legacy vs Pydantic preprocess behavior and save_form handling. |
| `tests/samples/fileset_with_empty_files_compressed_form_base.json` | Adds a compressed-form fixture used by preprocessing tests. |



NJManganelli · 2026-03-10T22:27:42Z

@nsmith- @lgray Tests pass locally so hopefully the same in CI, then this is ready for human review. I added a pass via Claude that caught a few additional improvement opportunities. @ikrommyd also welcome to have a look

I've left in one update for NanoEventsFactory from my much older experimentation to restore parquet paths (most of which Iason added in along the way), though it does not fully restore parquet-dask mode (removing the NotImplementedError would give you a dak array, without the form-mapping properly applied). Is awkward-zipper far enough along that it could be experimentally used for this path? As we know, there are no users of this right now.

This PR doesn't try threading pydantic classes through the Runner as an option (could be a nice mini-project to bypass preprocessing via already Preprocessed pydantic types), nor does it update any Factory interfaces for inputs with parquet or pydantic filespecs, which could be warranted. Could also push further in the direction of method-chaining with e.g. DatasetSpec.events(schemaclass=NanoAODSchema, mode="virtual").apply.... Regardless, any and all such additions are left to other PRs.


NJManganelli requested a review from Copilot March 9, 2026 21:11

Copilot started reviewing on behalf of NJManganelli March 9, 2026 21:11

Copilot AI reviewed Mar 9, 2026

Nick Manganelli and others added 26 commits April 2, 2026 21:12

Uniformize parquet handling by adding the equivalent of functions for determining the form, steps, uuid if available

e872b50

Pieces missed from commit 4b87500 in datafactory_parquet branch

a781e27

The try-except block for mapping when utilizing parquet

0acc6bc

pre-commit fixes

407aa65

Start adapting to pydantic path only in _preprocess_parquet

3fea830

DataGroupSpec only input for parquet preprocessing

5847ee6

pre-commit fixes

9760685

break preprocess into legacy, parquet, root, and general-dispatching versions, with appropriate options and doc strings

27ea379

fixup formatting

b28c6f2

The nasty part: handling different legacy input formats as seamlessly as possible

67b3d1d

form vs compressed_form, update tests for preprocessing and parametrize with explicit preprocess_legacy_root and with legacy and pydantic inputs

c02dc50

Expose preprocess_legacy_root, preprocess_root, and preprocess_parquet; preprocess dispatches to legacy or pydantic appropriately

c4fc75a

Any color you like

cee990f

So long as it's ...

4747e8a

Fix some promotion logic in filespec.py

1941db2

PreprocessedFiles is the correct function with demotion to InputFiles

86686a2

Make get_steps work with backcompat mode or updated expectation of 'compressed_form'; some manipulation for when pydantic filespec is processed via legacy_preprocess_root function

560b2ef

compressed_form is not 100% deterministic, although the decoded form is stable, so define new equality function that skips the compressed_form

20ed0a8

Update tests for empty files with most variations of preprocessing path, save_form

0d45428

[pre-commit.ci] auto fixes from pre-commit.com hooks

232d343

for more information, see https://pre-commit.ci

fix form --> compressed_form in empty files

84246ad

copilot's fix for mistake in __eq__

14569ac

Just delete unused assertions

355b636

Give up on mixed pydantic classes fully for now

a1560e6

docstring fix from copilot

53264bd

[pre-commit.ci] auto fixes from pre-commit.com hooks

3b21f05

for more information, see https://pre-commit.ci

Nick Manganelli and others added 6 commits April 2, 2026 21:12

Adopt suggested DeprecationWarning style

9b84882

Copilot recommended alt to using try-Except block

73afde2

[pre-commit.ci] auto fixes from pre-commit.com hooks

8058cbd

for more information, see https://pre-commit.ci

compress_form fallback bug

d7ba3a3

Claude: restrict Exceptions in filespec

6cd7729

Claude: pydantic docstring, fix for potential KeyError with empty datasets when recombining root and parquet mixed DataGroupSpec DatasetSpecs

48458c3

NJManganelli force-pushed the pydantic_preprocessing_rebase branch from 967e81a to 48458c3 Compare April 3, 2026 02:12
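One commit above notes that compressed_form is not fully deterministic while the decoded form is stable, and that a new equality function therefore skips it. A minimal sketch of that idea follows, using a plain dataclass and `zlib` as stand-ins. `FileInfoSketch` and its fields are hypothetical, and the compression scheme coffea actually uses is not stated on this page.

```python
import zlib
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(eq=False)  # eq=False: the comparison is written by hand below
class FileInfoSketch:
    """Hypothetical stand-in for a preprocessed file entry."""

    steps: list
    uuid: str
    compressed_form: Optional[bytes] = None

    def __eq__(self, other):
        if not isinstance(other, FileInfoSketch):
            return NotImplemented
        # Compare every field except the compressed bytes, which may
        # differ between runs even when the decoded form is identical.
        return all(
            getattr(self, f.name) == getattr(other, f.name)
            for f in fields(self)
            if f.name != "compressed_form"
        )


# Two compressions of the same text at different levels: the bytes differ
# (the zlib header records the level) but both decode to the same form.
form_text = b'{"class": "RecordArray", "fields": ["pt", "eta"]}'
a = FileInfoSketch([[0, 10]], "abc", zlib.compress(form_text, 1))
b = FileInfoSketch([[0, 10]], "abc", zlib.compress(form_text, 9))
```

Here `a == b` holds even though `a.compressed_form != b.compressed_form`. A stricter variant could also compare the decompressed forms, at the cost of decoding on every comparison.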

NJManganelli wants to merge 32 commits into scikit-hep:master from NJManganelli:pydantic_preprocessing_rebase
