feat: pydantic preprocessing #1528
Conversation
There was a problem hiding this comment.
Pull request overview
This PR continues the migration of dataset preprocessing toward Pydantic-based DataGroupSpec/DatasetSpec inputs, adding parquet preprocessing support and retaining a legacy dict-based ROOT preprocessing path via an escape hatch.
Changes:
- Split preprocessing into legacy dict-based (
preprocess_legacy) and Pydantic-based root/parquet (preprocess_root,preprocess_parquet) implementations, withpreprocessorchestrating coercion and dispatch. - Extend preprocessing to support parquet inputs (form/uuid/steps extraction via parquet metadata).
- Update tests and fixtures to cover legacy vs Pydantic paths and new form key behaviors.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
src/coffea/dataset_tools/preprocess.py |
Introduces legacy vs Pydantic preprocessing split, adds parquet preprocessing, and updates form key handling. |
src/coffea/dataset_tools/filespec.py |
Updates DatasetSpec ingestion/promotion logic and adds equality override. |
src/coffea/dataset_tools/manipulations.py |
Adjusts filter_files to preserve/return PreprocessedFiles for Pydantic datasets. |
src/coffea/dataset_tools/__init__.py |
Exports the newly introduced preprocessing entry points. |
src/coffea/nanoevents/factory.py |
Adds compatibility fallback for dask opener call signatures. |
tests/test_dataset_tools.py |
Expands test matrix for legacy vs Pydantic preprocess behavior and save_form handling. |
tests/samples/fileset_with_empty_files_compressed_form_base.json |
Adds a compressed-form fixture used by preprocessing tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@nsmith- @lgray Tests pass locally so hopefully the same in CI, then this is ready for human review. I added a pass via Claude that caught a few additional improvement opportunities. @ikrommyd also welcome to have a look I've left in one update for NanoEventsFactory from my much older experimentation to restore parquet paths (most of which Iason added in along the way), though it's not complete in restoring parquet-dask mode (removing the NotImplementedError would give you a dak array, without the form-mapping properly applied). Is awkward-zipper far enough along that it could be experimentally used for this path? As we know there's no users of this right now. This PR doesn't try theading pydantic classes through the Runner as an option (could be a nice mini-project to bypass preprocessing via already Preprocessed pydantic types), nor does it update any Factory interfaces for inputs with parquet or pydantic filespecs, which could be warranted. Could also push further in the direction of method-chaining with e.g. |
… determining the form, steps, uuid if available
…versions, with appropriate options and doc strings
…ze with explicit preprocess_legacy_root and with legacy and pydantic inputs
…t; preprocess dispatches to legacy or pydantic appropriately
…ompressed_form'; some manipulation for when pydantic filespec is processed via legacy_preprocess_root function
…is stable, so define new equality function that skips the compressed_form
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
…asets when recombining root and parquet mixed DataGroupSpec DatasetSpecs
967e81a to
48458c3
Compare
This continues the series of PRs to make pydantic input models thread through coffea, and specifically enables preprocessing parquet files via a pydantic input path. The old preprocess is renamed preprocess_legacy, and preprocess now attempts to coerce users to pydantic classes (with an escape hatch, preprocess_legacy_root=True). The pydantic preprocess is made a nonpublic function, with two user-facing variants for root and parquet which call it appropriately.
This partially pulls in updates from https://github.com/NJManganelli/coffea/tree/datafactory_parquet and some associated prototyping in another branch which did not get incorporated into #1403
This walks back changes to preprocess, to make the code friendlier and make it easier should we deprecate the dict-based preprocess in the future
One known fix for parquet nanoevents is incorporated, but the remainder of changes needed to support nanoevents is left to followup PRs to avoid too much scope creep