feat: adding split_fileset functionality and Result output #1534
hooloobooroodkoo wants to merge 14 commits into scikit-hep:master from
Conversation
Collaborator

I'd say we should probably proceed with this one instead of #1533 but make this new result type opt-in via a keyword argument in
Closes #1532
a. This PR adds two simple functions to the `coffea.dataset_tools` module: `split_fileset` and `hash_fileset`. The idea is to give users more control over how they execute analysis on a fileset, and to get a partial result instead of nothing if something breaks (e.g., one file is broken). If this PR is accepted and merged (or the alternative one I'm going to open next), I'll write documentation with a kind of "Best practices" guide on how to use it, with the examples below.
`split_fileset()` allows choosing a strategy for how to split the fileset into parts (~chunks, but higher-level chunks, not the usual coffea chunks; one chunk is a unique subset of files from the fileset). It returns a list of partial filesets. This function accepts:

- `fileset`: `{dataset: {"files": {path: treename, ...}}}`
- `strategy`: `"by_dataset"` — one dataset per chunk; `None` — all datasets together
- `percentage`: an integer that divides 100 evenly (20, 25, 50, ...). If `strategy="by_dataset"`, each dataset is split into `100/percentage` chunks; otherwise the whole fileset is split into `100/percentage` chunks where each chunk gets that percentage of each dataset's files
- `datasets`: a list, callable, or tuple of dataset names

This gives users the ability to write analysis like this:
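As a rough illustration of the splitting behavior described above (this is not the PR's actual implementation, and the exact signature is an assumption), the idea can be sketched as:

```python
# Illustrative sketch of the split_fileset idea; NOT the PR's actual code.
def split_fileset(fileset, strategy=None, percentage=None, datasets=None):
    """Return a list of partial filesets (each a unique subset of files)."""
    if datasets is not None:
        fileset = {k: v for k, v in fileset.items() if k in datasets}
    n_parts = 100 // percentage if percentage else 1

    def split_files(files):
        items = list(files.items())
        size = max(1, len(items) // n_parts)
        return [dict(items[i:i + size]) for i in range(0, len(items), size)]

    if strategy == "by_dataset":
        # one dataset per chunk, each dataset split into n_parts pieces
        return [
            {name: {"files": part}}
            for name, spec in fileset.items()
            for part in split_files(spec["files"])
        ]
    # default: every chunk gets `percentage` percent of each dataset's files
    per_dataset = {name: split_files(spec["files"]) for name, spec in fileset.items()}
    return [
        {name: {"files": parts[i]} for name, parts in per_dataset.items() if i < len(parts)}
        for i in range(n_parts)
    ]

# Usage pattern: process each chunk independently, so a broken file only
# costs its own chunk. `process_chunk` is a hypothetical stand-in for a
# real Runner/processor call, and the dataset names are made up.
fileset = {
    "ZJets": {"files": {f"zjets_{i}.root": "Events" for i in range(4)}},
    "Data": {"files": {f"data_{i}.root": "Events" for i in range(4)}},
}

def process_chunk(chunk):  # placeholder for the expensive processor run
    return {name: len(spec["files"]) for name, spec in chunk.items()}

partial_results = []
for chunk in split_fileset(fileset, strategy="by_dataset", percentage=50):
    try:
        partial_results.append(process_chunk(chunk))
    except Exception as err:
        print(f"skipping broken chunk: {err}")
```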
If one or several chunks contain a broken file, a partial result will still be returned for the remaining chunks.
`hash_fileset` allows creating a unique filename for a processed chunk based on its file paths and dataset names. This is useful when you want to preserve partial results and only rerun the analysis on missing chunks. The recommended approach in a Jupyter notebook, for example, would be: after the first run, the partial result is saved. The user can then simply rerun the same cell; cached partial results are loaded, and the processor runs only on the missing chunks.
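The caching pattern can be sketched like this; the hash construction and the `run_cached` helper are illustrative assumptions, not the PR's exact algorithm:

```python
import hashlib

# Illustrative sketch: derive a stable, unique filename from a chunk's
# dataset names and file paths. Digest choice and naming are assumptions.
def hash_fileset(chunk):
    payload = []
    for dataset in sorted(chunk):
        payload.append(dataset)
        payload.extend(sorted(chunk[dataset]["files"]))
    digest = hashlib.sha256("\n".join(payload).encode()).hexdigest()
    return f"partial_{digest[:16]}.coffea"

# Notebook-cell pattern: only rerun chunks whose output is missing.
# A dict stands in for result files on disk here.
cache = {}

def run_cached(chunk, process):
    key = hash_fileset(chunk)
    if key not in cache:          # first run: compute and save
        cache[key] = process(chunk)
    return cache[key]             # rerun: loaded from cache instead
```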
b. This PR also introduces an optional Rust-inspired `Result` return type for `coffea.processor.Runner`. Instead of returning an `Accumulatable` or raising an error in case of failure, it can now return a `Result` object that is either `Ok(Accumulatable)` or `Err(Exception)`. In the `Runner`, the user should specify a new flag, `use_result_type`. Then the user can decide what to do with errors, and the code would look the following way:
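A minimal sketch of the Rust-inspired `Result` type and the calling pattern it enables; coffea's actual class and the runner call shown in the comment may differ from this assumption:

```python
# Minimal sketch of a Rust-style Result: Ok wraps the accumulated output,
# Err wraps the exception. NOT coffea's actual implementation.
class Result:
    def __init__(self, value=None, error=None):
        self.value, self.error = value, error

    @classmethod
    def Ok(cls, value):
        return cls(value=value)

    @classmethod
    def Err(cls, error):
        return cls(error=error)

    def is_ok(self):
        return self.error is None

    def unwrap(self):
        # re-raise the stored exception, mirroring Rust's unwrap()
        if self.error is not None:
            raise self.error
        return self.value

# With use_result_type=True the runner would return such an object instead
# of raising, and the caller decides how to handle failures:
result = Result.Ok({"sumw": 42.0})  # stand-in for a runner(...) call
if result.is_ok():
    out = result.unwrap()
else:
    print(f"run failed, keeping earlier partial results: {result.error}")
```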
New API