Limit EIA-861 years in the fast ETL by zaneselvans · Pull Request #4568 · catalyst-cooperative/pudl

zaneselvans · 2025-08-26T02:27:15Z

Overview

For a long time we have processed all years of the EIA-861 data in both the full and fast ETL. Initially this was a hack to get around the fact that some of the tables / columns are discontinued and so result in entirely null columns if you just process the last year or two of data, and our old data validation tests couldn't accommodate that expectation. But with #4105 / #4382 that's no longer a problem.

This PR switches to processing just a couple of years of data for the EIA-861 in the Fast ETL like we do with all of the other datasets. This will speed up the fast ETL a bit, and reduce the amount of data required for the CI.

Closes #2628

Documentation

Make sure to update relevant aspects of the documentation:

Update the release notes: reference the PR and related issues.

Testing

Rematerialized all my EIA-861 assets locally in both full and fast tests.
Ran dbt build --select "source:pudl.core_eia861*" --target etl-fast against the fast ETL outputs.

To-do list

Update the EIA-861 transforms to allow only a few years to be processed.
Update the dbt expect_columns_not_all_null test parameters so they don't fail on just a few years of data.
Refactor BA association table repair in the FERC-714 outputs to not rely on always having all years of data.
Review the PR yourself and call out any questions or issues you have.

zaneselvans · 2025-08-26T02:34:46Z

@e-belfer The part of this I was already familiar with was a quick fix, but now in the FERC-714 Outputs I see that the BA association fixes depend on always having all years of EIA-861 data available, and they feel a little inscrutable to me. I don't know if any of that context is still rattling around inside your brain but if so, do you have any thoughts on how this might be refactored to work cleanly when there's only a subset of the EIA-861 years available?

e-belfer · 2025-08-26T13:56:41Z

> @e-belfer The part of this I was already familiar with was a quick fix, but now in the FERC-714 Outputs I see that the BA association fixes depend on always having all years of EIA-861 data available, and they feel a little inscrutable to me. I don't know if any of that context is still rattling around inside your brain but if so, do you have any thoughts on how this might be refactored to work cleanly when there's only a subset of the EIA-861 years available?

I've never looked at this part of the code before, so I'm just as lost as you are! We could probably add more conditionals in there, but that would make it even jankier, so maybe that section of the code needs a more thorough refactor? I don't, however, have any immediate insight into the best way forward.

zaneselvans · 2025-08-26T14:55:59Z

@e-belfer Oh weird. Git was saying that all of this code came from you. But looking at the actual PR #2550 it looks like it was a mass cut-and-paste in the general Dagster re-organization.

The ASSOCIATIONS defining the fixes were created by Ethan, so maybe this is actually his code. So I guess we just have to figure it out. But it does seem pretty brittle / complex as it is.

krivard

Mostly minor nudges. The ferc714 bits are a real trip.

It's not clear to me where data_maturity should get added to empty data frames, but we probably shouldn't try to do it twice?

krivard · 2025-08-27T20:02:40Z

src/pudl/transform/eia861.py

+    # If we're only processing some years of data, we may have entirely empty dataframes
+    # in the extraction phase, in which case the data_maturity field doesn't get added.
+    if "data_maturity" not in df.columns and len(df) == 0:
+        df["data_maturity"] = pd.NA


nit, because this function is already here, so we might as well use it: putting an extraction step in the transform module is a weird architectural choice

zaneselvans · 2025-09-02

I poked at trying to add data_maturity in the extract where it's being set for non-empty dataframes, and didn't see an ideal place, since the code that does that work doesn't get run at all if the dataframe is going to end up empty. Instead I moved it into the _pre_process() function in this module, which means it doesn't need to be anywhere else in here. The only other changes now are to accommodate the possibility of having zero rows in division.
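For readers following along, the idea of guaranteeing the column in one shared pre-processing step might look roughly like the sketch below. It is only an illustration under assumptions: ensure_data_maturity is a hypothetical helper name, the column names in the example frame are made up, and this is not the actual body of _pre_process() in PUDL.

```python
import pandas as pd


def ensure_data_maturity(df: pd.DataFrame) -> pd.DataFrame:
    """Return df with a data_maturity column, adding an all-null one if absent.

    Hypothetical helper for illustration; not the actual PUDL _pre_process().
    """
    if "data_maturity" in df.columns:
        return df
    # A nullable string column keeps the dtype stable even with zero rows.
    return df.assign(
        data_maturity=pd.Series(pd.NA, index=df.index, dtype="string")
    )


# An extracted table that came back empty because its years were not requested.
empty = pd.DataFrame(columns=["utility_id_eia", "report_date"])
fixed = ensure_data_maturity(empty)
```

Doing this once, before any transform function runs, means later code can reference data_maturity without its own guard, which is the duplication the review comment above is worried about.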

krivard · 2025-08-27T20:09:58Z

src/pudl/transform/eia861.py

+    # This happens if the extracted dataframe was empty, as is the case in the fast ETL.
+    if "data_maturity" not in transformed_dsm1.columns:
+        transformed_dsm1["data_maturity"] = pd.NA


Doesn't _post_process handle this?

zaneselvans · 2025-08-27

Unfortunately, there's an explicit reference to the data_maturity field within the transform function, so it needs to happen before we get to _post_process.

I suspect the "right" place to ensure that data maturity is added (even in the case of empty dataframes) is in the EIA Excel extractor... Maybe I should look into that more deeply.

krivard · 2025-08-27T20:18:51Z

src/pudl/output/ferc714.py

    dfi = df.set_index(index)
    # Prepare reference rows
-    keys = [(fix["id"], pd.Timestamp(fix["from"], 1, 1)) for fix in ASSOCIATIONS]
+    eia861_years = df["report_date"].dt.year.unique()


Consider adding an early exit here if eia861_years is empty

krivard · 2025-08-27T20:23:01Z

src/pudl/output/ferc714.py

+    keys = [
+        (fix["id"], pd.Timestamp(fix["from"], 1, 1))
+        for fix in ASSOCIATIONS
+        if fix["from"] in eia861_years
+    ]


Consider defining a local version of ASSOCIATIONS to use both here and at line 209

krivard · 2025-08-27T20:24:21Z

src/pudl/output/ferc714.py

+        refs, [fx for fx in ASSOCIATIONS if fx["from"] in eia861_years], strict=True
+    ):
        for year in range(fix["to"][0], fix["to"][1] + 1):
            key = (fix["id"], pd.Timestamp(year, 1, 1))


would it make sense to skip iterations here when year is not in eia861_years, instead of filtering at 216?
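The three review suggestions on this function (an early exit when no EIA-861 years are present, a single locally filtered copy of ASSOCIATIONS, and skipping target years that aren't available) could be combined roughly as follows. This is a minimal sketch, not the code in the PR: the ASSOCIATIONS entries here are made up, plan_fixes is a hypothetical name, datetime.date stands in for pd.Timestamp so the example has no dependencies, and whether skipping unavailable target years is the desired behavior is exactly the open question raised above.

```python
from datetime import date

# Hypothetical stand-in for the real ASSOCIATIONS constant; values are made up.
ASSOCIATIONS = [
    {"id": 1, "from": 2020, "to": (2021, 2023)},
    {"id": 2, "from": 2015, "to": (2013, 2014)},
]


def plan_fixes(associations, available_years):
    """Pair each usable reference key with the target keys it could fill.

    Only fixes whose reference ("from") year is available are kept, and only
    target years that are also available are emitted.
    """
    years = set(available_years)
    if not years:
        # Early exit: with no EIA-861 years there is nothing to repair.
        return []
    # Filter once so the reference keys and the fix loop cannot drift apart.
    local_associations = [fix for fix in associations if fix["from"] in years]
    pairs = []
    for fix in local_associations:
        ref_key = (fix["id"], date(fix["from"], 1, 1))
        for year in range(fix["to"][0], fix["to"][1] + 1):
            if year not in years:
                continue  # Skip target years that were not processed.
            pairs.append((ref_key, (fix["id"], date(year, 1, 1))))
    return pairs


pairs = plan_fixes(ASSOCIATIONS, [2020, 2023])
```

With only 2020 and 2023 available, the second fix is dropped because its reference year (2015) is missing, and the first fix only targets 2023.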

zaneselvans · 2025-08-27T23:24:52Z

Anybody else looking at this -- the FERC-714 stuff doesn't actually work right now. I tried to make some little tweaks, but eventually came to understand that a bigger refactor of the spot-fix application (which deeply depends on having all years of data available) is necessary to make this work (and probably desirable just in general since it's so complex as it is)

e-belfer · 2025-12-23T21:49:15Z

The ASSOCIATIONS method originates from #618, and was created in #881. I'm planning on working backwards and converting these to manual spot-fixes targeting the values we want to include, rather than back or forward filling based on different years of the tables.

zaneselvans added 4 commits August 25, 2025 15:02

Restrict EIA-861 years to 2020 & 2023 in fast ETL.

2c23157

Adjust EIA-861 transforms + dbt cols_not_all_null params to allow limiting EIA-861 years in fast ETL.

37ee8cf

Update comment in ETL Fast settings wrt EIA-861 years

a2449b8

WIP: refactor FERC-714 outputs to not depend on all years of EIA-861 data.

4540b26

zaneselvans added this to Catalyst Megaproject Aug 26, 2025

zaneselvans added the testing, ferc714, eia861, and performance labels Aug 26, 2025

github-project-automation bot moved this to New in Catalyst Megaproject Aug 26, 2025

zaneselvans requested a review from e-belfer August 26, 2025 02:27

zaneselvans self-assigned this Aug 26, 2025

zaneselvans moved this from New to In progress in Catalyst Megaproject Aug 26, 2025

zaneselvans added 2 commits August 26, 2025 14:37

Merge branch 'main' into limit-eia861-fast-etl-years

c4dac29

Merge branch 'main' into limit-eia861-fast-etl-years

93258e9

zaneselvans requested review from jdangerx and krivard and removed request for e-belfer August 27, 2025 16:05

krivard reviewed Aug 27, 2025

Merge branch 'main' into limit-eia861-fast-etl-years

fcaec32

zaneselvans added 6 commits August 28, 2025 00:14

Merge branch 'main' into limit-eia861-fast-etl-years

36cb979

Merge branch 'main' into limit-eia861-fast-etl-years

9af11c0

Merge branch 'limit-eia861-fast-etl-years' of github.com:catalyst-cooperative/pudl into limit-eia861-fast-etl-years

ce7411c

Merge branch 'main' into limit-eia861-fast-etl-years

4c2d06c

Merge branch 'main' into limit-eia861-fast-etl-years

4b4d291

Merge branch 'main' into limit-eia861-fast-etl-years

8b30f10

zaneselvans and others added 19 commits October 22, 2025 14:40

Merge branch 'main' into limit-eia861-fast-etl-years

5d7945f

Merge branch 'main' into limit-eia861-fast-etl-years

88e826b

Merge branch 'main' into limit-eia861-fast-etl-years

64c8605

Merge branch 'main' into limit-eia861-fast-etl-years

b6ce11b

Merge branch 'main' into limit-eia861-fast-etl-years

0f798e9

Merge branch 'main' into limit-eia861-fast-etl-years

bff06f9

Merge branch 'main' into limit-eia861-fast-etl-years

42cdeb5

Merge branch 'main' into limit-eia861-fast-etl-years

678d720

Merge branch 'main' into limit-eia861-fast-etl-years

9f50082

Merge branch 'main' into limit-eia861-fast-etl-years

a2f6833

Merge branch 'main' into limit-eia861-fast-etl-years

19d2584

Merge branch 'main' into limit-eia861-fast-etl-years

ebd0707

Merge branch 'main' into limit-eia861-fast-etl-years

9cc483f

Merge branch 'main' into limit-eia861-fast-etl-years

cd9ebae

Merge branch 'main' into limit-eia861-fast-etl-years

aef4e73

Merge branch 'main' into limit-eia861-fast-etl-years

798fbe5

Merge branch 'main' into limit-eia861-fast-etl-years

d4e50bc

Merge branch 'main' into limit-eia861-fast-etl-years

eae4f04

Merge branch 'main' into limit-eia861-fast-etl-years

627d065

e-belfer self-assigned this Dec 23, 2025

zaneselvans moved this from Backlog to In progress in Catalyst Megaproject Dec 23, 2025

zaneselvans added 5 commits January 6, 2026 14:33

Merge branch 'main' into limit-eia861-fast-etl-years

8013eb8

Merge branch 'main' into limit-eia861-fast-etl-years

d8b25e9

Merge branch 'main' into limit-eia861-fast-etl-years

ef63feb

Merge branch 'main' into limit-eia861-fast-etl-years

7cfab00

Merge branch 'main' into limit-eia861-fast-etl-years

57de223

e-belfer moved this from In progress to Backlog in Catalyst Megaproject Jan 14, 2026

zaneselvans added 2 commits January 15, 2026 11:50

Merge branch 'main' into limit-eia861-fast-etl-years

26bfd8d

Merge branch 'main' into limit-eia861-fast-etl-years

0606b8c
