Limit EIA-861 years in the fast ETL #4568

Draft
zaneselvans wants to merge 64 commits into main from limit-eia861-fast-etl-years

@zaneselvans
Member

@zaneselvans zaneselvans commented Aug 26, 2025

Overview

For a long time we have processed all years of the EIA-861 data in both the full and fast ETL. Initially this was a hack to get around the fact that some of the tables / columns are discontinued and so result in entirely null columns if you just process the last year or two of data, and our old data validation tests couldn't accommodate that expectation. But with #4105 / #4382 that's no longer a problem.

This PR switches the fast ETL to processing just a couple of years of EIA-861 data, as we do with all of the other datasets. This will speed up the fast ETL a bit and reduce the amount of data required for CI.

Closes #2628

Documentation

Make sure to update relevant aspects of the documentation:

Testing

  • Rematerialized all my EIA-861 assets locally in both full and fast tests.
  • Ran dbt build --select "source:pudl.core_eia861*" --target etl-fast with the fast outputs.

To-do list

  • Update the EIA-861 transforms to allow only a few years to be processed.
  • Update the dbt expect_columns_not_all_null test parameters so they don't fail on just a few years of data.
  • Refactor BA association table repair in the FERC-714 outputs to not rely on always having all years of data.
  • Review the PR yourself and call out any questions or issues you have.

@zaneselvans zaneselvans added testing Writing tests, creating test data, automating testing, etc. ferc714 Anything having to do with FERC Form 714 eia861 Anything having to do with EIA Form 861 performance Make PUDL run faster! labels Aug 26, 2025
@zaneselvans zaneselvans requested a review from e-belfer August 26, 2025 02:27
@zaneselvans zaneselvans self-assigned this Aug 26, 2025
@zaneselvans zaneselvans moved this from New to In progress in Catalyst Megaproject Aug 26, 2025
@zaneselvans
Member Author

@e-belfer The part of this I was already familiar with was a quick fix, but now in the FERC-714 Outputs I see that the BA association fixes depend on always having all years of EIA-861 data available, and they feel a little inscrutable to me. I don't know if any of that context is still rattling around inside your brain but if so, do you have any thoughts on how this might be refactored to work cleanly when there's only a subset of the EIA-861 years available?

@e-belfer
Member

@e-belfer The part of this I was already familiar with was a quick fix, but now in the FERC-714 Outputs I see that the BA association fixes depend on always having all years of EIA-861 data available, and they feel a little inscrutable to me. I don't know if any of that context is still rattling around inside your brain but if so, do you have any thoughts on how this might be refactored to work cleanly when there's only a subset of the EIA-861 years available?

I've never looked at this part of the code before, so I am as lost as you are! We could probably add more conditionals in there, but that would make it even jankier, so maybe that section of the code wants a more thorough refactor? However, I don't have any immediate insight into the best way forward.

@zaneselvans
Member Author

@e-belfer Oh weird. Git was saying that all of this code came from you. But looking at the actual PR #2550 it looks like it was a mass cut-and-paste in the general Dagster re-organization.

The ASSOCIATIONS defining the fixes were created by Ethan, so maybe this is actually his code. So I guess we just have to figure it out. But it does seem pretty brittle / complex as it is.

@zaneselvans zaneselvans requested review from jdangerx and krivard and removed request for e-belfer August 27, 2025 16:05
Contributor

@krivard krivard left a comment

Mostly minor nudges. The ferc714 bits are a real trip.

It's not clear to me where data_maturity should get added to empty data frames, but we probably shouldn't try to do it twice?

Comment on lines 555 to 558
# If we're only processing some years of data, we may have entirely empty dataframes
# in the extraction phase, in which case the data_maturity field doesn't get added.
if "data_maturity" not in df.columns and len(df) == 0:
    df["data_maturity"] = pd.NA
Contributor

nit, because this function is already here, so we might as well use it: putting an extraction step in the transform module is a weird architectural choice

Member Author

I poked at trying to add data_maturity in the extract where it's being set for non-empty dataframes, and didn't see an ideal place -- since the code that does that work doesn't get run at all if the dataframe is going to end up empty. Instead I moved it into the _pre_process() function in this module, which means it doesn't need to be anywhere else in here. The only other changes now are to accommodate the possibility of having zero rows in division.
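
For anyone following along, the guard described above can be sketched in isolation. This is only an illustration of the idea: `ensure_data_maturity` is a hypothetical helper name, not the actual `_pre_process()` in the EIA-861 transform module, and `utility_id_eia` is used purely as a placeholder column.

```python
import pandas as pd


def ensure_data_maturity(df: pd.DataFrame) -> pd.DataFrame:
    """Add an all-null data_maturity column to an empty dataframe that lacks one.

    Hypothetical helper for illustration only; the real logic lives in the
    transform module's _pre_process() function.
    """
    if "data_maturity" not in df.columns and len(df) == 0:
        df = df.copy()  # avoid mutating the caller's dataframe
        df["data_maturity"] = pd.NA
    return df


# An empty extracted dataframe, as might happen when no pages cover the selected years.
empty = pd.DataFrame({"utility_id_eia": pd.Series(dtype="Int64")})
patched = ensure_data_maturity(empty)

# A non-empty dataframe missing the column is deliberately left alone, so that a
# genuine extraction problem is not silently papered over.
nonempty = pd.DataFrame({"utility_id_eia": pd.Series([1], dtype="Int64")})
untouched = ensure_data_maturity(nonempty)
```

Restricting the guard to empty dataframes keeps it from hiding a missing column in data that actually has rows.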

Comment on lines 1487 to 1489
# This happens if the extracted dataframe was empty, as is the case in the fast ETL.
if "data_maturity" not in transformed_dsm1.columns:
    transformed_dsm1["data_maturity"] = pd.NA
Contributor

Doesn't _post_process handle this?

Member Author

Unfortunately, there's an explicit reference to the data_maturity field within the transform function, so it needs to happen before we get to _post_process.

I suspect the "right" place to ensure that data maturity is added (even in the case of empty dataframes) is in the EIA Excel extractor... Maybe I should look into that more deeply.

dfi = df.set_index(index)
# Prepare reference rows
keys = [(fix["id"], pd.Timestamp(fix["from"], 1, 1)) for fix in ASSOCIATIONS]
eia861_years = df["report_date"].dt.year.unique()
Contributor

Consider adding an early exit here if eia861_years is empty
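
A minimal sketch of what such an early exit might look like, assuming each fix is a dict with an `"id"` and an integer `"from"` year, as in the snippet above. `EXAMPLE_ASSOCIATIONS` and `reference_keys` are hypothetical names, and the IDs and years are made up.

```python
import pandas as pd

# Made-up stand-in for the real ASSOCIATIONS list (IDs and years are invented).
EXAMPLE_ASSOCIATIONS = [
    {"id": 1001, "from": 2011, "to": (2006, 2010)},
    {"id": 1002, "from": 2020, "to": (2015, 2019)},
]


def reference_keys(df: pd.DataFrame, associations: list[dict]) -> list[tuple]:
    """Build (id, report_date) keys for fixes whose reference year is in the data."""
    eia861_years = {int(year) for year in df["report_date"].dt.year.unique()}
    if not eia861_years:
        # Early exit: with no EIA-861 years loaded there is nothing to look up.
        return []
    return [
        (fix["id"], pd.Timestamp(fix["from"], 1, 1))
        for fix in associations
        if fix["from"] in eia861_years
    ]


no_data = pd.DataFrame({"report_date": pd.to_datetime([])})
fast_etl = pd.DataFrame({"report_date": pd.to_datetime(["2020-01-01", "2021-01-01"])})

empty_keys = reference_keys(no_data, EXAMPLE_ASSOCIATIONS)
fast_keys = reference_keys(fast_etl, EXAMPLE_ASSOCIATIONS)
```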

Comment on lines +198 to +202
keys = [
    (fix["id"], pd.Timestamp(fix["from"], 1, 1))
    for fix in ASSOCIATIONS
    if fix["from"] in eia861_years
]
Contributor

Consider defining a local version of ASSOCIATIONS to use both here and at line 209

    refs, [fx for fx in ASSOCIATIONS if fx["from"] in eia861_years], strict=True
):
    for year in range(fix["to"][0], fix["to"][1] + 1):
        key = (fix["id"], pd.Timestamp(year, 1, 1))
Contributor

would it make sense to skip iterations here when year is not in eia861_years, instead of filtering at 216?
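
To make the question concrete, here is one way skipping inside the loop might look, assuming `fix["to"]` is an inclusive `(start_year, end_year)` pair as the `range()` call above implies. `target_keys` is a hypothetical helper, not a function in the FERC-714 output module.

```python
import pandas as pd


def target_keys(fix: dict, eia861_years: set[int]) -> list[tuple]:
    """Build (id, report_date) keys for the target years that are actually loaded."""
    start_year, end_year = fix["to"]
    keys = []
    for year in range(start_year, end_year + 1):
        if year not in eia861_years:
            # Skip target years that were not processed in this ETL run.
            continue
        keys.append((fix["id"], pd.Timestamp(year, 1, 1)))
    return keys


# Made-up fix and year selection for illustration.
example_fix = {"id": 1002, "from": 2020, "to": (2015, 2019)}
loaded_years = {2018, 2019, 2020}
loaded_keys = target_keys(example_fix, loaded_years)
```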

@zaneselvans
Member Author

Anybody else looking at this -- the FERC-714 stuff doesn't actually work right now. I tried to make some little tweaks, but eventually came to understand that a bigger refactor of the spot-fix application (which deeply depends on having all years of data available) is necessary to make this work (and probably desirable just in general since it's so complex as it is)

@e-belfer e-belfer self-assigned this Dec 23, 2025
@e-belfer
Member

The ASSOCIATIONS method originates from #618, and was created in #881. I'm planning on working backwards and converting these to manual spot-fixes targeting the values we want to include, rather than back or forward filling based on different years of the tables.
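
As a very rough sketch of one possible reading of that plan (nothing here is settled), explicit spot fixes could be stored as literal rows and filtered to whichever report years are present, so no reference year has to be loaded for back- or forward-filling. `SPOT_FIXES`, `apply_spot_fixes`, and every value below are hypothetical, and the column names are used only for illustration.

```python
import pandas as pd

# Hypothetical explicit spot fixes: one literal row per (BA, year). Values are invented.
SPOT_FIXES = pd.DataFrame(
    {
        "balancing_authority_id_eia": [1001, 1001, 1002],
        "report_date": pd.to_datetime(["2009-01-01", "2010-01-01", "2019-01-01"]),
        "utility_id_eia": [501, 501, 502],
    }
)


def apply_spot_fixes(df: pd.DataFrame, fixes: pd.DataFrame = SPOT_FIXES) -> pd.DataFrame:
    """Append only the spot-fix rows whose report year is present in df."""
    years = df["report_date"].dt.year.unique()
    usable = fixes[fixes["report_date"].dt.year.isin(years)]
    combined = pd.concat([df, usable], ignore_index=True)
    # Drop exact duplicates in case a fix row already exists in the data.
    return combined.drop_duplicates().reset_index(drop=True)


# A fast-ETL-like table that only contains two report years.
associations = pd.DataFrame(
    {
        "balancing_authority_id_eia": [1002, 1002],
        "report_date": pd.to_datetime(["2019-01-01", "2020-01-01"]),
        "utility_id_eia": [999, 999],
    }
)
fixed = apply_spot_fixes(associations)
```

Because each fix names its own year, a run with only a subset of years simply applies fewer fixes instead of failing on a missing reference year.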

@zaneselvans zaneselvans moved this from Backlog to In progress in Catalyst Megaproject Dec 23, 2025
@e-belfer e-belfer moved this from In progress to Backlog in Catalyst Megaproject Jan 14, 2026
Labels

  • eia861: Anything having to do with EIA Form 861
  • ferc714: Anything having to do with FERC Form 714
  • performance: Make PUDL run faster!
  • testing: Writing tests, creating test data, automating testing, etc.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

Enable filtering by year for EIA 861 and FERC 714 ETLs

3 participants