Conversation
…iting EIA-861 years in fast ETL.
|
@e-belfer The part of this I was already familiar with was a quick fix, but now in the FERC-714 Outputs I see that the BA association fixes depend on always having all years of EIA-861 data available, and they feel a little inscrutable to me. I don't know if any of that context is still rattling around inside your brain but if so, do you have any thoughts on how this might be refactored to work cleanly when there's only a subset of the EIA-861 years available? |
I've never looked at this part of the code before, so I am equally as lost as you are! We could probably add more conditionals in there but it'd make it even jankier, so maybe that section of the code is wanting a more thorough refactor? I don't however have an immediate insight into the best way forward. |
|
@e-belfer Oh weird. Git was saying that all of this code came from you. But looking at the actual PR #2550 it looks like it was a mass cut-and-paste in the general Dagster re-organization. The |
krivard
left a comment
There was a problem hiding this comment.
Mostly minor nudges. The ferc714 bits are a real trip.
It's not clear to me where data_maturity should get added to empty data frames, but we probably shouldn't try to do it twice?
src/pudl/transform/eia861.py
Outdated
| # If we're only processing some years of data, we may have entirely empty dataframes | ||
| # in the extraction phase, in which case the data_maturity field doesn't get added. | ||
| if "data_maturity" not in df.columns and len(df) == 0: | ||
| df["data_maturity"] = pd.NA |
There was a problem hiding this comment.
nit, because this function is already here, so we might as well use it: putting an extraction step in the transform module is a weird architectural choice
There was a problem hiding this comment.
I poked at trying to add data_maturity in the extract where it's being set for non-empty dataframes, and didn't see an ideal place -- since the code that does that work doesn't get run at all if the dataframe is going to end up empty. Instead I moved it into the _pre_process() function in this module, which means it doesn't need to be anywhere else in here. The only other changes now are to accommodate the possibility of having zero rows in division.
src/pudl/transform/eia861.py
Outdated
| # This happens if the extracted dataframe was empty, as is the case in the fast ETL. | ||
| if "data_maturity" not in transformed_dsm1.columns: | ||
| transformed_dsm1["data_maturity"] = pd.NA |
There was a problem hiding this comment.
Doesn't _post_process handle this?
There was a problem hiding this comment.
Unfortunately, there's an explicit reference to the data_maturity field within the transform function, so it needs to happen before we get to _post_process.
I suspect the "right" place to ensure that data maturity is added (even in the case of empty dataframes) is in the EIA Excel extractor... Maybe I should look into that more deeply.
| dfi = df.set_index(index) | ||
| # Prepare reference rows | ||
| keys = [(fix["id"], pd.Timestamp(fix["from"], 1, 1)) for fix in ASSOCIATIONS] | ||
| eia861_years = df["report_date"].dt.year.unique() |
There was a problem hiding this comment.
Consider adding an early exit here if eia861_years is empty
| keys = [ | ||
| (fix["id"], pd.Timestamp(fix["from"], 1, 1)) | ||
| for fix in ASSOCIATIONS | ||
| if fix["from"] in eia861_years | ||
| ] |
There was a problem hiding this comment.
Consider defining a local version of ASSOCIATIONS to use both here and at line 209
| refs, [fx for fx in ASSOCIATIONS if fx["from"] in eia861_years], strict=True | ||
| ): | ||
| for year in range(fix["to"][0], fix["to"][1] + 1): | ||
| key = (fix["id"], pd.Timestamp(year, 1, 1)) |
There was a problem hiding this comment.
would it make sense to skip iterations here when year is not in eia861_years, instead of filtering at 216?
|
Anybody else looking at this -- the FERC-714 stuff doesn't actually work right now. I tried to make some little tweaks, but eventually came to understand that a bigger refactor of the spot-fix application (which deeply depends on having all years of data available) is necessary to make this work (and probably desirable just in general since it's so complex as it is) |
…perative/pudl into limit-eia861-fast-etl-years
Overview
For a long time we have processed all years of the EIA-861 data in both the full and fast ETL. Initially this was a hack to get around the fact that some of the tables / columns are discontinued and so result in entirely null columns if you just process the last year or two of data, and our old data validation tests couldn't accommodate that expectation. But with #4105 / #4382 that's no longer a problem.
This PR switches to processing just a couple of years of data for the EIA-861 in the Fast ETL like we do with all of the other datasets. This will speed up the fast ETL a bit, and reduce the amount of data required for the CI.
Closes #2628
Documentation
Make sure to update relevant aspects of the documentation:
Testing
dbt build --select "source:pudl.core_eia861*" --target etl-fastw/ fast outputsTo-do list
expect_columns_not_all_nulltest parameters so they don't fail on just a few years of data.