Add missing eia923 file mappings discovered in audit #4317

Draft
krivard wants to merge 3 commits into main from fix_eia923_mappings

Conversation

@krivard
Contributor

@krivard krivard commented Jun 10, 2025

Overview

What problem does this address?

While checking mappings for #4312, we discovered that some pages of existing source spreadsheets weren't being mapped but could (and should) have been. This PR adds those missing mappings.

NOTE: We're not currently extracting any of this data because it isn't column-mapped. It is file- and page-mapped, though, and we seem to be maintaining those mappings, so this correction is still probably worth making.

What did you change?

  • New file mappings for coal_stocks and petcoke_stocks for 2021, 2022, 2023
  • Fixed off-by-one error in page mappings for coal_stocks 2017
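An off-by-one page mapping of the kind fixed here can be caught by checking that each mapped sheet index lands on a sheet whose name matches the page. The following is a minimal sketch, not PUDL code: the function `check_page_mapping`, the sheet names, and the index values are all hypothetical, and `-1` is assumed to mean "not mapped", as in the audit script in the Testing section.

```python
def check_page_mapping(sheet_names, page_to_index, expected_keyword):
    """Return pages whose mapped sheet name lacks the expected keyword."""
    problems = {}
    for page, idx in page_to_index.items():
        if idx == -1:
            continue  # assumed sentinel: page is not mapped at all
        if not 0 <= idx < len(sheet_names):
            problems[page] = f"index {idx} out of range"
            continue
        if expected_keyword[page].lower() not in sheet_names[idx].lower():
            problems[page] = f"index {idx} points at {sheet_names[idx]!r}"
    return problems


# Hypothetical workbook layout, not copied from a real EIA file.
sheet_names = ["Generation Data", "Coal Stocks", "Petcoke Stocks", "File Layout"]
expected_keyword = {"coal_stocks": "coal", "petcoke_stocks": "petcoke"}

# coal_stocks is shifted by one sheet; petcoke_stocks is missing entirely.
off_by_one = {"coal_stocks": 2, "petcoke_stocks": -1}
fixed = {"coal_stocks": 1, "petcoke_stocks": 2}

print(check_page_mapping(sheet_names, off_by_one, expected_keyword))
print(check_page_mapping(sheet_names, fixed, expected_keyword))
```

A name-based check like this only works when sheet names are descriptive, so it complements a manual review of the spreadsheets and does not replace one.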

Documentation

Make sure to update relevant aspects of the documentation:

  • Update the release notes: reference the PR and related issues.
  • Update relevant Data Source jinja templates (see docs/data_sources/templates).
  • Update relevant table or source description metadata (see src/metadata).
  • Review and update any other aspects of the documentation that might be affected by this PR.

Testing

How did you make sure this worked? How can a reviewer verify this?

from io import BytesIO

import pandas as pd

from pudl.extract.eia923 import Extractor as Extractor923
from pudl.workspace.datastore import Datastore
from pudl.workspace.setup import PudlPaths

pp = PudlPaths()
ds = Datastore(local_cache_path=pp.data_dir)
ex = Extractor923(ds=ds)

# pagehits[source_filename][sheet_name] becomes True once any page maps to that sheet.
pagehits = {}
# One row per year, recording which file and sheet each page resolves to.
names = pd.DataFrame(data={}, index=range(2001, 2026))
for yr in range(2001, 2026):
    for page in ex._metadata._file_name.index:
        source_filename = ex.source_filename(page, year=yr)
        if source_filename == "-1":
            continue  # no file mapped for this page and year
        if source_filename not in ex._file_cache:
            partitions = ex.zipfile_resource_partitions(page, year=yr)
            with ds.get_zipfile_resource("eia923", **partitions) as zf:
                ex._file_cache[source_filename] = pd.ExcelFile(
                    BytesIO(zf.read(source_filename)), engine="calamine"
                )
        excelf = ex._file_cache[source_filename]
        pageid = ex._metadata._sheet_name.at[page, str(yr)]
        if source_filename not in pagehits:
            pagehits[source_filename] = {sni: False for sni in excelf.sheet_names}
        names.loc[yr, f"{page}.source"] = source_filename
        # pageid is used as a list index below, so compare it as an integer:
        # -1 means "page not mapped", not "the last sheet in the workbook".
        if int(pageid) == -1:
            continue
        pagehits[source_filename][excelf.sheet_names[pageid]] = True
        names.loc[yr, f"{page}.page"] = excelf.sheet_names[pageid]
# Report sheets that no page maps to, ignoring layout and code-list sheets.
for src in sorted(pagehits):
    miss = [k for k, hit in pagehits[src].items() if not hit]
    miss = [m for m in miss if "Layout" not in m and "Codes" not in m]
    if miss:
        print("=" * 80)
        print(f"{src}\n{', '.join(miss)}")

The only remaining unmapped pages are Puerto Rico (which is being handled in #4312) and Boiler/Generator (which all have identical counterparts already mapped; apparently EIA includes those tables in two separate spreadsheets):

================================================================================
EIA923_Schedules_2_3_4_5_M_12_2017_Final_Revision.xlsx
Page 6 Plant Frame Puerto Rico
================================================================================
EIA923_Schedules_2_3_4_5_M_12_2018_Final_Revision.xlsx
Page 6 Plant Frame Puerto Rico
================================================================================
SCHEDULE 3A 5A 8A 8B 8C 8D 8E 8F 2008.xlsm
Boiler Fuel Data, Generator Data
================================================================================
SCHEDULE 3A 5A 8A 8B 8C 8D 8E 8F 2010 on NOV 30 2011.xls
Boiler, Generator
================================================================================
SCHEDULE 3A 5A 8A 8B 8C 8D 8E 8F REVISED 2009 04112011.xls
Boiler, Generator
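The audit above needs a local PUDL datastore to run. For reviewers who only want to follow the bookkeeping, here is a minimal sketch of the same idea on plain Python data. The function `unmapped_sheets`, the workbook names, and the sheet lists are hypothetical; the `"-1"` and `-1` sentinels and the `Layout`/`Codes` filter mirror the script above.

```python
def unmapped_sheets(sheet_names_by_file, mappings, ignore=("Layout", "Codes")):
    """Return {filename: [sheet names that no mapping points at]}."""
    hits = {
        src: {name: False for name in names}
        for src, names in sheet_names_by_file.items()
    }
    for src, idx in mappings:
        if src == "-1" or idx == -1:
            continue  # sentinel: no file or no sheet mapped
        hits[src][sheet_names_by_file[src][idx]] = True
    report = {}
    for src in sorted(hits):
        miss = [
            name
            for name, hit in hits[src].items()
            if not hit and not any(token in name for token in ignore)
        ]
        if miss:
            report[src] = miss
    return report


# Hypothetical workbooks, for illustration only.
workbooks = {
    "workbook_a.xlsx": ["Page 1 Generation", "Page 6 Plant Frame", "File Layout"],
    "workbook_b.xlsx": ["Boiler", "Codes"],
}
# (filename, sheet index) pairs, as one page/year lookup would produce them.
mappings = [
    ("workbook_a.xlsx", 0),
    ("workbook_a.xlsx", -1),
    ("workbook_b.xlsx", 0),
    ("-1", -1),
]
report = unmapped_sheets(workbooks, mappings)
print(report)
```

Separating the bookkeeping from the file access like this would also make the audit straightforward to unit test, if it ever moves out of a PR description.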

To-do list

  • If updating analyses or data processing functions: make sure to update row count expectations in dbt tests.
  • Run make pytest-coverage locally to ensure that the merge queue will accept your PR.
  • Review the PR yourself and call out any questions or issues you have.
  • For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
  • For bigger ETL or data changes run the full ETL locally and then run the data validations using make pytest-validate.
  • Alternatively, run the build-deploy-pudl GitHub Action manually.

@krivard krivard added the eia923 (Anything having to do with EIA Form 923) label Jun 10, 2025
@aesharpe aesharpe added the bug (Things that are just plain broken.) and missing-info (Missing triage info: problem / impact statement) labels Jul 25, 2025
@aesharpe aesharpe moved this from New to Icebox in Catalyst Megaproject Jul 25, 2025
Labels

bug (Things that are just plain broken.)
eia923 (Anything having to do with EIA Form 923)
missing-info (Missing triage info: problem / impact statement)

Projects

Status: Icebox

2 participants