Skip to content

Excel file format cannot be determined #60

@mdavis-xyz

Description

@mdavis-xyz

I'm running some code which used to work. Now I get an error. I think the URL for the participant registration list is wrong.

Steps to reproduce

  • create a new Python virtual environment (3.12.3)
  • pip install nemosis
  • create a new empty folder for the data cache
  • Run this script:
from nemosis import static_table

nemosis_data_cache = "/home/matthew/Data/nemosis/"

static_pd: "pd.DataFrame" = static_table(
    "Generators and Scheduled Loads", nemosis_data_cache
)

print(static_pd)

Expected behavior: Script runs without error, and prints the static data.

Actual behavior:

INFO: Retrieving static table Generators and Scheduled Loads
Traceback (most recent call last):
  File "/home/matthew/Downloads/nemosis-dep/main.py", line 5, in <module>
    static_pd: "pd.DataFrame" = static_table(
                                ^^^^^^^^^^^^^
  File "/home/matthew/Downloads/nemosis-dep/venv/lib/python3.12/site-packages/nemosis/data_fetch_methods.py", line 297, in static_table
    table = static_file_reader_map[table_name](path_and_name, table_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matthew/Downloads/nemosis-dep/venv/lib/python3.12/site-packages/nemosis/data_fetch_methods.py", line 404, in _read_excel
    xls = _pd.ExcelFile(path_and_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matthew/Downloads/nemosis-dep/venv/lib/python3.12/site-packages/pandas/io/excel/_base.py", line 1554, in __init__
    raise ValueError(
ValueError: Excel file format cannot be determined, you must specify an engine manually.

Debugging Info

The installed libraries are:

  • pandas: 2.3.3
  • nemosis: 3.8.1
  • openpyxl: 3.1.5
  • pyxlsb: 1.0.10

The docs for pd.ExcelFile say that something changed in version 1.2.0, but that was superseded years ago.

The Cause

The AEMO participant registration file is a .xlsx. It has been that way for over a year

However the code here has a url ending in .xls.

https://www.aemo.com.au/-/media/Files/Electricity/NEM/Participant_Information/NEM-Registration-and-Exemption-List.xls

I do not know where this came from. When I open that URL with firefox, I see the file prompt suggests a .xlsx ending.

Image

So I think we were passing a filename to Pandas which ended with .xls, but the content was actually .xlsx. Maybe Pandas was previously smart enough to figure it out, but now expects us to be more explicit. Although when I try to use older versions of Pandas and xlrd, I get the same error.

Proposed Solution

I do not think we need to actually specify an engine. Rather we should just fix the URL.

Either:

  • append an x to the URL; or
  • replace the whole URL with https://www.aemo.com.au/-/media/files/electricity/nem/participant_information/nem-registration-and-exemption-list.xlsx

And change the filenames we save this to (e.g. here)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions