feat: update pyiceberg/catalog/hive.py to support hive 4.x.x #2206

Open: wants to merge 7 commits into base: main

igorvoltaic

resolves #1222

Rationale for this change

Starting with version 4.0.1, the Hive metastore removed deprecated thrift APIs that py-iceberg currently uses. When creating a table with catalog.create_table_transaction against Hive metastore 4.0.1, py-iceberg raises an unexpected thrift.Thrift.TApplicationException: Invalid method name: 'get_table' error.

Are there any user-facing changes?

None expected.

Contributor

@kevinjqliu kevinjqliu left a comment

LGTM, this change seems to be backwards compatible. We don't currently have a way to test against hive 4.x.x, but we can address that separately.

@@ -389,8 +389,8 @@ def _create_hive_table(self, open_client: Client, hive_table: HiveTable) -> None

    def _get_hive_table(self, open_client: Client, database_name: str, table_name: str) -> HiveTable:
        try:
            return open_client.get_table(dbname=database_name, tbl_name=table_name)
        except NoSuchObjectException as e:
            return open_client.get_tables(db_name=database_name, pattern=table_name).pop()

It looks like get_tables is available in previous versions of the thrift client.

I see

    def get_tables(self, db_name, pattern):
        """
        Parameters:
         - db_name
         - pattern

        """
        self.send_get_tables(db_name, pattern)
        return self.recv_get_tables()

in vendor/hive_metastore/ThriftHiveMetastore.py

@kevinjqliu
Contributor

ah looks like CI failed because we mock .get_table in tests
https://grep.app/search?f.path=tests%2F&f.path.pattern=tests&f.repo.pattern=iceberg-python&q=.get_table

need to switch those to get_tables

@igorvoltaic
Author

ah looks like CI failed because we mock .get_table in tests https://grep.app/search?f.path=tests%2F&f.path.pattern=tests&f.repo.pattern=iceberg-python&q=.get_table

need to switch those to get_tables

Unfortunately, get_tables returns str objects instead of the expected HiveTable objects. The tests saved me, thanks.
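
To illustrate the mismatch the tests caught, here is a minimal sketch that uses an in-memory stand-in for the thrift client. `FakeMetastoreClient` and `FakeTable` are hypothetical names invented for this example, and the pattern matching is simplified. The only behavior taken from the thread is that get_tables returns table names as str, so calling `.pop()` on its result yields a name rather than a HiveTable.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List


@dataclass
class FakeTable:
    """Hypothetical stand-in for the thrift-generated table object."""

    dbName: str
    tableName: str


class FakeMetastoreClient:
    """Hypothetical in-memory client; NOT the real thrift client."""

    def __init__(self, tables: List[FakeTable]) -> None:
        self._tables = {(t.dbName, t.tableName): t for t in tables}

    def get_tables(self, db_name: str, pattern: str) -> List[str]:
        # Returns matching table *names* only (simplified glob matching).
        return [
            name
            for (db, name) in self._tables
            if db == db_name and fnmatchcase(name, pattern)
        ]


client = FakeMetastoreClient([FakeTable("default", "events")])
result = client.get_tables(db_name="default", pattern="events").pop()
print(type(result).__name__)  # str, not FakeTable
```

Any caller that expects table attributes on the return value would fail on the plain string, which is the kind of failure the mocked tests surfaced.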

Contributor

@kevinjqliu kevinjqliu left a comment

great, looks like get_table_objects_by_name is backwards compatible too.

Could you try this PR out against a hive 4.x.x deployment?

It seems like we are using hive 4.0.0 in the integration tests. We are using the hive client session_catalog_hive in tests
https://grep.app/search?f.repo=apache%2Ficeberg-python&q=session_catalog_hive
So I'm confused about why the tests were not failing before.

@kevinjqliu
Contributor

oh interesting. so get_table was not deprecated in 4.0.0, but rather in 4.0.1 🤔

see the 4.0.1 changelog
[HIVE-26537] - Deprecate older APIs in the HMS
https://github.com/apache/hive/pull/3599/files#diff-47ffee8549a256db9156ce4287f750674b1689362362db066010ff60319dcca0L288

@kevinjqliu
Contributor

@igorvoltaic let's run the integration tests against 4.0.1!

FROM apache/hive:4.0.0

@kevinjqliu
Contributor

ah we cannot do that yet. kind of a chicken and egg problem

we use pyiceberg 0.9.1 to provision the hive catalog

RUN pip3 install "pyiceberg[s3fs,hive,pyarrow]==${PYICEBERG_VERSION}"

That's ok, I'll verify locally.

@kevinjqliu
Contributor

Looks like get_table_objects_by_name was removed in 4.0.1 too
https://github.com/apache/hive/pull/3599/files#diff-47ffee8549a256db9156ce4287f750674b1689362362db066010ff60319dcca0L293
so we'll need to find a different solution.

I found this out after getting hive 4.0.1 to run with local pyiceberg, see #2217

Contributor

@kevinjqliu kevinjqliu left a comment

Can't use get_table_objects_by_name; it's also removed in hive 4.0.1.

@Fokko
Contributor

Fokko commented Jul 16, 2025

Great catch @kevinjqliu:

Can't use get_table_objects_by_name; it's also removed in hive 4.0.1.

To catch this, we probably want to regenerate the vendor package against 4.0.1: https://github.com/apache/iceberg-python/tree/main/vendor#vendor-packages

@igorvoltaic
Author

Great catch @kevinjqliu:

Can't use get_table_objects_by_name; it's also removed in hive 4.0.1.

To catch this, we probably want to regenerate the vendor package against 4.0.1: https://github.com/apache/iceberg-python/tree/main/vendor#vendor-packages

Hey @Fokko, there is one already https://raw.githubusercontent.com/apache/hive/refs/heads/master/standalone-metastore/metastore-common/src/gen/thrift/gen-py/hive_metastore/ThriftHiveMetastore.py

Can't use get_table_objects_by_name; it's also removed in hive 4.0.1.

@kevinjqliu thanks for your help with testing. I'll try to find a way to keep it backwards compatible, but I'm not sure yet.

@kevinjqliu
Contributor

kevinjqliu commented Jul 16, 2025

ah this is the gift that keeps on giving...

I've made a couple changes in #2217 to

  • use hive 4.0.1 in integration tests
  • use current pyiceberg instead of previous versions in integration tests
  • regenerate vendor/
  • update pyiceberg hive client to use the non-deprecated get_table_req

Using hive 4.0.1 is currently blocked by apache/iceberg#12878, because the spark hms connector is not compatible with it.
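
As a rough illustration of the last bullet, here is a minimal sketch of a request-object lookup through get_table_req. Everything prefixed with `Fake` is a hypothetical stand-in written for this example so that it runs without a metastore. The field names (`dbName`, `tblName`, `table`) are assumptions modeled on the Hive thrift definitions and should be checked against the regenerated vendor/ package. `load_table` is likewise an illustrative helper, not the code in #2217.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


class FakeTApplicationException(Exception):
    """Stand-in for thrift.Thrift.TApplicationException."""


@dataclass
class FakeTable:
    dbName: str
    tableName: str


@dataclass
class FakeGetTableRequest:
    # Assumed field names; verify against the generated thrift classes.
    dbName: str
    tblName: str


@dataclass
class FakeGetTableResult:
    table: FakeTable


class FakeHive401Client:
    """Hypothetical client mimicking a server without the older APIs."""

    def __init__(self, tables: List[FakeTable]) -> None:
        self._tables: Dict[Tuple[str, str], FakeTable] = {
            (t.dbName, t.tableName): t for t in tables
        }

    def get_table(self, dbname: str, tbl_name: str) -> FakeTable:
        # Mirrors the error message quoted in the PR description.
        raise FakeTApplicationException("Invalid method name: 'get_table'")

    def get_table_req(self, req: FakeGetTableRequest) -> FakeGetTableResult:
        return FakeGetTableResult(table=self._tables[(req.dbName, req.tblName)])


def load_table(client: FakeHive401Client, database_name: str, table_name: str) -> FakeTable:
    """Fetch a table via the request-object API and unwrap the result."""
    request = FakeGetTableRequest(dbName=database_name, tblName=table_name)
    return client.get_table_req(request).table


client = FakeHive401Client([FakeTable("default", "events")])
table = load_table(client, "default", "events")
```

The point of the sketch is the extra wrapping: the request-based call takes one request object and returns a result object, so the caller unwraps `.table` instead of receiving the table directly.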

@igorvoltaic
Author

I think I got this. Made some more changes and we've tested those in our stage v4 server, moving on to production. Do we want to keep backwards compatibility with v3?
If so I'll push new updates in a couple of days

@kevinjqliu
Contributor

I was not able to get our integration tests working with hive 4.0.1, see #2217. The main issue is spark's hive catalog using deprecated APIs, apache/iceberg#12878

Made some more changes and we've tested those in our stage v4 server, moving on to production. Do we want to keep backwards compatibility with v3?

As long as these changes are backwards compatible with older hive clients/servers, we can make this change. But I'd need your help to ensure these changes work against hive 4.0.1

@igorvoltaic
Author

igorvoltaic commented Jul 24, 2025

So the idea is to keep both vendor module versions for hive_metastore, since we'd like to keep the lib backwards compatible. Still waiting for some tests to be run, though.
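
One possible shape for keeping two vendored modules side by side is sketched below, purely as an assumption about how a selection step could look. The module names `hive_metastore_v3` and `hive_metastore_v4` are hypothetical, and the sketch assumes the server version is supplied by configuration rather than discovered from the metastore. The 4.0.1 cutoff comes from the changelog discussion earlier in the thread.

```python
from typing import Tuple

# Hypothetical module names; the PR does not specify the actual layout.
LEGACY_MODULE = "hive_metastore_v3"
CURRENT_MODULE = "hive_metastore_v4"

# Per the thread, the older thrift APIs are gone as of 4.0.1.
FIRST_VERSION_WITHOUT_LEGACY_APIS: Tuple[int, ...] = (4, 0, 1)


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '4.0.1' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def vendor_module_for(server_version: str) -> str:
    """Pick the vendored module name for a configured server version."""
    if parse_version(server_version) >= FIRST_VERSION_WITHOUT_LEGACY_APIS:
        return CURRENT_MODULE
    return LEGACY_MODULE


# The returned name could then be passed to importlib.import_module.
selected = vendor_module_for("4.0.1")
```

Selecting by version keeps the call sites unchanged for older servers, at the cost of maintaining two generated packages.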

Successfully merging this pull request may close these issues.

Hive metastore 4.0.1 remove deprecated thrift APIs
3 participants