Conversation

@ZacBlanco
Contributor

@ZacBlanco ZacBlanco commented Apr 30, 2025

Description

These codecs are available in the writers, but don't seem to have been configured correctly. Trying to write tables with these formats previously threw errors. This change enables LZ4 and ZSTD compression for Parquet writers in Iceberg and Hive.

Motivation and Context

When users set the compression_codec session property or the *.compression-codec connector property to LZ4 or ZSTD with Parquet as the default format, tables would fail to be created because the codec was null in HiveCompressionCodec, which Iceberg also uses. I couldn't find a good reason for keeping these null, so I populated the correct enum variants and added tests to ensure they worked. Since this code is shared between the Iceberg and Hive connectors, I added tests for different file type and compression codec variants to ensure we have compatibility across all of the potential configuration combinations.
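
The failure mode described above can be sketched in isolation. This is a minimal illustration, assuming a simple codec-to-Parquet-name table; the class `CodecMappingSketch`, its `Codec` enum, and its methods are hypothetical and are not the actual `HiveCompressionCodec` code.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative only: the enum, maps, and method names below are hypothetical
// and are not taken from the Presto source.
final class CodecMappingSketch
{
    enum Codec { NONE, SNAPPY, GZIP, LZ4, ZSTD }

    // Mapping as described for the old behavior: LZ4 and ZSTD have no Parquet entry.
    static Map<Codec, String> before()
    {
        Map<Codec, String> mapping = new EnumMap<>(Codec.class);
        mapping.put(Codec.NONE, "UNCOMPRESSED");
        mapping.put(Codec.SNAPPY, "SNAPPY");
        mapping.put(Codec.GZIP, "GZIP");
        return mapping;
    }

    // Mapping after populating the missing variants.
    static Map<Codec, String> after()
    {
        Map<Codec, String> mapping = before();
        mapping.put(Codec.LZ4, "LZ4");
        mapping.put(Codec.ZSTD, "ZSTD");
        return mapping;
    }

    // Fails fast with a descriptive message instead of passing null to a writer.
    static String parquetCodecFor(Map<Codec, String> mapping, Codec codec)
    {
        String name = mapping.get(codec);
        if (name == null) {
            throw new IllegalArgumentException("No Parquet codec configured for " + codec);
        }
        return name;
    }
}
```

With the `before()` table, a lookup for LZ4 or ZSTD is rejected; with `after()`, both resolve to a Parquet codec name. Whether the resulting files can be read back is a separate question from this mapping.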

Impact

  • Users can now set compression_codec to LZ4 and ZSTD when creating Iceberg tables with Parquet as the default file format
  • Pagefile formats now support LZ4 and ZSTD compression codecs

Test Plan

  • New test matrix for supported file formats and compression codecs in Hive and Iceberg connectors
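
A format-by-codec matrix of this kind could be generated roughly as follows. This is a sketch under assumptions: the `FileFormat` and `Codec` enums are hypothetical stand-ins for the connector types, and the `Object[][]` shape is chosen because it suits a parameterized test; only the CREATE TABLE statement text comes from the tests in this PR.

```java
// Illustrative only: FileFormat, Codec, and the class name are hypothetical
// stand-ins for the connector's real types.
final class CodecMatrixSketch
{
    enum FileFormat { PARQUET, ORC }

    enum Codec { NONE, SNAPPY, GZIP, LZ4, ZSTD }

    // Cross product of every file format with every compression codec.
    static Object[][] matrix()
    {
        FileFormat[] formats = FileFormat.values();
        Codec[] codecs = Codec.values();
        Object[][] rows = new Object[formats.length * codecs.length][];
        int i = 0;
        for (FileFormat format : formats) {
            for (Codec codec : codecs) {
                rows[i++] = new Object[] {format, codec};
            }
        }
        return rows;
    }

    // Builds the CREATE TABLE statement used by the test for one matrix row.
    static String createTableSql(String tableName, FileFormat format)
    {
        return String.format(
                "CREATE TABLE %s (i bigint) WITH (\"write.format.default\" = '%s')",
                tableName,
                format.name());
    }
}
```

Each row would then drive one create, insert, and read-back cycle, so a codec that writes successfully but cannot be read is still caught.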

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
* Add support for ZSTD and LZ4 compression codecs in Parquet format
* Add support for LZ4 compression in ORC format

Hive Connector Changes
* Add support for ZSTD and LZ4 compression codecs in Parquet format
* Add support for LZ4 compression in ORC format

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Apr 30, 2025
@ZacBlanco ZacBlanco changed the title [Iceberg[ Add support for LZ4 and ZSTD compression codecs [Iceberg] Add support for LZ4 and ZSTD compression codecs Apr 30, 2025
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 4f83cd2 to 6c8b2a9 Compare April 30, 2025 22:04
@ZacBlanco ZacBlanco marked this pull request as ready for review May 1, 2025 23:42
@ZacBlanco ZacBlanco requested review from a team and hantangwangd as code owners May 1, 2025 23:42
@ZacBlanco ZacBlanco requested a review from jaystarshot May 1, 2025 23:42
@prestodb-ci prestodb-ci requested review from a team, infvg and pramodsatya and removed request for a team May 1, 2025 23:42
Member

@hantangwangd hantangwangd left a comment

Change looks good to me, just one nit.

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 6c8b2a9 to 97778fe Compare May 2, 2025 23:37
@ZacBlanco ZacBlanco changed the title [Iceberg] Add support for LZ4 and ZSTD compression codecs Add missing LZ4 and ZSTD compression codec classes May 2, 2025
agrawalreetika previously approved these changes May 5, 2025
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 97778fe to 74a3adf Compare May 9, 2025 21:28
Member

@hantangwangd hantangwangd left a comment

After checking the code in detail, I found that we don't really support LZ4 for PARQUET yet; see the code here. So do you think it makes sense to allow LZ4 configuration only once it's really supported for PARQUET?

Comment on lines 2804 to 2807
assertQuerySucceeds(session, format("CREATE TABLE %s (i bigint) WITH (\"write.format.default\" = '%s')", tableName, format.name()));
assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName));
assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name()));
assertUpdate(format("INSERT INTO %s SELECT num FROM UNNEST(sequence(0, 1000)) as t(num)", tableName), "VALUES 1001");
Member

Suggested change
- assertQuerySucceeds(session, format("CREATE TABLE %s (i bigint) WITH (\"write.format.default\" = '%s')", tableName, format.name()));
- assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName));
- assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name()));
- assertUpdate(format("INSERT INTO %s SELECT num FROM UNNEST(sequence(0, 1000)) as t(num)", tableName), "VALUES 1001");
+ assertQuerySucceeds(session, format("CREATE TABLE %s WITH (\"write.format.default\" = '%s') as select * from lineitem with no data", tableName, format.name()));
+ assertQuery(session, format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName));
+ assertQuery(session, format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name()));
+ assertUpdate(session, format("INSERT INTO %s SELECT * from lineitem", tableName), "select count(*) from lineitem");
+ assertQuery(session, format("SELECT * FROM %s", tableName), "select * from lineitem");

It seems that if we do the insertion using a session that includes a compression codec, the tests fail on PARQUET + LZ4/ZSTD.

For ZSTD, we might be trying to fetch its corresponding CompressionCodecName using a mismatched fully qualified name; see here.

Contributor Author

Thanks, this is a good catch! I have fixed the writers by adding support for LZ4 and ZSTD when set in the session.

Member

Seems the following statements still fail for LZ4:

        assertQuerySucceeds(session, format("CREATE TABLE %s WITH (\"write.format.default\" = '%s') as select * from lineitem with no data", tableName, format.name()));
        assertUpdate(session, format("INSERT INTO %s SELECT * from lineitem", tableName), "select count(*) from lineitem");
        assertQuery(format("SELECT * FROM %s", tableName), "select * from lineitem");

An error occurs when querying the actual Parquet data compressed with LZ4. There seems to be a problem with the airlift LZ4 compressor, as described in ParquetCompressor.getCompressor:

// When using airlift LZO or LZ4 compressor, decompressing page in reader throws exception.
......

The error information is as follows:

java.lang.IllegalArgumentException: Invalid offset or length (8, 16782243) in array of length 60232
	at io.airlift.compress.lz4.Lz4Decompressor.verifyRange(Lz4Decompressor.java:108)
	at io.airlift.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:34)
	at com.facebook.presto.parquet.ParquetCompressionUtils.decompress(ParquetCompressionUtils.java:151)
	......
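
The message in the trace is consistent with a bounds check failing because offset plus length (8 + 16782243) exceeds the 60232-byte input buffer. A minimal sketch of such a check follows; it is written for illustration, the class name is hypothetical, and it is not copied from the airlift source.

```java
// Illustrative only: a stand-alone bounds check that produces a message of the
// same shape as the one in the stack trace. Not the airlift implementation.
final class RangeCheckSketch
{
    static void verifyRange(byte[] data, int offset, int length)
    {
        if (data == null) {
            throw new NullPointerException("data is null");
        }
        // Widen to long so that offset + length cannot overflow an int.
        if (offset < 0 || length < 0 || (long) offset + length > data.length) {
            throw new IllegalArgumentException(String.format(
                    "Invalid offset or length (%s, %s) in array of length %s",
                    offset,
                    length,
                    data.length));
        }
    }
}
```

A length that large relative to the buffer suggests the reader is interpreting the page bytes differently from how the writer laid them out, though the trace alone does not say which side is at fault.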

@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 74a3adf to 7624603 Compare May 12, 2025 20:18
@ZacBlanco ZacBlanco requested a review from shangxinli as a code owner May 12, 2025 20:18
@steveburnett
Contributor

I don't find the compression_codec session property in Presto Session Properties. Should it be added to the doc?

@hantangwangd
Member

> I don't find the compression_codec session property in Presto Session Properties. Should it be added to the doc?

It appears that hive.compression_codec and iceberg.compression_codec are connector-specific session properties. Would it be more appropriate to add them in the corresponding connector docs? Currently, Iceberg has a dedicated section for session properties, but Hive doesn't seem to have such a section yet. What do you think would be the best way here? @steveburnett

@steveburnett
Contributor

> I don't find the compression_codec session property in Presto Session Properties. Should it be added to the doc?
>
> It appears that hive.compression_codec and iceberg.compression_codec are connector-specific session properties. Would it be more appropriate to add them in the corresponding connector docs? Currently, Iceberg has a dedicated section for session properties, but Hive doesn't seem to have such a section yet. What do you think would be the best way here? @steveburnett

Great question! You're correct, and both hive.compression-codec and iceberg.compression-codec are documented in Configuration Properties topics in the Hive Connector and Iceberg Connector pages.

I'm seeing three properties to discuss:

1. compression_codec
From the description of this PR "When users set the compression_codec session property" I assumed that there was a general compression_codec session property that is undocumented, and that is what I was asking if it should be doc'd in Presto Session Properties. Let me know if my assumption is wrong.

For the Hive and Iceberg connector-specific session properties, I agree that it would be appropriate to add them in the corresponding connector docs.

2. iceberg.compression_codec
As you mention, Iceberg Connector has a Session Properties topic so it would be appropriate to add iceberg.compression_codec.

3. hive.compression_codec
The Hive Connector doc does not have a separate Session Properties topic. Several session properties are mentioned throughout the page, either with their config properties or by themselves.

Revising the Hive Connector page to add a new Session Properties topic - which would imply gathering the various references to session properties to populate it - seems a large piece of work that isn't appropriate to add to this PR. I will open a doc issue about that reorganization of the Hive Connector page.

In the meantime, what do you think of adding a mention of hive.compression_codec to the entry for hive.compression-codec in the Hive Configuration Properties table? Something consistent with several other session property mentions in the Hive Connector page, like "The corresponding session property is hive.compression_codec." If you could, also consider adding the available options here as well, the way that iceberg.compression_codec is doc'd.

@steveburnett
Contributor

> I will open a doc issue about that reorganization of the Hive Connector page.

Doc issue created, see #25110.

@hantangwangd
Member

> I assumed that there was a general compression_codec session property that is undocumented, and that is what I was asking if it should be doc'd in Presto Session Properties. Let me know if my assumption is wrong.

I just confirmed that there is no compression_codec system session property.

> In the meantime, what do you think of adding a mention of hive.compression_codec to the entry for hive.compression-codec in the Hive Configuration Properties table? Something consistent with several other session property mentions in the Hive Connector page, like "The corresponding session property is hive.compression_codec." If you could, also consider adding the available options here as well, the way that iceberg.compression_codec is doc'd.

Sounds great to me.

These codecs are available in the writers, but don't seem to have been
configured correctly. Trying to write tables with these formats
previously threw errors. This change enables LZ4 and ZSTD compression
for Parquet writers in Hive and Iceberg
@ZacBlanco ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 7624603 to c364bdc Compare October 15, 2025 21:47
@ZacBlanco ZacBlanco changed the title Add missing LZ4 and ZSTD compression codec classes feat: Add missing LZ4 and ZSTD compression codec classes Oct 15, 2025
@hantangwangd
Member

Closing this in favor of PR #26346

Labels

from:IBM PR from IBM

5 participants