feat: Add missing LZ4 and ZSTD compression codec classes #25021

ZacBlanco · 2025-04-30T21:51:12Z

Description

These codecs are available in the writers, but don't seem to have been configured correctly. Writing tables with these codecs previously threw errors. This change enables LZ4 and ZSTD compression for Parquet writers in Iceberg and Hive.

Motivation and Context

When users set the compression_codec session property or the *.compression-codec connector property to LZ4 or ZSTD with Parquet as the default format, table creation failed because the codec was null in HiveCompressionCodec, which the Iceberg connector also uses. I couldn't find a good reason for keeping these null, so I populated the correct enum variants and added tests to ensure they work. Since this code is shared between the Iceberg and Hive connectors, I added tests for different file type and compression codec variants to ensure compatibility across all of the potential configuration combinations.
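The failure mode described above can be illustrated with a minimal sketch. It is not the Presto code: `CodecMappingSketch`, `SketchCompressionCodec`, `ParquetCodec`, and `EXAMPLE_UNMAPPED` are hypothetical names, and the sketch only assumes that each codec variant carries a format-specific codec that may be missing.

```java
import java.util.Optional;

// Hypothetical, simplified analogue of the codec enum discussed above.
// None of these names are taken from the Presto code base.
final class CodecMappingSketch
{
    // Stand-in for the Parquet-side codec identifiers.
    enum ParquetCodec
    {
        UNCOMPRESSED, SNAPPY, GZIP, LZ4, ZSTD
    }

    enum SketchCompressionCodec
    {
        NONE(Optional.of(ParquetCodec.UNCOMPRESSED)),
        SNAPPY(Optional.of(ParquetCodec.SNAPPY)),
        GZIP(Optional.of(ParquetCodec.GZIP)),
        LZ4(Optional.of(ParquetCodec.LZ4)),
        ZSTD(Optional.of(ParquetCodec.ZSTD)),
        // Deliberately unmapped variant, to show the failure mode.
        EXAMPLE_UNMAPPED(Optional.empty());

        private final Optional<ParquetCodec> parquetCodec;

        SketchCompressionCodec(Optional<ParquetCodec> parquetCodec)
        {
            this.parquetCodec = parquetCodec;
        }

        // Fails with a descriptive message instead of handing a null to the writer.
        ParquetCodec requireParquetCodec()
        {
            return parquetCodec.orElseThrow(
                    () -> new IllegalStateException(name() + " has no Parquet codec mapping"));
        }
    }

    public static void main(String[] args)
    {
        System.out.println(SketchCompressionCodec.ZSTD.requireParquetCodec());
        try {
            SketchCompressionCodec.EXAMPLE_UNMAPPED.requireParquetCodec();
        }
        catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Holding the mapping in an `Optional` makes a missing entry fail at lookup time with a clear message, rather than surfacing later as a null inside the writer.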

Impact

Users can now set compression_codec to LZ4 or ZSTD when creating Iceberg tables with Parquet as the default file format.
Pagefile formats now support LZ4 and ZSTD compression codecs.

Test Plan

New test matrix for supported file formats and compression codecs in Hive and Iceberg connectors
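A test matrix of this kind is usually the cross product of file formats and codecs, with one test invocation per pair. The sketch below shows only that enumeration in plain Java. The enum members are placeholders rather than the PR's actual list of supported combinations, and the PR's tests may be wired up differently (for example through a test-framework data provider).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a format x codec test matrix.
final class CodecMatrixSketch
{
    // Placeholder values, not the authoritative list of supported formats or codecs.
    enum FileFormat
    {
        PARQUET, ORC
    }

    enum Codec
    {
        NONE, SNAPPY, GZIP, LZ4, ZSTD
    }

    // Returns one {format, codec} row per combination, in declaration order.
    static List<Object[]> formatCodecMatrix()
    {
        List<Object[]> rows = new ArrayList<>();
        for (FileFormat format : FileFormat.values()) {
            for (Codec codec : Codec.values()) {
                rows.add(new Object[] {format, codec});
            }
        }
        return rows;
    }

    public static void main(String[] args)
    {
        for (Object[] row : formatCodecMatrix()) {
            System.out.println(row[0] + " + " + row[1]);
        }
    }
}
```

Each row would then drive one create, insert, and read-back cycle, so that a combination that writes successfully but cannot be read is caught as well.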

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
* Add support for ZSTD and LZ4 compression codecs in Parquet format
* Add support for LZ4 compression in ORC format

Hive Connector Changes
* Add support for ZSTD and LZ4 compression codecs in Parquet format
* Add support for LZ4 compression in ORC format

presto-hive/src/main/java/com/facebook/presto/hive/HiveCompressionCodec.java

hantangwangd

Change looks good to me, just one nit.

presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSessionProperties.java

hantangwangd

After checking the code in detail, I found that we don't really support LZ4 for PARQUET yet; see the code referenced here. Do you think it makes sense to allow the LZ4 configuration only once it's really supported for PARQUET?

hantangwangd · 2025-05-12T03:57:50Z

presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java

+        assertQuerySucceeds(session, format("CREATE TABLE %s (i bigint) WITH (\"write.format.default\" = '%s')", tableName, format.name()));
+        assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName));
+        assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name()));
+        assertUpdate(format("INSERT INTO %s SELECT num FROM UNNEST(sequence(0, 1000)) as t(num)", tableName), "VALUES 1001");


Suggested change

-        assertQuerySucceeds(session, format("CREATE TABLE %s (i bigint) WITH (\"write.format.default\" = '%s')", tableName, format.name()));
-        assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName));
-        assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name()));
-        assertUpdate(format("INSERT INTO %s SELECT num FROM UNNEST(sequence(0, 1000)) as t(num)", tableName), "VALUES 1001");
+        assertQuerySucceeds(session, format("CREATE TABLE %s WITH (\"write.format.default\" = '%s') as select * from lineitem with no data", tableName, format.name()));
+        assertQuery(session, format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName));
+        assertQuery(session, format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name()));
+        assertUpdate(session, format("INSERT INTO %s SELECT * from lineitem", tableName), "select count(*) from lineitem");
+        assertQuery(session, format("SELECT * FROM %s", tableName), "select * from lineitem");

It seems that if we do the insertion using the session that includes a compression codec, the tests fail on PARQUET + LZ4/ZSTD.

For ZSTD, we might be trying to fetch its corresponding CompressionCodecName using a mismatched fully qualified name. See the code referenced here.
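To make the suspected problem concrete, here is a minimal sketch of a lookup keyed on a fully qualified class name. It does not reproduce the real CompressionCodecName API: the enum, the `fromClassName` helper, and the `com.example` class names are hypothetical and chosen only to show how two classes with the same simple name can fail to match.

```java
import java.util.Optional;

// Hypothetical sketch: resolving a codec by the fully qualified name of its class.
final class CodecNameLookupSketch
{
    enum CodecName
    {
        // Illustrative class names only, not the names used by any real library.
        SNAPPY("com.example.codec.SnappyCodec"),
        ZSTD("com.example.codec.ZstdCodec");

        private final String codecClassName;

        CodecName(String codecClassName)
        {
            this.codecClassName = codecClassName;
        }

        // Matches on the exact fully qualified name; the simple name alone is not enough.
        static Optional<CodecName> fromClassName(String className)
        {
            for (CodecName name : values()) {
                if (name.codecClassName.equals(className)) {
                    return Optional.of(name);
                }
            }
            return Optional.empty();
        }
    }

    public static void main(String[] args)
    {
        // Same simple name, different package: the second lookup finds nothing.
        System.out.println(CodecName.fromClassName("com.example.codec.ZstdCodec"));
        System.out.println(CodecName.fromClassName("com.example.other.ZstdCodec"));
    }
}
```

If the writer is configured with a codec class from one package while the lookup table registers a class from another, the lookup fails even though both classes implement the same compression algorithm.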

ZacBlanco · May 12, 2025

Thanks, this is a good catch! I have fixed the writers by adding support for LZ4 and ZSTD when set in the session.

hantangwangd · May 13, 2025

It seems the following statements still fail for LZ4:

        assertQuerySucceeds(session, format("CREATE TABLE %s WITH (\"write.format.default\" = '%s') as select * from lineitem with no data", tableName, format.name()));
        assertUpdate(session, format("INSERT INTO %s SELECT * from lineitem", tableName), "select count(*) from lineitem");
        assertQuery(format("SELECT * FROM %s", tableName), "select * from lineitem");

An error occurs when querying the actual parquet data compressed with LZ4. It seems there exists a problem with the airlift LZ4 compressor, as described in ParquetCompressor.getCompressor:

// When using airlift LZO or LZ4 compressor, decompressing page in reader throws exception. ......

The error information is as follows:

java.lang.IllegalArgumentException: Invalid offset or length (8, 16782243) in array of length 60232
    at io.airlift.compress.lz4.Lz4Decompressor.verifyRange(Lz4Decompressor.java:108)
    at io.airlift.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:34)
    at com.facebook.presto.parquet.ParquetCompressionUtils.decompress(ParquetCompressionUtils.java:151)
    ......
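The exception says the decompressor was asked to read 16782243 bytes starting at offset 8 from a buffer of only 60232 bytes. The sketch below is a hypothetical stand-in for that kind of bounds check, not the airlift implementation; only the message format is taken from the trace above. The thread does not establish why the requested length is so large.

```java
// Hypothetical stand-in for a bounds check like the one named in the stack trace.
final class RangeCheckSketch
{
    // Rejects any (offset, length) window that does not fit inside the array.
    static void verifyRange(byte[] data, int offset, int length)
    {
        // Use long arithmetic so offset + length cannot overflow.
        if (offset < 0 || length < 0 || (long) offset + length > data.length) {
            throw new IllegalArgumentException(String.format(
                    "Invalid offset or length (%d, %d) in array of length %d",
                    offset, length, data.length));
        }
    }

    public static void main(String[] args)
    {
        byte[] page = new byte[60232];
        verifyRange(page, 8, 1024); // fits, no exception
        try {
            verifyRange(page, 8, 16782243); // the values reported in the trace
        }
        catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With the values from the trace, the check fails because 8 + 16782243 is far beyond the 60232 bytes available, so the reader and the writer evidently disagree about how the compressed page is laid out.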

steveburnett · 2025-05-13T17:26:21Z

I don't find the compression_codec session property in Presto Session Properties. Should it be added to the doc?

hantangwangd · 2025-05-13T17:54:03Z

> I don't find the compression_codec session property in Presto Session Properties. Should it be added to the doc?

It appears that hive.compression_codec and iceberg.compression_codec are connector specific session properties, would it be more appropriate to add them in the corresponding connector docs? Currently, Iceberg has a dedicated section for session properties, but Hive doesn't seem to have such a section yet. What do you think would be the best way here? @steveburnett

steveburnett · 2025-05-14T14:25:52Z

> I don't find the compression_codec session property in Presto Session Properties. Should it be added to the doc?

> It appears that hive.compression_codec and iceberg.compression_codec are connector specific session properties, would it be more appropriate to add them in the corresponding connector docs? Currently, Iceberg has a dedicated section for session properties, but Hive doesn't seem to have such a section yet. What do you think would be the best way here? @steveburnett

Great question! You're correct, and both hive.compression-codec and iceberg.compression-codec are documented in Configuration Properties topics in the Hive Connector and Iceberg Connector pages.

I'm seeing three properties to discuss:

1. compression_codec
From the description of this PR, "When users set the compression_codec session property", I assumed that there was a general compression_codec session property that is undocumented, and that is what I was asking about documenting in Presto Session Properties. Let me know if my assumption is wrong.

For the Hive and Iceberg connector-specific session properties, I agree that it would be appropriate to add them in the corresponding connector docs.

2. iceberg.compression_codec
As you mention, Iceberg Connector has a Session Properties topic so it would be appropriate to add iceberg.compression_codec.

3. hive.compression_codec
The Hive Connector doc does not have a separate Session Properties topic. Several session properties are mentioned throughout the page, either with their config properties or by themselves.

Revising the Hive Connector page to add a new Session Properties topic - which would imply gathering the various references to session properties to populate it - seems a large piece of work that isn't appropriate to add to this PR. I will open a doc issue about that reorganization of the Hive Connector page.

In the meantime, what do you think of adding a mention of hive.compression_codec to the entry for hive.compression-codec in the Hive Configuration Properties table? Something consistent with several other session property mentions in the Hive Connector page, like "The corresponding session property is hive.compression_codec." If you could, also consider adding the available options here as well, the way that iceberg.compression_codec is doc'd.

steveburnett · 2025-05-14T15:19:59Z

> I will open a doc issue about that reorganization of the Hive Connector page.

Doc issue created, see #25110.

hantangwangd · 2025-05-14T17:00:03Z

> I assumed that there was a general compression_codec session property that is undocumented, and that is what I was asking if it should be doc'd in Presto Session Properties. Let me know if my assumption is wrong.

I just confirmed that there is no compression_codec system session property.

> In the meantime, what do you think of adding a mention of hive.compression_codec to the entry for hive.compression-codec in the Hive Configuration Properties table? Something consistent with several other session property mentions in the Hive Connector page, like "The corresponding session property is hive.compression_codec." If you could, also consider adding the available options here as well, the way that iceberg.compression_codec is doc'd.

Sounds great to me.

hantangwangd · 2025-10-22T07:23:20Z

Closing this in favor of PR #26346

prestodb-ci added the from:IBM PR from IBM label Apr 30, 2025

ZacBlanco changed the title ~~[Iceberg[ Add support for LZ4 and ZSTD compression codecs~~ [Iceberg] Add support for LZ4 and ZSTD compression codecs Apr 30, 2025

ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 4f83cd2 to 6c8b2a9 Compare April 30, 2025 22:04

ZacBlanco marked this pull request as ready for review May 1, 2025 23:42

ZacBlanco requested review from a team and hantangwangd as code owners May 1, 2025 23:42

ZacBlanco requested a review from jaystarshot May 1, 2025 23:42

prestodb-ci requested review from a team, infvg and pramodsatya and removed request for a team May 1, 2025 23:42

agrawalreetika reviewed May 2, 2025

presto-hive/src/main/java/com/facebook/presto/hive/HiveCompressionCodec.java

hantangwangd reviewed May 2, 2025

presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSessionProperties.java

ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 6c8b2a9 to 97778fe Compare May 2, 2025 23:37

ZacBlanco changed the title ~~[Iceberg] Add support for LZ4 and ZSTD compression codecs~~ Add missing LZ4 and ZSTD compression codec classes May 2, 2025

agrawalreetika previously approved these changes May 5, 2025

ZacBlanco dismissed agrawalreetika’s stale review via 74a3adf May 9, 2025 21:28

ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 97778fe to 74a3adf Compare May 9, 2025 21:28

hantangwangd reviewed May 12, 2025

ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 74a3adf to 7624603 Compare May 12, 2025 20:18

ZacBlanco requested a review from shangxinli as a code owner May 12, 2025 20:18

steveburnett mentioned this pull request May 14, 2025

[docs] Add Session Properties to Hive Connector #25110

Open

This was referenced Oct 9, 2025

Iceberg configuration properties 'iceberg.compression-codec description is inaccurate #26261

Closed

docs: Update description of iceberg.compression-codec #26266

Merged

ZacBlanco force-pushed the upstream-iceberg-compression-codecs branch from 7624603 to c364bdc Compare October 15, 2025 21:47

ZacBlanco changed the title ~~Add missing LZ4 and ZSTD compression codec classes~~ feat: Add missing LZ4 and ZSTD compression codec classes Oct 15, 2025

hantangwangd mentioned this pull request Oct 16, 2025

Add support for missing compression codecs #26334

Closed

hantangwangd closed this Oct 22, 2025
