
Conversation

@hantangwangd
Member

@hantangwangd hantangwangd commented Oct 17, 2025

Co-authored-by: Zac Blanco [email protected]

Description

These codecs are available in the writers, but they don't seem to have been configured correctly: trying to write tables with these codecs previously threw errors. This change enables LZ4 on ORC and ZSTD on Parquet for the data writers in Hive and Iceberg.

Addresses issue #26334

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
 * Add support for ``ZSTD`` compression codec in Parquet format
 * Add support for ``LZ4`` compression codec in ORC format

Hive Connector Changes
 * Add support for ``ZSTD`` compression codec in Parquet format
 * Add support for ``LZ4`` compression codec in ORC format

@sourcery-ai
Contributor

sourcery-ai bot commented Oct 17, 2025

Reviewer's Guide

This PR implements LZ4 support in ORC/PAGEFILE and ZSTD support in Parquet across the Hive and Iceberg connectors by extending the codec enum, updating writer and configuration layers, exposing the compression_codec session property, and parameterizing tests to verify all format/codec combinations.
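For orientation, codec selection flows through the `compression_codec` session property mentioned above. The fragment below is only a hedged, test-style illustration of that flow; the catalog names, table names, and the `getSession()`/`getQueryRunner()` helpers are assumptions borrowed from the test context, not code from this PR.

```java
import com.facebook.presto.Session;

// Illustration only: choose a codec per catalog through the compression_codec
// session property, then write a table in the matching file format.
Session hiveZstdParquet = Session.builder(getSession())
        .setCatalogSessionProperty("hive", "compression_codec", "ZSTD")
        .build();
getQueryRunner().execute(hiveZstdParquet,
        "CREATE TABLE test_zstd_parquet WITH (format = 'PARQUET') AS SELECT * FROM orders");

Session icebergLz4Orc = Session.builder(getSession())
        .setCatalogSessionProperty("iceberg", "compression_codec", "LZ4")
        .build();
getQueryRunner().execute(icebergLz4Orc,
        "CREATE TABLE test_lz4_orc WITH (format = 'ORC') AS SELECT * FROM orders");
```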

Sequence diagram for data write with new compression codecs

sequenceDiagram
    actor User
    participant HiveConnector
    participant IcebergConnector
    participant Writer
    User->>HiveConnector: Create table with ORC/LZ4 or PARQUET/ZSTD
    HiveConnector->>Writer: Configure writer with selected codec
    Writer->>HiveConnector: Write data using codec
    User->>IcebergConnector: Create table with ORC/LZ4 or PARQUET/ZSTD
    IcebergConnector->>Writer: Configure writer with selected codec
    Writer->>IcebergConnector: Write data using codec

Class diagram for updated HiveCompressionCodec enum

classDiagram
    class HiveCompressionCodec {
        +NONE
        +SNAPPY
        +GZIP
        +LZ4
        +ZSTD
        -Optional<Class<? extends CompressionCodec>> codec
        -CompressionKind orcCompressionKind
        -CompressionCodecName parquetCompressionCodec
        -Predicate<HiveStorageFormat> supportedStorageFormats
        +getOrcCompressionKind()
        +getParquetCompressionCodec()
        +isSupportedStorageFormat(HiveStorageFormat)
    }
    HiveCompressionCodec --> HiveStorageFormat
    HiveCompressionCodec --> CompressionKind
    HiveCompressionCodec --> CompressionCodecName
    HiveCompressionCodec --> CompressionCodec
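To make the diagram concrete, below is a minimal Java sketch of an enum along these lines. It reuses the real ORC `CompressionKind` and Parquet `CompressionCodecName` enums, but the class name, the format predicates, and the omission of the `Optional` Hadoop codec class are illustrative simplifications; the actual `HiveCompressionCodec` may differ in detail.

```java
import java.util.function.Predicate;

import com.facebook.presto.hive.HiveStorageFormat;
import org.apache.orc.CompressionKind;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Sketch only: map each codec to its ORC and Parquet equivalents and to the
// storage formats it may be combined with.
public enum CompressionCodecSketch
{
    NONE(CompressionKind.NONE, CompressionCodecName.UNCOMPRESSED, format -> true),
    SNAPPY(CompressionKind.SNAPPY, CompressionCodecName.SNAPPY, format -> true),
    GZIP(CompressionKind.ZLIB, CompressionCodecName.GZIP, format -> true),
    // Newly enabled: LZ4, restricted here to ORC and PAGEFILE (assumed restriction)
    LZ4(CompressionKind.LZ4, CompressionCodecName.LZ4,
            format -> format == HiveStorageFormat.ORC || format == HiveStorageFormat.PAGEFILE),
    // Newly enabled: ZSTD for both the ORC and Parquet writers
    ZSTD(CompressionKind.ZSTD, CompressionCodecName.ZSTD, format -> true);

    private final CompressionKind orcCompressionKind;
    private final CompressionCodecName parquetCompressionCodec;
    private final Predicate<HiveStorageFormat> supportedStorageFormats;

    CompressionCodecSketch(
            CompressionKind orcCompressionKind,
            CompressionCodecName parquetCompressionCodec,
            Predicate<HiveStorageFormat> supportedStorageFormats)
    {
        this.orcCompressionKind = orcCompressionKind;
        this.parquetCompressionCodec = parquetCompressionCodec;
        this.supportedStorageFormats = supportedStorageFormats;
    }

    public CompressionKind getOrcCompressionKind()
    {
        return orcCompressionKind;
    }

    public CompressionCodecName getParquetCompressionCodec()
    {
        return parquetCompressionCodec;
    }

    public boolean isSupportedStorageFormat(HiveStorageFormat format)
    {
        return supportedStorageFormats.test(format);
    }
}
```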

Class diagram for ParquetWriter compression codec handling

classDiagram
    class ParquetWriter {
        +ParquetWriter(..., String compressionCodecClass)
        -OutputStreamSliceOutput outputStream
        -List<String> names
        -CompressionCodec getCompressionCodec(String compressionCodecClass)
    }
    class OutputStreamSliceOutput
    class CompressionCodecClass
    ParquetWriter --> OutputStreamSliceOutput
    ParquetWriter --> CompressionCodecClass
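The ZStandard handling described here is essentially a class-name-to-codec mapping: the same codec can be referenced by its Hadoop class name or by its Parquet class name, and both must resolve to the same `CompressionCodecName`. The helper below is a hedged sketch of that idea, not the actual `getCompressionCodec` implementation in `ParquetWriter`.

```java
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Sketch only: resolve a compression codec class name to Parquet's CompressionCodecName.
final class ParquetCodecResolver
{
    private ParquetCodecResolver() {}

    static CompressionCodecName getCompressionCodec(String compressionCodecClass)
    {
        if (compressionCodecClass == null) {
            return CompressionCodecName.UNCOMPRESSED;
        }
        switch (compressionCodecClass) {
            // ZStandard may arrive under either the Hadoop or the Parquet codec class name
            case "org.apache.hadoop.io.compress.ZStandardCodec":
            case "org.apache.parquet.hadoop.codec.ZstandardCodec":
                return CompressionCodecName.ZSTD;
            case "org.apache.hadoop.io.compress.Lz4Codec":
                return CompressionCodecName.LZ4;
            case "org.apache.hadoop.io.compress.GzipCodec":
                return CompressionCodecName.GZIP;
            case "org.apache.hadoop.io.compress.SnappyCodec":
                return CompressionCodecName.SNAPPY;
            default:
                return CompressionCodecName.UNCOMPRESSED;
        }
    }
}
```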

Class diagram for IcebergUtil.populateTableProperties changes

classDiagram
    class IcebergUtil {
        +populateTableProperties(IcebergAbstractMetadata, Table, HiveCompressionCodec, ...)
    }
    IcebergUtil --> HiveCompressionCodec

Class diagram for ConfigurationUtils.setCompressionProperties changes

classDiagram
    class ConfigurationUtils {
        -setCompressionProperties(Configuration config, HiveCompressionCodec compression)
    }
    ConfigurationUtils --> HiveCompressionCodec
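Conceptually, `setCompressionProperties` translates the chosen codec into the configuration keys read by the ORC and Parquet output formats. The sketch below reuses the `CompressionCodecSketch` enum from the earlier sketch; the Hive/MapReduce property names are assumptions for illustration, and only `ParquetOutputFormat.COMPRESSION` is named in this PR's change summary.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetOutputFormat;

// Sketch only: push the selected codec into the Hadoop configuration consumed by the writers.
final class CompressionConfigSketch
{
    private CompressionConfigSketch() {}

    static void setCompressionProperties(Configuration config, CompressionCodecSketch compression)
    {
        if (compression == CompressionCodecSketch.NONE) {
            config.setBoolean("hive.exec.compress.output", false);
            return;
        }
        config.setBoolean("hive.exec.compress.output", true);
        // ORC picks its codec up from the default-compress key (assumed key name)
        config.set("hive.exec.orc.default.compress", compression.getOrcCompressionKind().name());
        // Parquet reads "parquet.compression", exposed as ParquetOutputFormat.COMPRESSION
        config.set(ParquetOutputFormat.COMPRESSION, compression.getParquetCompressionCodec().name());
    }
}
```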

Class diagram for session property exposure in HiveSessionProperties and IcebergSessionProperties

classDiagram
    class HiveSessionProperties {
        +COMPRESSION_CODEC
    }
    class IcebergSessionProperties {
        +COMPRESSION_CODEC
    }

File-Level Changes

Enable LZ4 and ZSTD as supported compression codecs in HiveCompressionCodec
  • Extend enum entries for LZ4 and ZSTD with the correct codec classes and supported formats
  • Switch the parquetCompressionCodec field and accessor from Optional to a direct CompressionCodecName
  Files:
  • presto-hive/src/main/java/com/facebook/presto/hive/HiveCompressionCodec.java

Propagate the new codec mappings in ParquetWriter, ConfigurationUtils, and IcebergFileWriterFactory
  • Recognize both the Hadoop and Parquet ZStandard codec class names in ParquetWriter
  • Remove Optional wrappers and set ParquetOutputFormat.COMPRESSION directly in ConfigurationUtils
  • Pass CompressionCodecName directly (not Optional) to IcebergFileWriterFactory
  Files:
  • presto-parquet/src/main/java/com/facebook/presto/parquet/writer/ParquetWriter.java
  • presto-hive/src/main/java/com/facebook/presto/hive/util/ConfigurationUtils.java
  • presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergFileWriterFactory.java

Write correct Parquet compression property values in IcebergUtil.populateTableProperties
  • Use name() on CompressionCodecName instead of Optional.get().toString() when populating PARQUET_COMPRESSION
  Files:
  • presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java

Make the compression_codec session property public for the Hive and Iceberg connectors
  • Expose COMPRESSION_CODEC as public static in HiveSessionProperties
  • Introduce a COMPRESSION_CODEC constant in IcebergSessionProperties
  Files:
  • presto-hive/src/main/java/com/facebook/presto/hive/HiveSessionProperties.java
  • presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSessionProperties.java

Introduce DataProviders and parameterized tests for format/codec combinations
  • Replace the standalone pagefile compression test with a DataProvider-driven testFormatAndCompressionCodecs in the Hive smoke tests
  • Add similar data-driven tests in the Iceberg distributed test base and smoke test
  • Update TestIcebergUtil to mark LZ4 and ZSTD as supported in the matrix
  • Simplify the FileFormat benchmark writer to drop the unsupported fallback and use direct codec class names
  Files:
  • presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java
  • presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java
  • presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedSmokeTestBase.java
  • presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergUtil.java
  • presto-hive/src/test/java/com/facebook/presto/hive/benchmark/FileFormat.java
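To make the DataProvider-driven tests described above concrete, here is a hedged TestNG-style sketch. The class, method, and provider names, the `hive` catalog, and the supported/unsupported matrix are illustrative assumptions rather than the actual test code added by this PR, and the query-runner setup required by the base class is omitted; the expected error message follows the pattern quoted in the review comments further down.

```java
import com.facebook.presto.Session;
import com.facebook.presto.tests.AbstractTestQueryFramework;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import static java.lang.String.format;

public abstract class FormatCompressionMatrixSketch
        extends AbstractTestQueryFramework
{
    @DataProvider(name = "formatAndCodec")
    public Object[][] formatAndCodec()
    {
        // {file format, compression codec, whether the combination is expected to work}
        return new Object[][] {
                {"ORC", "LZ4", true},
                {"ORC", "ZSTD", true},
                {"PARQUET", "ZSTD", true},
                {"PARQUET", "LZ4", false},
        };
    }

    @Test(dataProvider = "formatAndCodec")
    public void testFormatAndCompressionCodecs(String fileFormat, String codec, boolean supported)
    {
        String tableName = format("test_%s_%s", fileFormat, codec).toLowerCase();
        // Select the codec through the connector's compression_codec session property
        Session session = Session.builder(getSession())
                .setCatalogSessionProperty("hive", "compression_codec", codec)
                .build();
        String createTable = format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders", tableName, fileFormat);
        if (supported) {
            getQueryRunner().execute(session, createTable);
            assertQuery(format("SELECT sum(custkey) FROM %s", tableName), "SELECT sum(custkey) FROM orders");
            assertQuerySucceeds(format("DROP TABLE %s", tableName));
        }
        else {
            assertQueryFails(session, createTable,
                    format("%s compression is not supported with %s", codec, fileFormat));
        }
    }
}
```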


@hantangwangd hantangwangd changed the title Feat: Add support for LZ4 and ZSTD compression codecs feat: Add support for LZ4 and ZSTD compression codecs Oct 17, 2025
@hantangwangd hantangwangd force-pushed the add_compression_codecs branch from 39d97ab to 589ee3d Compare October 17, 2025 07:08
@hantangwangd hantangwangd changed the title feat: Add support for LZ4 and ZSTD compression codecs feat: Add support for LZ4 and ZSTD compression codecs Oct 17, 2025
@PingLiuPing
Contributor

This change enables LZ4 on Parquet and ZSTD on ORC for data writers in Hive and Iceberg

How about ZSTD on parquet?

@hantangwangd
Member Author

How about ZSTD on parquet?

Oh, I made a mistake. It should be adding support for ZSTD on Parquet and LZ4 on ORC. I've updated the PR description. Thanks for pointing this out.

@hantangwangd hantangwangd force-pushed the add_compression_codecs branch from 589ee3d to cf20c03 Compare October 17, 2025 10:02
@hantangwangd hantangwangd marked this pull request as ready for review October 17, 2025 13:24
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:5722-5726` </location>
<code_context>
+            assertQuery(format("SELECT sum(custkey) FROM %s", tableName), "SELECT sum(custkey) FROM orders");
+            assertQuerySucceeds(format("DROP TABLE %s", tableName));
+        }
+        else {
+            assertQueryFails(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
+                            tableName, format.name()),
+                    format("%s compression is not supported with %s", codec, format));
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding assertions for error messages when unsupported codecs are used.

Adding assertions for specific error messages will ensure that error handling remains consistent and prevent regressions in messaging.

```suggestion
        else {
            String expectedErrorMessage = format("%s compression is not supported with %s", codec, format);
            assertQueryFails(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
                            tableName, format.name()),
                    expectedErrorMessage);
            // Additional assertion to verify the error message is present in the thrown exception
            try {
                getQueryRunner().execute(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
                        tableName, format.name()));
                fail("Expected query to fail due to unsupported compression codec");
            }
            catch (RuntimeException e) {
                assertTrue(e.getMessage().contains(expectedErrorMessage),
                        format("Error message should contain: '%s', but was: '%s'", expectedErrorMessage, e.getMessage()));
            }
        }
```
</issue_to_address>

### Comment 2
<location> `presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java:3109-3110` </location>
<code_context>
+            assertQuery(format("SELECT sum(custkey) FROM %s", tableName), "SELECT sum(custkey) FROM orders");
+            assertQuerySucceeds(format("DROP TABLE %s", tableName));
+        }
+        else {
+            assertQueryFails(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
+                            tableName, format.name()),
</code_context>

<issue_to_address>
**suggestion (testing):** Consider asserting the error message for unsupported codec/format combinations.

Please add assertions to verify that the error message for unsupported codec/format combinations matches the expected output.
</issue_to_address>
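A hedged sketch of such an assertion, reusing the variable names from the code context above and the message pattern from Comment 1 (the exact message is an assumption):

```java
// Illustration only: pin the expected error message for the unsupported codec/format combination.
String expectedErrorMessage = format("%s compression is not supported with %s", codec, format);
assertQueryFails(session,
        format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders", tableName, format.name()),
        expectedErrorMessage);
```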


@tdcmeehan tdcmeehan self-assigned this Oct 17, 2025
steveburnett
steveburnett previously approved these changes Oct 17, 2025
Contributor

@steveburnett steveburnett left a comment


LGTM! (docs)

Pull branch, local doc build. Thanks!

@steveburnett
Contributor

I approved the doc, but I have a question for you to consider. Your release note mentions Hive Connector Changes as well as the Iceberg Connector Changes.

Should the Hive Configuration Properties descriptions for hive.storage-format and hive.compression-codec be updated with these recent updates in the Iceberg connector?

@hantangwangd
Member Author

@steveburnett Thanks for your comment. After double-checking, I found that Hive's documentation for the hive.compression-codec property is quite general: "The compression codec to use when writing files." So it seems that this modification doesn't actually require any changes to Hive's documentation to maintain accuracy.

Besides, given that Hive supports many other file formats in addition to Parquet and ORC, creating a detailed guide on which compression codecs work with each format—like Iceberg has—might be better suited for a dedicated PR. What's your opinion?

@steveburnett
Contributor

Besides, given that Hive supports many other file formats in addition to Parquet and ORC, creating a detailed guide on which compression codecs work with each format—like Iceberg has—might be better suited for a dedicated PR. What's your opinion?

You make a great suggestion!

Given the added work, I agree that improving the Hive documentation for these properties to the same level of detail as we're achieving in the Iceberg doc definitely deserves a separate PR. Including that new research and doc work in this PR would significantly expand the scope of work for this PR much more than I consider reasonable.

Contributor

@PingLiuPing PingLiuPing left a comment


Thanks for the PR.
Just a few nits.

PingLiuPing
PingLiuPing previously approved these changes Oct 20, 2025
Contributor

@PingLiuPing PingLiuPing left a comment


Thanks.

I will pick this commit to build an image and run some perf tests.
Will let you know the results.

@hantangwangd hantangwangd force-pushed the add_compression_codecs branch 2 times, most recently from d234e8e to 572b005 Compare October 20, 2025 13:14
Contributor

@PingLiuPing PingLiuPing left a comment


Thanks.

@hantangwangd hantangwangd linked an issue Oct 21, 2025 that may be closed by this pull request
@hantangwangd hantangwangd force-pushed the add_compression_codecs branch from 572b005 to 8295fc5 Compare October 21, 2025 13:30
@steveburnett
Contributor

Besides, given that Hive supports many other file formats in addition to Parquet and ORC, creating a detailed guide on which compression codecs work with each format—like Iceberg has—might be better suited for a dedicated PR. What's your opinion?


I opened #26384 for the Hive documentation improvement.

@hantangwangd
Member Author

I opened #26384 for the Hive documentation improvement.

Thanks @steveburnett

These codecs are available in the writers, but don't seem to have been
configured correctly. Trying to write tables with these formats
previously threw errors. This change enables ZSTD on Parquet and
LZ4 on ORC for data writers in Hive and Iceberg
@hantangwangd hantangwangd force-pushed the add_compression_codecs branch from 8295fc5 to 39fedf0 Compare October 22, 2025 04:21
@hantangwangd hantangwangd merged commit b92b8f2 into prestodb:master Oct 22, 2025
110 of 111 checks passed
@hantangwangd hantangwangd deleted the add_compression_codecs branch October 22, 2025 07:21
@PingLiuPing
Contributor

PingLiuPing commented Oct 22, 2025

@hantangwangd

I ran a TPC-H load on an 8-worker cluster, each with 16 vCPUs and 128 GB of memory, with S3 as the backend storage.
With SF100:
SNAPPY -> 780s
GZIP -> 780s
ZSTD -> 407s

With SF1000:
SNAPPY -> 6780s
GZIP -> 6720s
ZSTD -> 3450s

@hantangwangd
Member Author

I ran a TPC-H load on an 8-worker cluster, each with 16 vCPUs and 128 GB of memory, with S3 as the backend storage.
With SF100: SNAPPY -> 780s, GZIP -> 780s, ZSTD -> 407s

With SF1000: SNAPPY -> 6780s, GZIP -> 6720s, ZSTD -> 3450s

@PingLiuPing Great, this makes a stronger case for setting ZSTD as Iceberg's default compression codec.

@PingLiuPing
Contributor

@PingLiuPing Great, this makes a stronger case for setting ZSTD as Iceberg's default compression codec.

Yes. Are you working on this? If not, I can submit a PR for it.

@hantangwangd
Member Author

@PingLiuPing I'm not working on this, so please feel free to submit a PR if you're interested.

PingLiuPing added a commit that referenced this pull request Oct 23, 2025
(#26399)

## Description

Iceberg used GZIP as the default compression for Parquet before version 1.4. See the info [here](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableProperties.java#L144-L147).

When the Iceberg connector was introduced to Presto, the Iceberg version was 0.9.0, and hence it used GZIP as the default compression codec at that time.

Now that Iceberg has changed its default compression codec to ZSTD, and the Iceberg version in Presto has been upgraded to 1.8.1, we should change the default compression codec to ZSTD to align with Iceberg.

Moreover, the performance test results show that ZSTD performs much better than GZIP. See the
[results](#26346 (comment)).


## Motivation and Context


## Impact

## Test Plan

show session output:

[Screenshot: show session output]

The actual data file metadata:

$ parquet-tools inspect 127dbe88-372d-4872-a00f-02669277732e.parquet

############ file meta data ############
created_by:
num_columns: 2
num_rows: 2
num_row_groups: 1
format_version: 1.0
serialized_size: 175


############ Columns ############
c1
c2

############ Column(c1) ############
name: c1
path: c1
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: -42%)

############ Column(c2) ############
name: c2
path: c2
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: -33%)


## Contributor checklist

- [ ] Please make sure your submission complies with our [contributing
guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md),
in particular [code
style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style)
and [commit
standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If
the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax,
functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF
Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or
higher (or obtained explicit TSC approval for lower scores).

## Release Notes
Please follow [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines)
and fill in the release notes below.

```
== RELEASE NOTES ==

Iceberg Connector Changes
* Replace default iceberg compression codec from GZIP to ZSTD.
```
Development

Successfully merging this pull request may close these issues.

Add support for missing compression codecs
