
Conversation

@hantangwangd
Member

@hantangwangd hantangwangd commented Oct 17, 2025

Co-authored-by: Zac Blanco [email protected]

Description

These codecs are available in the writers, but they don't seem to have been configured correctly: trying to write tables with these codecs previously threw errors. This change enables LZ4 on ORC and ZSTD on Parquet for the data writers in Hive and Iceberg.

Addresses issue #26334

Motivation and Context

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.
  • If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== RELEASE NOTES ==

Iceberg Connector Changes
 * Add support for ``ZSTD`` compression codec in Parquet format
 * Add support for ``LZ4`` compression codec in ORC format

Hive Connector Changes
 * Add support for ``ZSTD`` compression codec in Parquet format
 * Add support for ``LZ4`` compression codec in ORC format

@sourcery-ai
Contributor

sourcery-ai bot commented Oct 17, 2025

Reviewer's Guide

This PR implements LZ4 support in ORC/PAGEFILE and ZSTD support in Parquet across the Hive and Iceberg connectors by extending the codec enum, updating writer and configuration layers, exposing the compression_codec session property, and parameterizing tests to verify all format/codec combinations.
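For orientation, codec selection flows through the `compression_codec` session property mentioned above. The fragment below is only a hedged, test-style illustration of that flow; the catalog names, table names, and the `getSession()`/`getQueryRunner()` helpers are assumptions borrowed from the test context, not code from this PR.

```java
import com.facebook.presto.Session;

// Illustration only: choose a codec per catalog through the compression_codec
// session property, then write a table in the matching file format.
Session hiveZstdParquet = Session.builder(getSession())
        .setCatalogSessionProperty("hive", "compression_codec", "ZSTD")
        .build();
getQueryRunner().execute(hiveZstdParquet,
        "CREATE TABLE test_zstd_parquet WITH (format = 'PARQUET') AS SELECT * FROM orders");

Session icebergLz4Orc = Session.builder(getSession())
        .setCatalogSessionProperty("iceberg", "compression_codec", "LZ4")
        .build();
getQueryRunner().execute(icebergLz4Orc,
        "CREATE TABLE test_lz4_orc WITH (format = 'ORC') AS SELECT * FROM orders");
```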

Sequence diagram for data write with new compression codecs

sequenceDiagram
    actor User
    participant HiveConnector
    participant IcebergConnector
    participant Writer
    User->>HiveConnector: Create table with ORC/LZ4 or PARQUET/ZSTD
    HiveConnector->>Writer: Configure writer with selected codec
    Writer->>HiveConnector: Write data using codec
    User->>IcebergConnector: Create table with ORC/LZ4 or PARQUET/ZSTD
    IcebergConnector->>Writer: Configure writer with selected codec
    Writer->>IcebergConnector: Write data using codec

Class diagram for updated HiveCompressionCodec enum

classDiagram
    class HiveCompressionCodec {
        +NONE
        +SNAPPY
        +GZIP
        +LZ4
        +ZSTD
        -Optional<Class<? extends CompressionCodec>> codec
        -CompressionKind orcCompressionKind
        -CompressionCodecName parquetCompressionCodec
        -Predicate<HiveStorageFormat> supportedStorageFormats
        +getOrcCompressionKind()
        +getParquetCompressionCodec()
        +isSupportedStorageFormat(HiveStorageFormat)
    }
    HiveCompressionCodec --> HiveStorageFormat
    HiveCompressionCodec --> CompressionKind
    HiveCompressionCodec --> CompressionCodecName
    HiveCompressionCodec --> CompressionCodec
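To make the diagram concrete, below is a minimal Java sketch of an enum along these lines. It reuses the real ORC `CompressionKind` and Parquet `CompressionCodecName` enums, but the class name, the format predicates, and the omission of the `Optional` Hadoop codec class are illustrative simplifications; the actual `HiveCompressionCodec` may differ in detail.

```java
import java.util.function.Predicate;

import com.facebook.presto.hive.HiveStorageFormat;
import org.apache.orc.CompressionKind;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Sketch only: map each codec to its ORC and Parquet equivalents and to the
// storage formats it may be combined with.
public enum CompressionCodecSketch
{
    NONE(CompressionKind.NONE, CompressionCodecName.UNCOMPRESSED, format -> true),
    SNAPPY(CompressionKind.SNAPPY, CompressionCodecName.SNAPPY, format -> true),
    GZIP(CompressionKind.ZLIB, CompressionCodecName.GZIP, format -> true),
    // Newly enabled: LZ4, restricted here to ORC and PAGEFILE (assumed restriction)
    LZ4(CompressionKind.LZ4, CompressionCodecName.LZ4,
            format -> format == HiveStorageFormat.ORC || format == HiveStorageFormat.PAGEFILE),
    // Newly enabled: ZSTD for both the ORC and Parquet writers
    ZSTD(CompressionKind.ZSTD, CompressionCodecName.ZSTD, format -> true);

    private final CompressionKind orcCompressionKind;
    private final CompressionCodecName parquetCompressionCodec;
    private final Predicate<HiveStorageFormat> supportedStorageFormats;

    CompressionCodecSketch(
            CompressionKind orcCompressionKind,
            CompressionCodecName parquetCompressionCodec,
            Predicate<HiveStorageFormat> supportedStorageFormats)
    {
        this.orcCompressionKind = orcCompressionKind;
        this.parquetCompressionCodec = parquetCompressionCodec;
        this.supportedStorageFormats = supportedStorageFormats;
    }

    public CompressionKind getOrcCompressionKind()
    {
        return orcCompressionKind;
    }

    public CompressionCodecName getParquetCompressionCodec()
    {
        return parquetCompressionCodec;
    }

    public boolean isSupportedStorageFormat(HiveStorageFormat format)
    {
        return supportedStorageFormats.test(format);
    }
}
```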

Class diagram for ParquetWriter compression codec handling

classDiagram
    class ParquetWriter {
        +ParquetWriter(..., String compressionCodecClass)
        -OutputStreamSliceOutput outputStream
        -List<String> names
        -CompressionCodec getCompressionCodec(String compressionCodecClass)
    }
    class OutputStreamSliceOutput
    class CompressionCodecClass
    ParquetWriter --> OutputStreamSliceOutput
    ParquetWriter --> CompressionCodecClass
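The ZStandard handling described here is essentially a class-name-to-codec mapping: the same codec can be referenced by its Hadoop class name or by its Parquet class name, and both must resolve to the same `CompressionCodecName`. The helper below is a hedged sketch of that idea, not the actual `getCompressionCodec` implementation in `ParquetWriter`.

```java
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Sketch only: resolve a compression codec class name to Parquet's CompressionCodecName.
final class ParquetCodecResolver
{
    private ParquetCodecResolver() {}

    static CompressionCodecName getCompressionCodec(String compressionCodecClass)
    {
        if (compressionCodecClass == null) {
            return CompressionCodecName.UNCOMPRESSED;
        }
        switch (compressionCodecClass) {
            // ZStandard may arrive under either the Hadoop or the Parquet codec class name
            case "org.apache.hadoop.io.compress.ZStandardCodec":
            case "org.apache.parquet.hadoop.codec.ZstandardCodec":
                return CompressionCodecName.ZSTD;
            case "org.apache.hadoop.io.compress.Lz4Codec":
                return CompressionCodecName.LZ4;
            case "org.apache.hadoop.io.compress.GzipCodec":
                return CompressionCodecName.GZIP;
            case "org.apache.hadoop.io.compress.SnappyCodec":
                return CompressionCodecName.SNAPPY;
            default:
                return CompressionCodecName.UNCOMPRESSED;
        }
    }
}
```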

Class diagram for IcebergUtil.populateTableProperties changes

classDiagram
    class IcebergUtil {
        +populateTableProperties(IcebergAbstractMetadata, Table, HiveCompressionCodec, ...)
    }
    IcebergUtil --> HiveCompressionCodec

Class diagram for ConfigurationUtils.setCompressionProperties changes

classDiagram
    class ConfigurationUtils {
        -setCompressionProperties(Configuration config, HiveCompressionCodec compression)
    }
    ConfigurationUtils --> HiveCompressionCodec
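Conceptually, `setCompressionProperties` translates the chosen codec into the configuration keys read by the ORC and Parquet output formats. The sketch below reuses the `CompressionCodecSketch` enum from the earlier sketch; the Hive/MapReduce property names are assumptions for illustration, and only `ParquetOutputFormat.COMPRESSION` is named in this PR's change summary.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.ParquetOutputFormat;

// Sketch only: push the selected codec into the Hadoop configuration consumed by the writers.
final class CompressionConfigSketch
{
    private CompressionConfigSketch() {}

    static void setCompressionProperties(Configuration config, CompressionCodecSketch compression)
    {
        if (compression == CompressionCodecSketch.NONE) {
            config.setBoolean("hive.exec.compress.output", false);
            return;
        }
        config.setBoolean("hive.exec.compress.output", true);
        // ORC picks its codec up from the default-compress key (assumed key name)
        config.set("hive.exec.orc.default.compress", compression.getOrcCompressionKind().name());
        // Parquet reads "parquet.compression", exposed as ParquetOutputFormat.COMPRESSION
        config.set(ParquetOutputFormat.COMPRESSION, compression.getParquetCompressionCodec().name());
    }
}
```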

Class diagram for session property exposure in HiveSessionProperties and IcebergSessionProperties

classDiagram
    class HiveSessionProperties {
        +COMPRESSION_CODEC
    }
    class IcebergSessionProperties {
        +COMPRESSION_CODEC
    }

File-Level Changes

Enable LZ4 and ZSTD as supported compression codecs in HiveCompressionCodec
  • Extend enum entries for LZ4 and ZSTD with the correct codec classes and supported formats
  • Switch the parquetCompressionCodec field and accessor from Optional to a direct CompressionCodecName
  Files:
  • presto-hive/src/main/java/com/facebook/presto/hive/HiveCompressionCodec.java

Propagate the new codec mappings in ParquetWriter, ConfigurationUtils, and IcebergFileWriterFactory
  • Recognize both the Hadoop and Parquet ZStandard codec class names in ParquetWriter
  • Remove Optional wrappers and set ParquetOutputFormat.COMPRESSION directly in ConfigurationUtils
  • Pass CompressionCodecName directly (not Optional) to IcebergFileWriterFactory
  Files:
  • presto-parquet/src/main/java/com/facebook/presto/parquet/writer/ParquetWriter.java
  • presto-hive/src/main/java/com/facebook/presto/hive/util/ConfigurationUtils.java
  • presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergFileWriterFactory.java

Write correct Parquet compression property values in IcebergUtil.populateTableProperties
  • Use name() on CompressionCodecName instead of Optional.get().toString() when populating PARQUET_COMPRESSION
  Files:
  • presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java

Make the compression_codec session property public for the Hive and Iceberg connectors
  • Expose COMPRESSION_CODEC as public static in HiveSessionProperties
  • Introduce a COMPRESSION_CODEC constant in IcebergSessionProperties
  Files:
  • presto-hive/src/main/java/com/facebook/presto/hive/HiveSessionProperties.java
  • presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSessionProperties.java

Introduce DataProviders and parameterized tests for format/codec combinations
  • Replace the standalone pagefile compression test with a DataProvider-driven testFormatAndCompressionCodecs in the Hive smoke tests
  • Add similar data-driven tests in the Iceberg distributed test base and smoke test
  • Update TestIcebergUtil to mark LZ4 and ZSTD as supported in the matrix
  • Simplify the FileFormat benchmark writer to drop the unsupported fallback and use direct codec class names
  Files:
  • presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java
  • presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java
  • presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedSmokeTestBase.java
  • presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergUtil.java
  • presto-hive/src/test/java/com/facebook/presto/hive/benchmark/FileFormat.java
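To make the DataProvider-driven tests described above concrete, here is a hedged TestNG-style sketch. The class, method, and provider names, the `hive` catalog, and the supported/unsupported matrix are illustrative assumptions rather than the actual test code added by this PR, and the query-runner setup required by the base class is omitted; the expected error message follows the pattern quoted in the review comments further down.

```java
import com.facebook.presto.Session;
import com.facebook.presto.tests.AbstractTestQueryFramework;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Test;

import static java.lang.String.format;

public abstract class FormatCompressionMatrixSketch
        extends AbstractTestQueryFramework
{
    @DataProvider(name = "formatAndCodec")
    public Object[][] formatAndCodec()
    {
        // {file format, compression codec, whether the combination is expected to work}
        return new Object[][] {
                {"ORC", "LZ4", true},
                {"ORC", "ZSTD", true},
                {"PARQUET", "ZSTD", true},
                {"PARQUET", "LZ4", false},
        };
    }

    @Test(dataProvider = "formatAndCodec")
    public void testFormatAndCompressionCodecs(String fileFormat, String codec, boolean supported)
    {
        String tableName = format("test_%s_%s", fileFormat, codec).toLowerCase();
        // Select the codec through the connector's compression_codec session property
        Session session = Session.builder(getSession())
                .setCatalogSessionProperty("hive", "compression_codec", codec)
                .build();
        String createTable = format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders", tableName, fileFormat);
        if (supported) {
            getQueryRunner().execute(session, createTable);
            assertQuery(format("SELECT sum(custkey) FROM %s", tableName), "SELECT sum(custkey) FROM orders");
            assertQuerySucceeds(format("DROP TABLE %s", tableName));
        }
        else {
            assertQueryFails(session, createTable,
                    format("%s compression is not supported with %s", codec, fileFormat));
        }
    }
}
```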


@hantangwangd hantangwangd changed the title Feat: Add support for LZ4 and ZSTD compression codecs feat: Add support for LZ4 and ZSTD compression codecs Oct 17, 2025
@hantangwangd hantangwangd force-pushed the add_compression_codecs branch from 39d97ab to 589ee3d Compare October 17, 2025 07:08
@hantangwangd hantangwangd changed the title feat: Add support for LZ4 and ZSTD compression codecs feat: Add support for LZ4 and ZSTD compression codecs Oct 17, 2025
@PingLiuPing
Contributor

This change enables LZ4 on Parquet and ZSTD on ORC for data writers in Hive and Iceberg

How about ZSTD on parquet?

@hantangwangd
Member Author

How about ZSTD on parquet?

Oh, I made a mistake. It should be adding support for ZSTD on Parquet and LZ4 on ORC. I've updated the PR description. Thanks for pointing this out.

@hantangwangd hantangwangd force-pushed the add_compression_codecs branch from 589ee3d to cf20c03 Compare October 17, 2025 10:02
@hantangwangd hantangwangd marked this pull request as ready for review October 17, 2025 13:24
Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:5722-5726` </location>
<code_context>
+            assertQuery(format("SELECT sum(custkey) FROM %s", tableName), "SELECT sum(custkey) FROM orders");
+            assertQuerySucceeds(format("DROP TABLE %s", tableName));
+        }
+        else {
+            assertQueryFails(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
+                            tableName, format.name()),
+                    format("%s compression is not supported with %s", codec, format));
</code_context>

<issue_to_address>
**suggestion (testing):** Consider adding assertions for error messages when unsupported codecs are used.

Adding assertions for specific error messages will ensure that error handling remains consistent and prevent regressions in messaging.

```suggestion
        else {
            String expectedErrorMessage = format("%s compression is not supported with %s", codec, format);
            assertQueryFails(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
                            tableName, format.name()),
                    expectedErrorMessage);
            // Additional assertion to verify the error message is present in the thrown exception
            try {
                getQueryRunner().execute(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
                        tableName, format.name()));
                fail("Expected query to fail due to unsupported compression codec");
            }
            catch (RuntimeException e) {
                assertTrue(e.getMessage().contains(expectedErrorMessage),
                        format("Error message should contain: '%s', but was: '%s'", expectedErrorMessage, e.getMessage()));
            }
        }
```
</issue_to_address>

### Comment 2
<location> `presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java:3109-3110` </location>
<code_context>
+            assertQuery(format("SELECT sum(custkey) FROM %s", tableName), "SELECT sum(custkey) FROM orders");
+            assertQuerySucceeds(format("DROP TABLE %s", tableName));
+        }
+        else {
+            assertQueryFails(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
+                            tableName, format.name()),
</code_context>

<issue_to_address>
**suggestion (testing):** Consider asserting the error message for unsupported codec/format combinations.

Please add assertions to verify that the error message for unsupported codec/format combinations matches the expected output.
</issue_to_address>
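A hedged sketch of such an assertion, reusing the variable names from the code context above and the message pattern from Comment 1 (the exact message is an assumption):

```java
// Illustration only: pin the expected error message for the unsupported codec/format combination.
String expectedErrorMessage = format("%s compression is not supported with %s", codec, format);
assertQueryFails(session,
        format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders", tableName, format.name()),
        expectedErrorMessage);
```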


@tdcmeehan tdcmeehan self-assigned this Oct 17, 2025
steveburnett
steveburnett previously approved these changes Oct 17, 2025
Contributor

@steveburnett steveburnett left a comment


LGTM! (docs)

Pull branch, local doc build. Thanks!

@steveburnett
Contributor

I approved the doc, but I have a question for you to consider. Your release note mentions Hive Connector Changes as well as the Iceberg Connector Changes.

Should the Hive Configuration Properties descriptions for hive.storage-format and hive.compression-codec be updated with these recent updates in the Iceberg connector?

@hantangwangd
Member Author

@steveburnett Thanks for your comment. After double-checking, I found that Hive's documentation for the hive.compression-codec property is quite general: "The compression codec to use when writing files." So it seems that this modification doesn't actually require any changes to Hive's documentation to maintain accuracy.

Besides, given that Hive supports many other file formats in addition to Parquet and ORC, creating a detailed guide on which compression codecs work with each format—like Iceberg has—might be better suited for a dedicated PR. What's your opinion?

@steveburnett
Contributor

Besides, given that Hive supports many other file formats in addition to Parquet and ORC, creating a detailed guide on which compression codecs work with each format—like Iceberg has—might be better suited for a dedicated PR. What's your opinion?

You make a great suggestion!

Given the added work, I agree that improving the Hive documentation for these properties to the same level of detail as we're achieving in the Iceberg doc definitely deserves a separate PR. Including that new research and doc work in this PR would significantly expand the scope of work for this PR much more than I consider reasonable.

Contributor

@PingLiuPing PingLiuPing left a comment


Thanks for the PR.
Just a few nits.

PingLiuPing
PingLiuPing previously approved these changes Oct 20, 2025
Contributor

@PingLiuPing PingLiuPing left a comment


Thanks.

I will pick this commit to build an image and run some perf tests.
Will let you know the results.

@hantangwangd hantangwangd force-pushed the add_compression_codecs branch 2 times, most recently from d234e8e to 572b005 Compare October 20, 2025 13:14
Contributor

@PingLiuPing PingLiuPing left a comment


Thanks.

@hantangwangd hantangwangd linked an issue Oct 21, 2025 that may be closed by this pull request
@hantangwangd hantangwangd force-pushed the add_compression_codecs branch from 572b005 to 8295fc5 Compare October 21, 2025 13:30
@steveburnett
Contributor

Besides, given that Hive supports many other file formats in addition to Parquet and ORC, creating a detailed guide on which compression codecs work with each format—like Iceberg has—might be better suited for a dedicated PR. What's your opinion?


I opened #26384 for the Hive documentation improvement.

@hantangwangd
Member Author

I opened #26384 for the Hive documentation improvement.

Thanks @steveburnett

These codecs are available in the writers, but don't seem to have been
configured correctly. Trying to write tables with these formats
previously threw errors. This change enables ZSTD on Parquet and
LZ4 on ORC for data writers in Hive and Iceberg
@hantangwangd hantangwangd force-pushed the add_compression_codecs branch from 8295fc5 to 39fedf0 Compare October 22, 2025 04:21
@hantangwangd hantangwangd merged commit b92b8f2 into prestodb:master Oct 22, 2025
110 of 111 checks passed
@hantangwangd hantangwangd deleted the add_compression_codecs branch October 22, 2025 07:21
@PingLiuPing
Contributor

PingLiuPing commented Oct 22, 2025

@hantangwangd

I ran a TPC-H load on an 8-worker cluster, each with 16 vCPUs and 128 GB of memory, with S3 as the backend storage.
With SF100:
SNAPPY -> 780s
GZIP -> 780s
ZSTD -> 407s

With SF1000:
SNAPPY -> 6780s
GZIP -> 6720s
ZSTD -> 3450s

@hantangwangd
Member Author

I ran a TPC-H load on an 8-worker cluster, each with 16 vCPUs and 128 GB of memory, with S3 as the backend storage.
With SF100: SNAPPY -> 780s, GZIP -> 780s, ZSTD -> 407s

With SF1000: SNAPPY -> 6780s, GZIP -> 6720s, ZSTD -> 3450s

@PingLiuPing Great, this makes a stronger case for setting ZSTD as Iceberg's default compression codec.

@PingLiuPing
Contributor

@PingLiuPing Great, this makes a stronger case for setting ZSTD as Iceberg's default compression codec.

Yes. Are you working on this? If not, I can submit a PR for it.

@hantangwangd
Member Author

@PingLiuPing I'm not working on this, so please feel free to submit a PR if you're interested.

PingLiuPing added a commit that referenced this pull request Oct 23, 2025
(#26399)

## Description

Iceberg used GZIP as the default compression for Parquet before version 1.4. See the info [here](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableProperties.java#L144-L147).

When the Iceberg connector was introduced to Presto, the Iceberg version was 0.9.0, and hence it used GZIP as the default compression codec at that time.

Now that Iceberg has changed its default compression codec to ZSTD, and the Iceberg version in Presto has been upgraded to 1.8.1, we should change the default compression codec to ZSTD to align with Iceberg.

Moreover, the performance test results show that ZSTD performs much better than GZIP. See the
[results](#26346 (comment)).


## Motivation and Context


## Impact

## Test Plan

show session output:

[Screenshot: show session output]

The actual data file metadata:

$ parquet-tools inspect 127dbe88-372d-4872-a00f-02669277732e.parquet

############ file meta data ############
created_by:
num_columns: 2
num_rows: 2
num_row_groups: 1
format_version: 1.0
serialized_size: 175


############ Columns ############
c1
c2

############ Column(c1) ############
name: c1
path: c1
max_definition_level: 1
max_repetition_level: 0
physical_type: INT32
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: -42%)

############ Column(c2) ############
name: c2
path: c2
max_definition_level: 1
max_repetition_level: 0
physical_type: INT64
logical_type: None
converted_type (legacy): NONE
compression: ZSTD (space_saved: -33%)


## Contributor checklist

- [ ] Please make sure your submission complies with our [contributing
guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md),
in particular [code
style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style)
and [commit
standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [ ] PR description addresses the issue accurately and concisely. If
the change is non-trivial, a GitHub Issue is referenced.
- [ ] Documented new properties (with its default value), SQL syntax,
functions, or other functionality.
- [ ] If release notes are required, they follow the [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [ ] Adequate tests were added if applicable.
- [ ] CI passed.
- [ ] If adding new dependencies, verified they have an [OpenSSF
Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or
higher (or obtained explicit TSC approval for lower scores).

## Release Notes
Please follow [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines)
and fill in the release notes below.

```
== RELEASE NOTES ==

Iceberg Connector Changes
* Replace default iceberg compression codec from GZIP to ZSTD.
```
Development

Successfully merging this pull request may close these issues.

Add support for missing compression codecs
