-
Notifications
You must be signed in to change notification settings - Fork 5.5k
feat: Add support for LZ4 and ZSTD compression codecs #26346
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add support for LZ4 and ZSTD compression codecs #26346
Conversation
Reviewer's GuideThis PR implements LZ4 support in ORC/PAGEFILE and ZSTD support in Parquet across the Hive and Iceberg connectors by extending the codec enum, updating writer and configuration layers, exposing the compression_codec session property, and parameterizing tests to verify all format/codec combinations. Sequence diagram for data write with new compression codecssequenceDiagram
actor User
participant HiveConnector
participant IcebergConnector
participant Writer
User->>HiveConnector: Create table with ORC/LZ4 or PARQUET/ZSTD
HiveConnector->>Writer: Configure writer with selected codec
Writer->>HiveConnector: Write data using codec
User->>IcebergConnector: Create table with ORC/LZ4 or PARQUET/ZSTD
IcebergConnector->>Writer: Configure writer with selected codec
Writer->>IcebergConnector: Write data using codec
Class diagram for updated HiveCompressionCodec enumclassDiagram
class HiveCompressionCodec {
+NONE
+SNAPPY
+GZIP
+LZ4
+ZSTD
-Optional<Class<? extends CompressionCodec>> codec
-CompressionKind orcCompressionKind
-CompressionCodecName parquetCompressionCodec
-Predicate<HiveStorageFormat> supportedStorageFormats
+getOrcCompressionKind()
+getParquetCompressionCodec()
+isSupportedStorageFormat(HiveStorageFormat)
}
HiveCompressionCodec --> HiveStorageFormat
HiveCompressionCodec --> CompressionKind
HiveCompressionCodec --> CompressionCodecName
HiveCompressionCodec --> CompressionCodec
Class diagram for ParquetWriter compression codec handlingclassDiagram
class ParquetWriter {
+ParquetWriter(..., String compressionCodecClass)
-OutputStreamSliceOutput outputStream
-List<String> names
-CompressionCodec getCompressionCodec(String compressionCodecClass)
}
class OutputStreamSliceOutput
class CompressionCodecClass
ParquetWriter --> OutputStreamSliceOutput
ParquetWriter --> CompressionCodecClass
Class diagram for IcebergUtil.populateTableProperties changesclassDiagram
class IcebergUtil {
+populateTableProperties(IcebergAbstractMetadata, Table, HiveCompressionCodec, ...)
}
IcebergUtil --> HiveCompressionCodec
Class diagram for ConfigurationUtils.setCompressionProperties changesclassDiagram
class ConfigurationUtils {
-setCompressionProperties(Configuration config, HiveCompressionCodec compression)
}
ConfigurationUtils --> HiveCompressionCodec
Class diagram for session property exposure in HiveSessionProperties and IcebergSessionPropertiesclassDiagram
class HiveSessionProperties {
+COMPRESSION_CODEC
}
class IcebergSessionProperties {
+COMPRESSION_CODEC
}
File-Level Changes
Possibly linked issues
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
39d97ab to
589ee3d
Compare
How about ZSTD on parquet? |
Oh, I made a mistake. It should be adding support for ZSTD on Parquet and LZ4 on ORC. I've updated the PR description. Thanks for pointing out this. |
589ee3d to
cf20c03
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `presto-hive/src/test/java/com/facebook/presto/hive/TestHiveIntegrationSmokeTest.java:5722-5726` </location>
<code_context>
+ assertQuery(format("SELECT sum(custkey) FROM %s", tableName), "SELECT sum(custkey) FROM orders");
+ assertQuerySucceeds(format("DROP TABLE %s", tableName));
+ }
+ else {
+ assertQueryFails(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
+ tableName, format.name()),
+ format("%s compression is not supported with %s", codec, format));
</code_context>
<issue_to_address>
**suggestion (testing):** Consider adding assertions for error messages when unsupported codecs are used.
Adding assertions for specific error messages will ensure that error handling remains consistent and prevent regressions in messaging.
```suggestion
else {
String expectedErrorMessage = format("%s compression is not supported with %s", codec, format);
assertQueryFails(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
tableName, format.name()),
expectedErrorMessage);
// Additional assertion to verify the error message is present in the thrown exception
try {
getQueryRunner().execute(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
tableName, format.name()));
fail("Expected query to fail due to unsupported compression codec");
}
catch (RuntimeException e) {
assertTrue(e.getMessage().contains(expectedErrorMessage),
format("Error message should contain: '%s', but was: '%s'", expectedErrorMessage, e.getMessage()));
}
}
```
</issue_to_address>
### Comment 2
<location> `presto-iceberg/src/test/java/com/facebook/presto/iceberg/IcebergDistributedTestBase.java:3109-3110` </location>
<code_context>
+ assertQuery(format("SELECT sum(custkey) FROM %s", tableName), "SELECT sum(custkey) FROM orders");
+ assertQuerySucceeds(format("DROP TABLE %s", tableName));
+ }
+ else {
+ assertQueryFails(session, format("CREATE TABLE %s WITH (format = '%s') AS SELECT * FROM orders",
+ tableName, format.name()),
</code_context>
<issue_to_address>
**suggestion (testing):** Consider asserting the error message for unsupported codec/format combinations.
Please add assertions to verify that the error message for unsupported codec/format combinations matches the expected output.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! (docs)
Pull branch, local doc build. Thanks!
|
I approved the doc, but I have a question for you to consider. Your release note mentions Hive Connector Changes as well as the Iceberg Connector Changes. Should the Hive Configuration Properties descriptions for |
|
@steveburnett Thanks for your comment. After double-checking, I found that Hive's documentation for the Besides, given that Hive supports many other file formats in addition to Parquet and ORC, creating a detailed guide on which compression codecs work with each format—like Iceberg has—might be better suited for a dedicated PR. What's your opinion? |
You make a great suggestion! Given the added work, I agree that improving the Hive documentation for these properties to the same level of detail as we're achieving in the Iceberg doc definitely deserves a separate PR. Including that new research and doc work in this PR would significantly expand the scope of work for this PR much more than I consider reasonable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR.
Just few nits.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSessionProperties.java
Show resolved
Hide resolved
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java
Outdated
Show resolved
Hide resolved
presto-parquet/src/main/java/com/facebook/presto/parquet/writer/ParquetWriter.java
Show resolved
Hide resolved
cf20c03 to
2228d73
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
I will pick this commit to build a image and run some perf test.
Will let you know the result.
d234e8e to
572b005
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
572b005 to
8295fc5
Compare
I opened #26384 for the Hive documentation improvement. |
Thanks @steveburnett |
These codecs are available in the writers, but don't seem to have been configured correctly. Trying to write tables with these formats previously threw errors. This change enables ZSTD on Parquet and LZ4 on ORC for data writers in Hive and Iceberg
8295fc5 to
39fedf0
Compare
|
I run tpch load on a 8 workers cluster, each with 16 vCPU and 128 GB memory. And backend storage is S3. With SF1000: |
@PingLiuPing Great, this makes setting ZSTD as Iceberg's default compression codec make more sense. |
Yes, are you working on this? If not I can submit a PR for this. |
|
@PingLiuPing I'm not working on this, so please feel free to submit a PR if you're interested. |
#26399) ## Description <!---Describe your changes in detail--> Iceberg use GZIP as the default compression for parquet before version 1.4. See info [here](https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/TableProperties.java#L144-L147) And when iceberg connector was introduced to Presto, the version of Iceberg is 0.9.0. And hence it uses GZIP as the default compression codec at that time. Now that Iceberg has changed the default compression codec to ZSTD. And the iceberg version in Presto has upgraded to 1.8.1. We should change the default compression codec to ZSTD to align with iceberg. Moreover, from the performance test result I found that ZSTD has much better performance over GZIP. See [results](#26346 (comment)). ## Motivation and Context <!---Why is this change required? What problem does it solve?--> <!---If it fixes an open issue, please link to the issue here.--> ## Impact <!---Describe any public API or user-facing feature change or any performance impact--> ## Test Plan <!---Please fill in how you tested your change--> show session output: <img width="1477" height="31" alt="Screenshot 2025-10-22 at 11 09 52" src="https://github.com/user-attachments/assets/f98d0d3b-6a08-4952-b29d-39fdb234f4d8" /> The actual data file metadata: $ parquet-tools inspect 127dbe88-372d-4872-a00f-02669277732e.parquet ############ file meta data ############ created_by: num_columns: 2 num_rows: 2 num_row_groups: 1 format_version: 1.0 serialized_size: 175 ############ Columns ############ c1 c2 ############ Column(c1) ############ name: c1 path: c1 max_definition_level: 1 max_repetition_level: 0 physical_type: INT32 logical_type: None converted_type (legacy): NONE compression: ZSTD (space_saved: -42%) ############ Column(c2) ############ name: c2 path: c2 max_definition_level: 1 max_repetition_level: 0 physical_type: INT64 logical_type: None converted_type (legacy): NONE compression: ZSTD (space_saved: -33%) ## Contributor checklist - [ ] Please make sure your submission complies with our [contributing guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md), in particular [code style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style) and [commit standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards). - [ ] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced. - [ ] Documented new properties (with its default value), SQL syntax, functions, or other functionality. - [ ] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines). - [ ] Adequate tests were added if applicable. - [ ] CI passed. - [ ] If adding new dependencies, verified they have an [OpenSSF Scorecard](https://securityscorecards.dev/#the-checks) score of 5.0 or higher (or obtained explicit TSC approval for lower scores). ## Release Notes Please follow [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines) and fill in the release notes below. ``` == RELEASE NOTES == Iceberg Connector Changes * Replace default iceberg compression codec from GZIP to ZSTD. ```
Co-authored-by: Zac Blanco [email protected]
Description
These codecs are available in the writers, but don't seem to have been configured correctly. Trying to write tables with these formats previously threw errors. This change enables LZ4 on ORC and ZSTD on Parquet for data writers in Hive and Iceberg
Address issue #26334
Motivation and Context
Impact
Test Plan
Contributor checklist
Release Notes