-
Notifications
You must be signed in to change notification settings - Fork 5.5k
feat: Add missing LZ4 and ZSTD compression codec classes #25021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Add missing LZ4 and ZSTD compression codec classes #25021
Conversation
4f83cd2 to
6c8b2a9
Compare
presto-hive/src/main/java/com/facebook/presto/hive/HiveCompressionCodec.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good to me, just one nit.
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergSessionProperties.java
Outdated
Show resolved
Hide resolved
6c8b2a9 to
97778fe
Compare
97778fe to
74a3adf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After checking the code in detail, I found that we haven't really support LZ4 for PARQUET for now. Referring to the code here. So do you think it makes sense to allow LZ4 configuration once it's really supported for PARQUET?
| assertQuerySucceeds(session, format("CREATE TABLE %s (i bigint) WITH (\"write.format.default\" = '%s')", tableName, format.name())); | ||
| assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName)); | ||
| assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name())); | ||
| assertUpdate(format("INSERT INTO %s SELECT num FROM UNNEST(sequence(0, 1000)) as t(num)", tableName), "VALUES 1001"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| assertQuerySucceeds(session, format("CREATE TABLE %s (i bigint) WITH (\"write.format.default\" = '%s')", tableName, format.name())); | |
| assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName)); | |
| assertQuery(format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name())); | |
| assertUpdate(format("INSERT INTO %s SELECT num FROM UNNEST(sequence(0, 1000)) as t(num)", tableName), "VALUES 1001"); | |
| assertQuerySucceeds(session, format("CREATE TABLE %s WITH (\"write.format.default\" = '%s') as select * from lineitem with no data", tableName, format.name())); | |
| assertQuery(session, format("SELECT value FROM \"%s$properties\" WHERE key = 'write.%s.compression-codec'", tableName, format.name().toLowerCase(ROOT)), format("VALUES '%s'", codecName)); | |
| assertQuery(session, format("SELECT value FROM \"%s$properties\" WHERE key = 'write.format.default'", tableName), format("VALUES '%s'", format.name())); | |
| assertUpdate(session, format("INSERT INTO %s SELECT * from lineitem", tableName), "select count(*) from lineitem"); | |
| assertQuery(session, format("SELECT * FROM %s", tableName), "select * from lineitem"); |
It seems that if we do the insertion using the session that including a compression codec, we will get test fails on PARQUET + LZ4/ZSTD.
For ZSTD, we might try to fetch its corresponding CompressionCodecName using an mismatched fully qualified name. Referring to here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is a good catch! I have fixed the writers by adding support for LZ4 and Zstd when set in the session
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the following statements still fail for LZ4:
assertQuerySucceeds(session, format("CREATE TABLE %s WITH (\"write.format.default\" = '%s') as select * from lineitem with no data", tableName, format.name()));
assertUpdate(session, format("INSERT INTO %s SELECT * from lineitem", tableName), "select count(*) from lineitem");
assertQuery(format("SELECT * FROM %s", tableName), "select * from lineitem");
An error occurs when querying the actual parquet data compressed with LZ4. It seems there exists a problem with the airlift LZ4 compressor, as described in ParquetCompressor.getCompressor:
// When using airlift LZO or LZ4 compressor, decompressing page in reader throws exception.
......
The error information is as follows:
java.lang.IllegalArgumentException: Invalid offset or length (8, 16782243) in array of length 60232
at io.airlift.compress.lz4.Lz4Decompressor.verifyRange(Lz4Decompressor.java:108)
at io.airlift.compress.lz4.Lz4Decompressor.decompress(Lz4Decompressor.java:34)
at com.facebook.presto.parquet.ParquetCompressionUtils.decompress(ParquetCompressionUtils.java:151)
......
74a3adf to
7624603
Compare
|
I don't find the |
It appears that |
Great question! You're correct, and both hive.compression-codec and iceberg.compression-codec are documented in Configuration Properties topics in the Hive Connector and Iceberg Connector pages. I'm seeing three properties to discuss: 1 For the Hive and Iceberg connector-specific session properties, I agree that it would be appropriate to add them in the corresponding connector docs. 2 3 Revising the Hive Connector page to add a new Session Properties topic - which would imply gathering the various references to session properties to populate it - seems a large piece of work that isn't appropriate to add to this PR. I will open a doc issue about that reorganization of the Hive Connector page. In the meantime, what do you think of adding a mention of |
Doc issue created, see #25110. |
I just confirmed that there is no
Sounds great to me. |
These codecs are available in the writers, but don't seem to have been configured correctly. Trying to write tables with these formats previously threw errors. This change enables LZ4 and ZSTD compression for Parquet writers in Hive and Iceberg
7624603 to
c364bdc
Compare
|
Closing this in favor of PR #26346 |
Description
These codecs are available in the writers, but don't seem to have been configured correctly. Trying to write tables with these formats previously threw errors. This change enables LZ4 and ZSTD compression for Parquet writers in Iceberg and Hive
Motivation and Context
When users set the
compression_codecsession property or*.compression-codecconnector property with LZ4 or ZSTD with parquet format as the default, tables would fail to be created due to the codec being null inside HiveCompressionCodec inside of Iceberg. I couldn't find a good reason for keeping these null, so I populated the correct enum variants and added tests to ensure they worked. Since this code is shared between Iceberg and Hive connectors, I added tests for different file type and compression codec variants to ensure we have compatibility across all of the potential configuration combinations.Impact
LZ4andZSTDwhen creating iceberg tables with parquet as the default file formatTest Plan
Contributor checklist
Release Notes