Skip to content

Conversation

codluca
Copy link
Member

@codluca codluca commented Sep 22, 2025

Add support for TopN pushdown for Oracle

Fixes #26566

Description

The newest FETCH FIRST number ROWS ONLY syntax is used (supported in Oracle version >=12).

Trino no longer supports older versions of Oracle (i.e. v11), which support only WHERE ROWNUM <= number syntax.

We applied the FETCH FIRST syntax to the LIMIT expression, which was converted before to "SELECT * FROM (%s ORDER BY %s) WHERE ROWNUM <= %s".

Oracle sets NULL values last by default. Sometimes, however, the default does not work, so we used explicit flag for nulls.

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(X) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

## Section
* Fix some things. ({issue}`26566`)

@codluca codluca force-pushed the 26566-support-topn-pushdown-for-oracle branch 2 times, most recently from 38c39d6 to ee583ef Compare October 4, 2025 11:19
@findepi
Copy link
Member

findepi commented Oct 6, 2025

Older versions of Oracle (i.e. v11) support only WHERE ROWNUM <= number syntax, not FETCH FIRST number ROWS ONLY.

according a google search, Oracle 11 is EOL and we generally do not support old versions of 3rd party products that are already sunset.

it should be fine to move on, especially when it simplifies connector code as it is in this case

@codluca
Copy link
Member Author

codluca commented Oct 7, 2025

Older versions of Oracle (i.e. v11) support only WHERE ROWNUM <= number syntax, not FETCH FIRST number ROWS ONLY.

according a google search, Oracle 11 is EOL and we generally do not support old versions of 3rd party products that are already sunset.

it should be fine to move on, especially when it simplifies connector code as it is in this case

Should the upgrade be done as part of this task?
This implies modifying applyLimit() also.
And upgrading the container image and, possibly, some unit tests.

TestingOracleServer and TestJdbcVendorCompatibility use gvenzl/oracle-xe:11.2.0.2-full.
TestDbResourceGroupsOracleFlywayMigration uses gvenzl/oracle-xe:18.4.0-slim
Should we upgrade to gvenzl/oracle-xe:18.4.0-full or gvenzl/oracle-xe:18.4.0-slim?

@github-actions github-actions bot added the jdbc Relates to Trino JDBC driver label Oct 7, 2025
@ebyhr
Copy link
Member

ebyhr commented Oct 7, 2025

Could you rebase on master branch? We've already upgraded the version in #26573 thanks to @takezoe

@codluca codluca force-pushed the 26566-support-topn-pushdown-for-oracle branch 2 times, most recently from 82b935e to ffb0956 Compare October 7, 2025 12:04
@codluca codluca force-pushed the 26566-support-topn-pushdown-for-oracle branch from 61d42c2 to ed60393 Compare October 7, 2025 12:07
@codluca
Copy link
Member Author

codluca commented Oct 7, 2025

Could you rebase on master branch? We've already upgraded the version in #26573 thanks to @takezoe

Rebased. Thanks

@codluca codluca force-pushed the 26566-support-topn-pushdown-for-oracle branch 2 times, most recently from 4c33a89 to 1149ac8 Compare October 13, 2025 08:31
Comment on lines 660 to 661
// In Oracle BINARY collation provides byte-based comparison
collation = "COLLATE BINARY";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Treating the bytes as signed or unsigned?

It's worth a code comment explaining that Trino string codepoint-based collation is equivalent to sorting (unsigned) bytes ... in UTF-8 encoding.
When we sent COLLATE BINARY to oracle, are we assuming the data is compared byte-wise on Utf-8 encoding, or can it be something else?

Copy link
Member Author

@codluca codluca Oct 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this comment:

Used COLLATE BINARY for ORDER BY for string types, in order to enforce a comparison similar to Trino.

Trino encodes all text as UTF-8, and sorts text as Unicode codepoints.

In Oracle BINARY collation provides unsigned byte-by-byte comparison.

If the Oracle DB uses UTF8 or AL32UTF8 charset, then BINARY collation will match Trino's sorting.

If the Oracle DB uses a non UTF8 charset, like LATIN1, then characters outside ASCII range
will not sort the same way as in Trino.

We have a limitation here, which seems similar to what exists for PostgreSQL connector.
PostgreSQL COLLATE "C" and Oracle COLLATE BINARY will do a byte-for-byte comparison. If the DB encoding is UTF, then this will match Trino's UTF-8 codepoint comparison.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks. that is a good comment.
Is this why PostgreSQL connector uses COLLATE C?

If the Oracle DB uses a non UTF8 charset, like LATIN1, then characters outside ASCII range
will not sort the same way as in Trino.

this sounds like a potential correctness issue. thanks for calling it out! cc @ebyhr @martint
I am not an Oracle DBA. Is it a known figure how often Oracle DBs uses non-UTF8 charset? Is it a theoretical problem or something to address? Can this be addressed automatically?

from https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Data-Types.html#GUID-A3C0D836-BADB-44E5-A5D4-265BA5968483

NVARCHAR2(size) Variable-length Unicode character string having maximum length size characters. You must specify size for NVARCHAR2. The number of bytes can be up to two times size for AL16UTF16 encoding and three times size for UTF8 encoding.

it sounds like Oracle assumes each codepoint is at most 3 bytes in UTF-8 encoding, whereas actual UTF-8 can produce 4 bytes for a single codepoint.
What does Oracle do when string contains codepoints > U+FFFF (outside Basic Multilingual Plane, like emojis)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add a boolean property to enable/disable the topN pushdown. The pushdown could be enabled by default, and the user could disable it in case the character encoding in not UTF8.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to inspect oracle server side setting to know whether pushdown is safe to do?

@codluca codluca force-pushed the 26566-support-topn-pushdown-for-oracle branch from 258eca4 to e713e73 Compare October 13, 2025 18:08
@codluca codluca force-pushed the 26566-support-topn-pushdown-for-oracle branch from 4efe27d to b822377 Compare October 15, 2025 17:01
The newest FETCH FIRST number ROWS ONLY syntax is used (supported in Oracle version >=12).

Trino no longer supports older versions of Oracle (i.e. v11), which support only WHERE ROWNUM <= number syntax.

We applied the FETCH FIRST syntax to the LIMIT expression, which was converted before to "SELECT * FROM (%s ORDER BY %s) WHERE ROWNUM <= %s".

Oracle sets NULL values last by default. Sometimes, however, the default does not work, so we used explicit flag for nulls.

Used COLLATE BINARY for ORDER BY for string types, in order to enforce a comparison similar to Trino.

Trino encodes all text as UTF-8, and sorts text as Unicode codepoints.

In Oracle BINARY collation provides unsigned byte-by-byte comparison.

If the Oracle DB uses UTF8 or AL32UTF8 charset, then BINARY collation will match Trino's sorting.

If the Oracle DB uses a non UTF8 charset, like WE8ISO8859P1 or WE8MSWIN1252,
then Oracle will not sort the same way as in Trino.
@codluca codluca force-pushed the 26566-support-topn-pushdown-for-oracle branch from 0702643 to 893305a Compare October 15, 2025 17:20
Comment on lines +677 to +678
// If the Oracle DB uses a non UTF8 charset, like WE8ISO8859P1 or WE8MSWIN1252,
// then Oracle will not sort the same way as in Trino.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kokosing please help me understand whether this is a very unlikely edge case that we can leave to be addressed later, or a pretty common thing, requiring this new feature to be opt-in

also, do we need a kill-switch?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collations are common.

@findepi findepi requested a review from kokosing October 16, 2025 08:33
Comment on lines +677 to +678
// If the Oracle DB uses a non UTF8 charset, like WE8ISO8859P1 or WE8MSWIN1252,
// then Oracle will not sort the same way as in Trino.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collations are common.


private boolean isCollatableInOracle(JdbcColumnHandle column)
{
String jdbcTypeName = column.getJdbcTypeHandle().jdbcTypeName()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also have io.trino.plugin.jdbc.JdbcTypeHandle#caseSensitivity. Can we use it to determine types with the same collation as in Trino?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed jdbc Relates to Trino JDBC driver oracle Oracle connector

Development

Successfully merging this pull request may close these issues.

Support TopN pushdown for Oracle connector

6 participants