songkant-aws
Contributor

@songkant-aws songkant-aws commented Sep 12, 2025

Description

Support project expression pushdown with derived field script.

This is the first phase of script project pushdown; only some script projects are supported so far. Follow-ups, such as partial filter pushdown after script project pushdown, will be implemented later.

DerivedFieldScript Pros and Cons:

Pros:

  1. DerivedFieldScript is evaluated during the search phase, which means some non-script filtering is still possible.
  2. Some aggregations over derived fields are supported. See: https://docs.opensearch.org/latest/field-types/supported-field-types/derived/#aggregations

Cons:

  1. Only a limited set of types is supported. See: https://docs.opensearch.org/latest/field-types/supported-field-types/derived/#emitting-values-in-scripts. Unsupported data types need one more layer in Calcite to produce the correct data type.
  2. Scoring and sorting are not supported.

Script field Pros and Cons:

Pros:

  1. The output is an Object, which lets Calcite flexibly convert it to the right data type.

Cons:

  1. script_fields are evaluated after the search phase, so they do not support filtering.
  2. Sorting is not supported either.
  3. They can't be involved in aggregations.
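To make the comparison concrete, here is a rough sketch of the two request shapes being compared. The field name age0 and the Painless expression are illustrative assumptions only; the script language and source that this PR actually generates are not shown in this description. The request keys ("derived", "fields", "script_fields", and the emit(...) call) follow the public OpenSearch documentation as I understand it.

```java
final class ProjectPushdownRequests {

  /** Search-time derived field: defined under "derived" and retrieved through "fields". */
  static String derivedField(String name, String type, String painlessExpr) {
    return """
        {
          "derived": {
            "%s": { "type": "%s", "script": { "source": "emit(%s)" } }
          },
          "fields": ["%s"]
        }
        """.formatted(name, type, painlessExpr, name);
  }

  /** Script field: computed per hit after the search phase, returned as an untyped value. */
  static String scriptField(String name, String painlessExpr) {
    return """
        {
          "script_fields": {
            "%s": { "script": { "source": "%s" } }
          }
        }
        """.formatted(name, painlessExpr);
  }

  public static void main(String[] args) {
    // Hypothetical expression standing in for "eval age = age + 2".
    String expr = "doc['age'].value + 2";
    System.out.println(derivedField("age0", "long", expr));
    System.out.println(scriptField("age0", expr));
  }
}
```

The derived field has to declare a type up front, which is where the limited-type con comes from; the script field carries no type, which is why its output arrives as a plain Object.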

Benchmark Results After Optimization

CalcitePPLBig5IT:
Summary:
asc_sort_timestamp: 8 ms
asc_sort_timestamp_can_match_shortcut: 13 ms
asc_sort_timestamp_no_can_match_shortcut: 12 ms
asc_sort_with_after_timestamp: 9 ms
bin_bins: 7 ms
bin_span_log: 8 ms
bin_span_time: 16 ms
composite_date_histogram_daily: 23 ms
composite_terms: 52 ms
composite_terms_keyword: 27 ms
date_histogram_hourly_agg: 13 ms
date_histogram_minute_agg: 22 ms
default: 9 ms
desc_sort_timestamp: 10 ms
desc_sort_timestamp_can_match_shortcut: 16 ms
desc_sort_timestamp_no_can_match_shortcut: 24 ms
desc_sort_with_after_timestamp: 9 ms
keyword_in_range: 23 ms
keyword_terms: 17 ms
keyword_terms_low_cardinality: 13 ms
multi_terms_keyword: 25 ms
query_string_on_message: 14 ms
query_string_on_message_filtered: 34 ms
query_string_on_message_filtered_sorted_num: 39 ms
range: 12 ms
range_auto_date_histo: 37 ms
range_auto_date_histo_with_metrics: 72 ms
range_field_conjunction_big_range_big_term_query: 10 ms
range_field_conjunction_small_range_big_term_query: 8 ms
range_field_conjunction_small_range_small_term_query: 15 ms
range_field_disjunction_big_range_small_term_query: 10 ms
range_numeric: 11 ms
range_with_asc_sort: 17 ms
range_with_desc_sort: 15 ms
scroll: 7 ms
sort_keyword_can_match_shortcut: 14 ms
sort_keyword_no_can_match_shortcut: 14 ms
sort_numeric_asc: 14 ms
sort_numeric_asc_with_match: 17 ms
sort_numeric_desc: 22 ms
sort_numeric_desc_with_match: 15 ms
term: 17 ms
terms_significant_1: 19 ms
terms_significant_2: 16 ms
Total 44 queries succeed. Average duration: 18 ms

CalcitePPLClickBenchIT:
Summary:
q1: 21 ms
q10: 59 ms
q11: 26 ms
q12: 31 ms
q13: 15 ms
q14: 21 ms
q15: 18 ms
q16: 13 ms
q17: 14 ms
q18: 10 ms
q19: 19 ms
q2: 18 ms
q20: 8 ms
q21: 11 ms
q22: 19 ms
q23: 24 ms
q24: 17 ms
q25: 15 ms
q26: 13 ms
q27: 15 ms
q28: 32 ms
q3: 21 ms
q31: 29 ms
q32: 31 ms
q33: 24 ms
q34: 12 ms
q35: 15 ms
q36: 18 ms
q37: 24 ms
q38: 22 ms
q39: 23 ms
q4: 16 ms
q40: 26 ms
q41: 27 ms
q42: 23 ms
q43: 31 ms
q5: 12 ms
q6: 12 ms
q7: 17 ms
q8: 21 ms
q9: 23 ms
Total 41 queries succeed. Average duration: 20 ms

Related Issues

Resolves #3387

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Comment on lines 123 to 125
OpenSearchIndexScanRule::isScriptProjectPushed)
.and(OpenSearchIndexScanRule::isProjectPushed)
.and(OpenSearchIndexScanRule::noLimitPushed)
Contributor Author

Ideally, we shouldn't need such a complex condition check. Script project pushdown can be merged into the project pushdown method. I will optimize this later.

Contributor Author

Merged the two kinds of project pushdown into one method. This removes their dependency on each other and reduces the number of plan rewrites. It should also make the current logic easier to modify if we introduce more rules related to project pushdown.

@songkant-aws songkant-aws changed the title from "[Feature] Support project expression pushdown with derived field script" to "Support project expression pushdown with derived field script" Sep 12, 2025
@songkant-aws songkant-aws marked this pull request as ready for review September 12, 2025 10:45
Signed-off-by: Songkan Tang <[email protected]>
@qianheng-aws
Collaborator

#4245 has been merged and opensearch-project/OpenSearch#19271 has also been addressed by core.

Any other blocker or concern for this PR? @songkant-aws @LantaoJin @yuancu

}
// Ignored Project in cost accumulation, but it will affect the external cost
- case PROJECT -> {}
+ case PROJECT, SCRIPT_PROJECT -> {}
Collaborator

Shall we count the cost of SCRIPT_PROJECT, since it should bring more overhead on the cluster than PROJECT? Otherwise the cost computation will be unfair to the non-pushdown plan.

Contributor Author

Added new cost calculation for SCRIPT_PROJECT
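
For readers following the thread, the idea can be sketched roughly as below. The operation names mirror the enum constants in the diff, but the weights and the accumulation formula are hypothetical placeholders, not the values used in this PR.

```java
import java.util.List;

final class PushdownCostSketch {
  enum Op { PROJECT, SCRIPT_PROJECT, FILTER }

  // Hypothetical weights, chosen only to make the example concrete.
  static final double FILTER_SELECTIVITY = 0.5;
  static final double SCRIPT_COST_PER_ROW = 2.0;

  /** Accumulates a cost over the operations already pushed into the scan. */
  static double cost(List<Op> pushed, double rowCount) {
    double rows = rowCount;
    double cost = 0;
    for (Op op : pushed) {
      switch (op) {
        // Plain column pruning is ignored in the accumulation.
        case PROJECT -> {}
        // A script runs once per surviving row, so it is charged per row.
        case SCRIPT_PROJECT -> cost += rows * SCRIPT_COST_PER_ROW;
        // A filter is charged for the rows it reads and reduces the rows seen afterwards.
        case FILTER -> {
          cost += rows;
          rows *= FILTER_SELECTIVITY;
        }
      }
    }
    return cost;
  }

  public static void main(String[] args) {
    System.out.println(cost(List.of(Op.PROJECT), 1000));
    System.out.println(cost(List.of(Op.SCRIPT_PROJECT), 1000));
  }
}
```

With a non-zero charge for SCRIPT_PROJECT, the planner no longer sees a scripted projection as free compared with evaluating the expression outside the scan.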

for (int i = 0; i < projExprs.size(); i++) {
final RexNode projExpr = projExprs.get(i);
if (isPushableNewDerived(projExpr, derivedIndexSet, scan)) {
final String uniquifiedAlias =
Collaborator

With this method, ... | eval a = a + 1 will produce a new derived field a1, won't it?

Contributor Author

Yes, it will produce a0.

Interestingly, I find that although it may physically generate an EnumerableScan with a rowType like [age0 BIGINT], the EnumerableScan's schema is still [age BIGINT]. I think the check RexUtil.isIdentity(newExprs, newScan.getRowType()) decides that, as long as the enumerator result is correct (the RexInput is the same), it doesn't matter what the scan's actual rowType is.

Added an IT called testFieldsWithNameConflictDerivedFieldPushdown to ensure query correctness.
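
A minimal sketch of the renaming behavior discussed here: when a new derived field collides with an existing name, a numeric suffix starting at 0 is appended. The helper below is a hypothetical stand-in written for illustration; it is not the code in this PR and not Calcite's own utility.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class AliasUniquifier {

  /** Returns name if unused, otherwise name0, name1, ... and records the result in used. */
  static String uniquify(String name, Set<String> used) {
    if (used.add(name)) {
      return name;
    }
    for (int i = 0; ; i++) {
      String candidate = name + i;
      if (used.add(candidate)) {
        return candidate;
      }
    }
  }

  public static void main(String[] args) {
    // Field names already present on the scan (assumed for the example).
    Set<String> used = new LinkedHashSet<>(List.of("age", "lastname"));
    System.out.println(uniquify("age", used)); // prints "age0"
  }
}
```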

if (isSequential && !Objects.equals(integer, current++)) {
public boolean add(SelectedColumn item) {
if (isSequential
&& item.getKind() == Kind.PHYSICAL
Collaborator

Should Kind.DERIVED_EXISTING be included here as well?

Contributor Author

Added
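
One way to read the resulting check, as a sketch only: the Kind names come from the diff, while the DERIVED_NEW constant, the class layout, and the exact condition are my guesses at the intent. The flag is presumably there to detect a selection that leaves the scan output unchanged.

```java
import java.util.ArrayList;
import java.util.List;

final class SelectedColumnsSketch {
  enum Kind { PHYSICAL, DERIVED_EXISTING, DERIVED_NEW }

  /** oldIndex is the position in the current scan output; -1 for a newly derived field. */
  record SelectedColumn(Kind kind, int oldIndex) {
    static SelectedColumn physical(int oldIndex) {
      return new SelectedColumn(Kind.PHYSICAL, oldIndex);
    }

    static SelectedColumn derivedExisting(int oldIndex) {
      return new SelectedColumn(Kind.DERIVED_EXISTING, oldIndex);
    }

    static SelectedColumn derivedNew() {
      return new SelectedColumn(Kind.DERIVED_NEW, -1);
    }
  }

  private final List<SelectedColumn> items = new ArrayList<>();
  private boolean isSequential = true;
  private int current = 0;

  /** Adds a column and keeps tracking whether the selection is still 0, 1, 2, ... */
  boolean add(SelectedColumn item) {
    // Both kinds read a column that the scan already outputs.
    boolean readsScanOutput =
        item.kind() == Kind.PHYSICAL || item.kind() == Kind.DERIVED_EXISTING;
    if (isSequential && !(readsScanOutput && item.oldIndex() == current++)) {
      isSequential = false;
    }
    return items.add(item);
  }

  boolean isSequential() {
    return isSequential;
  }
}
```

Under this reading, a previously pushed derived field selected in its original position does not break the sequence, while any newly derived field does.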

if (!seenOldIndex.get(oldIdx)) {
seenOldIndex.set(oldIdx);
if (derivedIndexSet.get(oldIdx)) {
selected.add(SelectedColumn.derivedExisting(oldIdx));
Collaborator

What is the purpose of distinguishing DERIVED_EXISTING from PHYSICAL?

Collaborator

I'm wondering whether the reason we need DERIVED_EXISTING is the SCAN-PROJECT-PROJECT case. Alternatively, shall we handle only the SCAN-PROJECT case, which should be produced by the project merge rule, and prevent project pushdown if a SCRIPT_PROJECT has already been pushed into the scan?

Contributor Author

@songkant-aws songkant-aws Oct 17, 2025

As discussed offline, ProjectMergeRule can't always merge the projects in a plan. For example: project(a) - sort a + b - project(a, a + b) - scan. Here a + b is a complex expression that is needed by the sort immediately above it, so the two projects stay separate. In this case, it is more straightforward to allow multiple project pushdowns, although the inner logic becomes more complex.

Also, allowing multiple project pushdowns brings more flexibility. If we don't see this requirement in the future, we can disable it.

final int pos = projIdxToNewPos.get(i);
newExprs.add(call.builder().getRexBuilder().makeInputRef(projExprs.get(i).getType(), pos));
} else {
newExprs.add(RexUtil.apply(oldIdxToNewPos, projExprs.get(i)));
Collaborator

RexUtil.apply creates a new shuttle on every call. Creating one for each projExpr seems expensive, since the shuttle would be the same every time.

Collaborator

https://github.com/opensearch-project/sql/pull/3951/files#diff-5ffab7c85f9c37e1ce56e8742848701dbe1baa77149aed8868189c01f51c8436

How about creating our own extended RexPermuteInputsShuttle and AbstractMapping? That shuttle should be able to handle all kinds of SelectedItems, which would simplify both the mapping construction and the expression transformation. I did similar work in the draft PR above.

Contributor Author

Done
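
The shape of that change, sketched without Calcite on the classpath: the remapper is built once from the old-to-new index mapping and then reused for every projection expression. The Expr, InputRef, Call, and InputRemapper types are simplified stand-ins invented for this example; they are not the Calcite classes or the shuttle added in this PR.

```java
import java.util.List;
import java.util.stream.Collectors;

final class RemapSketch {
  interface Expr {}

  /** Reference to an input column by position (stand-in for an input ref node). */
  record InputRef(int index) implements Expr {}

  /** Operator call over operands (stand-in for a call node). */
  record Call(String op, List<Expr> operands) implements Expr {}

  /** Rewrites input references using a fixed old-index to new-index table. */
  static final class InputRemapper {
    private final int[] oldToNew;

    InputRemapper(int[] oldToNew) {
      this.oldToNew = oldToNew.clone();
    }

    Expr apply(Expr e) {
      if (e instanceof InputRef ref) {
        return new InputRef(oldToNew[ref.index()]);
      }
      Call call = (Call) e;
      return new Call(
          call.op(),
          call.operands().stream().map(this::apply).collect(Collectors.toList()));
    }
  }

  /** Builds the remapper once and reuses it for all expressions. */
  static List<Expr> remapAll(List<Expr> exprs, int[] oldToNew) {
    InputRemapper remapper = new InputRemapper(oldToNew);
    return exprs.stream().map(remapper::apply).collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Old columns 0 and 2 move to new positions 1 and 0; column 1 is dropped (-1).
    Expr aPlusB = new Call("+", List.of(new InputRef(0), new InputRef(2)));
    System.out.println(remapAll(List.of(aPlusB), new int[] {1, -1, 0}));
  }
}
```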

// Script filter of derived field input is not supported
.and(
Predicate.not(
OpenSearchIndexScanRule::isScriptProjectPushed))
Collaborator

What happens if an agg or sort on a derived field is pushed down? Could you please add a test for that case?

Contributor Author

@songkant-aws songkant-aws Oct 17, 2025

I disabled agg pushdown and complex sort expression pushdown on derived fields. Agg pushdown matches agg - project - scan and optimizes it with our own rule. Added a test case, testScriptSort. The existing agg tests should already cover agg pushdown.
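
The guard pattern used for this kind of restriction can be sketched as a composed predicate. ScanState and its fields are hypothetical stand-ins for the scan operator; only the Predicate.not(...) composition mirrors the diff above.

```java
import java.util.function.Predicate;

final class PushdownGuardSketch {

  /** Hypothetical summary of what has already been pushed into a scan. */
  record ScanState(boolean scriptProjectPushed, boolean limitPushed) {
    boolean isScriptProjectPushed() {
      return scriptProjectPushed;
    }

    boolean noLimitPushed() {
      return !limitPushed;
    }
  }

  /** A rule fires only when no limit and no script project have been pushed yet. */
  static final Predicate<ScanState> CAN_PUSH =
      ((Predicate<ScanState>) ScanState::noLimitPushed)
          .and(Predicate.not(ScanState::isScriptProjectPushed));

  public static void main(String[] args) {
    System.out.println(CAN_PUSH.test(new ScanState(false, false))); // prints "true"
    System.out.println(CAN_PUSH.test(new ScanState(true, false))); // prints "false"
  }
}
```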

@songkant-aws
Contributor Author

Could you take another look at this PR? @qianheng-aws @LantaoJin @yuancu

Collaborator

@yuancu yuancu left a comment

LGTM

Comment on lines +98 to +101
"source=opensearch-sql_test_index_account"
+ "| eval age = age + 2"
+ "| fields age, lastname"));
}
Collaborator

It seems that in the plan the new age field becomes age0. I'm curious where it is set back to the name age.

Collaborator

It seems the column names in our final results are derived from the original plan (i.e. the logical plan), so the final plan (i.e. the physical plan) is allowed to produce a different row type as long as the types match.

@qianheng-aws
Collaborator

@LantaoJin Please take another look at this PR.

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] Calcite Engine Framework: pushdown whole project instead of reference

4 participants