-
Notifications
You must be signed in to change notification settings - Fork 176
Support project expression pushdown with derived field script #4288
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Support project expression pushdown with derived field script #4288
Conversation
Signed-off-by: Lantao Jin <[email protected]>
Signed-off-by: Lantao Jin <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
OpenSearchIndexScanRule::isScriptProjectPushed) | ||
.and(OpenSearchIndexScanRule::isProjectPushed) | ||
.and(OpenSearchIndexScanRule::noLimitPushed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we don't need such complex condition check. Script project pushdown can be merged to project pushdown method. Will optimize it later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged two kinds of project into one method. This reduces the dependency on each other and reduces the times of rewriting plan. Additionally, if we introduce more rules related to project pushdown, it could be easier to modify current logic.
2f2e0cc
to
debae75
Compare
Signed-off-by: Songkan Tang <[email protected]>
debae75
to
c7e154b
Compare
Signed-off-by: Songkan Tang <[email protected]>
aceee3e
to
5e9ec59
Compare
#4245 has been merged and opensearch-project/OpenSearch#19271 has also been addressed by core. Any other blocker or concern for this PR? @songkant-aws @LantaoJin @yuancu |
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
} | ||
// Ignored Project in cost accumulation, but it will affect the external cost | ||
case PROJECT -> {} | ||
case PROJECT, SCRIPT_PROJECT -> {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we count the cost of SCIRPT_PROJECT
since it should bring more overhead on cluster than PROJECT
? Otherwise it will be too unfair to non-push-down on cost computing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added new cost calculation for SCRIPT_PROJECT
Signed-off-by: Songkan Tang <[email protected]>
for (int i = 0; i < projExprs.size(); i++) { | ||
final RexNode projExpr = projExprs.get(i); | ||
if (isPushableNewDerived(projExpr, derivedIndexSet, scan)) { | ||
final String uniquifiedAlias = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With this method, ... | eval a = a + 1
will produce a new derived field a1
, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it will produce a0
.
Interestingly, I find although it may physically generate a EnumerableScan with rowType like [age0 BIGINT], the EnumerableScan's schema is still [age BIGINT]. I think this logicRexUtil.isIdentity(newExprs, newScan.getRowType())
decides as long as the enumerator result is correct(RexInput is the same), it doesn't care what's the actual scan's rowType.
Added an IT called testFieldsWithNameConflictDerivedFieldPushdown
to ensure query correctness.
if (isSequential && !Objects.equals(integer, current++)) { | ||
public boolean add(SelectedColumn item) { | ||
if (isSequential | ||
&& item.getKind() == Kind.PHYSICAL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should Kind.DERIVED_EXISTING
be included here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
if (!seenOldIndex.get(oldIdx)) { | ||
seenOldIndex.set(oldIdx); | ||
if (derivedIndexSet.get(oldIdx)) { | ||
selected.add(SelectedColumn.derivedExisting(oldIdx)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the function of distinguishing DERIVED_EXISTING from PHYSICAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking if the reason why we need DERIVED_EXISTING is for the case of SCAN-PROJECT-PROJECT
. Or shall we only handle the case of SCAN-PROJECT
which should be produced by project merge rule while prevent project push down if there is already SCRIPT_PROJECT pushed in scan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As discussed offline, sometimes the plan can't always merge projects by ProjectMergeRule. For example, project(a) - sort a + b - project(a, a + b) - scan
. a + b
expression is a kind of complex expression that requires a immediate followup sort. In this case, it would be more straightforward to allow multiple project pushdown, although it seems inner logic is more complex.
Also, allowing multiple project pushdown brings more flexibility. If we don't see this requirement in future, we can disable it.
final int pos = projIdxToNewPos.get(i); | ||
newExprs.add(call.builder().getRexBuilder().makeInputRef(projExprs.get(i).getType(), pos)); | ||
} else { | ||
newExprs.add(RexUtil.apply(oldIdxToNewPos, projExprs.get(i))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RexUtil.apply will create a new shuttles when calling. It seems expensive to create that every time for each projExpr
although the shuttles should be the same one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about creating a new extended RexPermuteInputsShuttle and AbstractMapping by ourselves? That shuttle should be able to handle all kinds of SelectedItems. Then the mapping construction process and expression transformation process could be simplified. I used to do a similar work in above draft PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
// Script filter of derived field input is not supported | ||
.and( | ||
Predicate.not( | ||
OpenSearchIndexScanRule::isScriptProjectPushed)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What will happen if push down agg/sort on derived field? Could you please add a test for that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I banned the agg pushdown and complex sort expression pushdown on derived field. Agg pushdown will match agg - project - scan and optimize it with our own rule. Add a test case for testScriptSort
. Existing agg test should already take care of agg pushdown.
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Signed-off-by: Songkan Tang <[email protected]>
Could you review this PR with another look? @qianheng-aws @LantaoJin @yuancu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
"source=opensearch-sql_test_index_account" | ||
+ "| eval age = age + 2" | ||
+ "| fields age, lastname")); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems in the plan the new age field becomes age0
. I'm curious where is it set back to name age
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems the column names in our final results are derived from the original plan(i.e. logical plan), so the final plan(i.e. physical plan) is allowed to produce a different row type as long as the types can match.
@LantaoJin Please take another look at this PR. |
Description
Support project expression pushdown with derived field script.
This is the first phase of script project pushdown with partial script project supported. Follow-up like supporting partial filter pushdown after script project pushdown will be implemented later.
DerivedFieldScript Pros and Cons:
Pros:
Cons:
Script field Pros and Cons:
Pros:
Cons:
Benchmark Results After Optimization
CalcitePPLBig5IT:
Summary:
asc_sort_timestamp: 8 ms
asc_sort_timestamp_can_match_shortcut: 13 ms
asc_sort_timestamp_no_can_match_shortcut: 12 ms
asc_sort_with_after_timestamp: 9 ms
bin_bins: 7 ms
bin_span_log: 8 ms
bin_span_time: 16 ms
composite_date_histogram_daily: 23 ms
composite_terms: 52 ms
composite_terms_keyword: 27 ms
date_histogram_hourly_agg: 13 ms
date_histogram_minute_agg: 22 ms
default: 9 ms
desc_sort_timestamp: 10 ms
desc_sort_timestamp_can_match_shortcut: 16 ms
desc_sort_timestamp_no_can_match_shortcut: 24 ms
desc_sort_with_after_timestamp: 9 ms
keyword_in_range: 23 ms
keyword_terms: 17 ms
keyword_terms_low_cardinality: 13 ms
multi_terms_keyword: 25 ms
query_string_on_message: 14 ms
query_string_on_message_filtered: 34 ms
query_string_on_message_filtered_sorted_num: 39 ms
range: 12 ms
range_auto_date_histo: 37 ms
range_auto_date_histo_with_metrics: 72 ms
range_field_conjunction_big_range_big_term_query: 10 ms
range_field_conjunction_small_range_big_term_query: 8 ms
range_field_conjunction_small_range_small_term_query: 15 ms
range_field_disjunction_big_range_small_term_query: 10 ms
range_numeric: 11 ms
range_with_asc_sort: 17 ms
range_with_desc_sort: 15 ms
scroll: 7 ms
sort_keyword_can_match_shortcut: 14 ms
sort_keyword_no_can_match_shortcut: 14 ms
sort_numeric_asc: 14 ms
sort_numeric_asc_with_match: 17 ms
sort_numeric_desc: 22 ms
sort_numeric_desc_with_match: 15 ms
term: 17 ms
terms_significant_1: 19 ms
terms_significant_2: 16 ms
Total 44 queries succeed. Average duration: 18 ms
CalcitePPLClickBenchIT:
Summary:
q1: 21 ms
q10: 59 ms
q11: 26 ms
q12: 31 ms
q13: 15 ms
q14: 21 ms
q15: 18 ms
q16: 13 ms
q17: 14 ms
q18: 10 ms
q19: 19 ms
q2: 18 ms
q20: 8 ms
q21: 11 ms
q22: 19 ms
q23: 24 ms
q24: 17 ms
q25: 15 ms
q26: 13 ms
q27: 15 ms
q28: 32 ms
q3: 21 ms
q31: 29 ms
q32: 31 ms
q33: 24 ms
q34: 12 ms
q35: 15 ms
q36: 18 ms
q37: 24 ms
q38: 22 ms
q39: 23 ms
q4: 16 ms
q40: 26 ms
q41: 27 ms
q42: 23 ms
q43: 31 ms
q5: 12 ms
q6: 12 ms
q7: 17 ms
q8: 21 ms
q9: 23 ms
Total 41 queries succeed. Average duration: 20 ms
Related Issues
Resolves #3387
Check List
--signoff
or-s
.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.