Remove early phase failure in batched #136889
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Hi @benchaplin, I've created a changelog YAML for you.
this.results = in.readArray(i -> i.readBoolean() ? new QuerySearchResult(i) : i.readException(), Object[]::new);
this.mergeResult = QueryPhaseResultConsumer.MergeResult.readFrom(in);
this.topDocsStats = SearchPhaseController.TopDocsStats.readFrom(in);
boolean hasReductionFailure = in.readBoolean();
Since we're changing the shape of this message, do we need to create a new transport version or is that taken care of for us?
Yes I believe I do, once I learn how 😂
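For context on why a version gate matters here: if one node writes the extra boolean and an older node does not expect it, the two sides read the stream out of step. The sketch below shows the general pattern with plain `java.io` streams. It is not Elasticsearch's transport version API; `VERSION_WITH_REDUCTION_FAILURE`, the string payload, and the string failure message are all hypothetical stand-ins, and what actually follows `hasReductionFailure` on the wire in this PR is an assumption.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

final class VersionGatedFlagSketch {
    // Hypothetical version number at which the new field was added to the message.
    static final int VERSION_WITH_REDUCTION_FAILURE = 2;

    // Serialize a message for a peer that speaks `peerVersion`.
    static byte[] write(int peerVersion, String payload, String reductionFailure) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(payload);
            // Only emit the new field when the peer is new enough to read it.
            if (peerVersion >= VERSION_WITH_REDUCTION_FAILURE) {
                out.writeBoolean(reductionFailure != null);
                if (reductionFailure != null) {
                    out.writeUTF(reductionFailure);
                }
            }
        }
        return bytes.toByteArray();
    }

    // Deserialize a message sent by a peer that speaks `peerVersion`.
    // Returns {payload, failureOrNull}.
    static String[] read(int peerVersion, byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String payload = in.readUTF();
            String failure = null;
            // Mirror the writer: read the new field only for new-enough peers.
            if (peerVersion >= VERSION_WITH_REDUCTION_FAILURE) {
                boolean hasReductionFailure = in.readBoolean();
                if (hasReductionFailure) {
                    failure = in.readUTF();
                }
            }
            return new String[] { payload, failure };
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] newWire = write(2, "results", "reduce failed");
        String[] decoded = read(2, newWire);
        System.out.println(decoded[0] + " / " + decoded[1]);

        // An old peer never sees the extra field, so its stream stays aligned.
        byte[] oldWire = write(1, "results", "reduce failed");
        System.out.println(read(1, oldWire)[1]);
    }
}
```

The reader and writer must apply the same version check, otherwise a mixed-version cluster would misparse the response during a rolling upgrade.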
LGTM
Resolves #134151, #130821.
Background
A bug was introduced by #121885 in the following code, which handles batched query exceptions caused by a batched partial reduction failure:
elasticsearch/server/src/main/java/org/elasticsearch/action/search/SearchQueryThenFetchAsyncAction.java
Lines 525 to 544 in bd35649
Raising a phase failure in this way leads to a couple of issues:
Solution
Problem 1 could be resolved with a simple flag, as proposed in #131085. Problem 2 could be resolved with some careful use of the same flag to clean contexts upon receiving stale query results. However, in the interest of stability, I propose a solution that more closely resembles how a reduction failure is handled by a non-batched query phase. In the non-batched path, a reduction failure is held in the QueryPhaseResultConsumer until shard fanout is complete. Only later, during final reduction at the beginning of the fetch phase, do we fail the search.
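To illustrate the deferred-failure idea, here is a minimal, self-contained sketch. It is not the actual `QueryPhaseResultConsumer`; the class `ResultConsumer`, its methods, and the integer "shard results" are hypothetical simplifications. A failure during partial reduction is recorded rather than raised, later shard results are still accepted, and the failure only surfaces when the final reduction runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

final class DeferredReductionFailureSketch {

    // Hypothetical stand-in for a result consumer; not the Elasticsearch class.
    static final class ResultConsumer {
        private final AtomicReference<Exception> reductionFailure = new AtomicReference<>();
        private final List<Integer> buffered = new ArrayList<>();
        private final int batchSize;
        private int runningTotal = 0;

        ResultConsumer(int batchSize) {
            this.batchSize = batchSize;
        }

        // Called once per shard result. Never throws: a partial reduction
        // failure is stored so the fanout can run to completion.
        synchronized void consume(int shardResult) {
            if (reductionFailure.get() != null) {
                return; // already failed; accept and drop the result
            }
            buffered.add(shardResult);
            if (buffered.size() >= batchSize) {
                try {
                    partialReduce();
                } catch (Exception e) {
                    reductionFailure.compareAndSet(null, e);
                    buffered.clear();
                }
            }
        }

        // Toy reduction: sums values, and treats a negative value as a failure.
        private void partialReduce() {
            for (int value : buffered) {
                if (value < 0) {
                    throw new IllegalStateException("partial reduction failed on " + value);
                }
                runningTotal += value;
            }
            buffered.clear();
        }

        // Called after all shards have responded. This is the only place
        // where a stored reduction failure is surfaced.
        synchronized int finalReduce() throws Exception {
            Exception failure = reductionFailure.get();
            if (failure != null) {
                throw failure;
            }
            partialReduce();
            return runningTotal;
        }
    }

    // Feeds every shard result to the consumer, then performs the final reduction.
    static String run(int[] shardResults, int batchSize) {
        ResultConsumer consumer = new ResultConsumer(batchSize);
        int received = 0;
        for (int result : shardResults) {
            consumer.consume(result);
            received++;
        }
        try {
            return "ok:" + consumer.finalReduce();
        } catch (Exception e) {
            return "failed after " + received + " shards: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(new int[] { 1, 2, 3, 4 }, 2));
        System.out.println(run(new int[] { 1, -1, 3, 4 }, 2));
    }
}
```

The trade-off in this sketch matches the one described above: the search does work it could have skipped after the first failure, but every shard response is handled through the normal path, so there are no stale results to reconcile.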
Fast failure + proper task cancellation are worthy goals for the future. I am tracking these as follow-up improvements for after the release of batched query execution.
This PR: