[SPARK-52807][SDP] Proto changes to support analysis inside Declarative Pipelines query functions #51502
What changes were proposed in this pull request?
Introduces a mechanism for lazy execution of Declarative Pipelines query functions. A query function is something like `mv1` in this example:
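A minimal sketch of such a query function, assuming the Python decorator API (`materialized_view` is named in this PR; the import path, table name, and filter are illustrative):

```python
from pyspark import pipelines as dp  # assumed import path

@dp.materialized_view
def mv1():
    # The decorated function is the "query function": it returns the
    # DataFrame that defines the mv1 materialized view. `spark` here is
    # assumed to be the session available in pipeline definition files.
    return spark.read.table("raw_events").filter("id IS NOT NULL")
```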
Currently, query functions are always executed eagerly, i.e. the implementation of the `materialized_view` decorator immediately invokes the function that it decorates and then registers the resulting DataFrame with the server.

This PR introduces Spark Connect proto changes that enable executing query functions later on, initiated by the server during graph resolution. After all datasets and flows have been registered with the server, the server can tell the client to execute the query functions for flows that haven't yet been executed successfully. The client initiates an RPC with the server, and the server streams back responses that indicate when it's time to execute the query function for one of its flows. Relevant changes:
- `QueryFunctionFailure` message
- `QueryFunctionResult` message
- Replaced the `relation` field in `DefineFlow` with a `query_function_result` field
- `DefineFlowQueryFunctionResult` message
- `GetQueryFunctionExecutionSignalStream` message
- `PipelineQueryFunctionExecutionSignal` message
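To make the intended message flow concrete, here is a rough plain-Python sketch of the client-side loop. Only the message names above come from this PR; the dataclass fields and helper names are stand-ins for whatever the real proto schema and client code end up being:

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, Optional

# Plain-Python stand-ins for the new proto messages. Field names are
# illustrative guesses, not the actual proto schema.

@dataclass
class QueryFunctionFailure:
    error_message: str

@dataclass
class QueryFunctionResult:
    relation: Optional[str] = None  # stand-in for a Spark Connect Relation
    failure: Optional[QueryFunctionFailure] = None

@dataclass
class PipelineQueryFunctionExecutionSignal:
    flow_name: str

# Client-side registry of flows whose query functions haven't run yet.
pending_query_functions: Dict[str, Callable[[], str]] = {}

def define_flow(flow_name: str, query_fn: Callable[[], str]) -> None:
    """Register a flow lazily: in the real protocol this would send a
    DefineFlow command with its query_function_result field left unset."""
    pending_query_functions[flow_name] = query_fn

def handle_signals(
    signals: Iterator[PipelineQueryFunctionExecutionSignal],
) -> Dict[str, QueryFunctionResult]:
    """Consume the stream opened by GetQueryFunctionExecutionSignalStream.

    Each signal tells the client to execute one flow's query function; in
    the real protocol each result would be sent back to the server in a
    DefineFlowQueryFunctionResult command.
    """
    results: Dict[str, QueryFunctionResult] = {}
    for signal in signals:
        query_fn = pending_query_functions[signal.flow_name]
        try:
            results[signal.flow_name] = QueryFunctionResult(relation=query_fn())
        except Exception as exc:
            results[signal.flow_name] = QueryFunctionResult(
                failure=QueryFunctionFailure(error_message=str(exc)))
    return results

# Toy run: the server signals mv1 before mv2, so mv1 "exists" by the time
# mv2's query function runs.
define_flow("mv1", lambda: "unresolved-plan-for-mv1")
define_flow("mv2", lambda: "unresolved-plan-for-mv2")
print(handle_signals(iter([
    PipelineQueryFunctionExecutionSignal("mv1"),
    PipelineQueryFunctionExecutionSignal("mv2"),
])))
```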
This PR also introduces Spark Connect proto changes that enable carrying out plan analysis "relative to" a dataflow graph. "Relative to" means that, when determining the existence and schema of a table that's defined in the graph, the definition from the graph is used instead of the definition in the catalog. This will be used in cases where the code inside a query function triggers analysis. Relevant changes:
- `FlowAnalysisContext` message

Why are the changes needed?
There are some situations where we can't resolve the relation immediately at the time we're registering a flow.
E.g. consider this situation, where a pipeline is defined across two files: file 1 defines `mv1`, and file 2 defines `mv2` on top of it, as sketched below.
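A sketch of the two files, with illustrative table and column names (the `dp` import is the same assumption as in the earlier sketch):

```python
# file 1: defines mv1
@dp.materialized_view
def mv1():
    return spark.read.table("raw_events")

# file 2: defines mv2, which reads mv1 and immediately groups it
@dp.materialized_view
def mv2():
    # groupBy can trigger an AnalyzePlan request as soon as this body
    # runs, which fails if mv1 hasn't been created yet.
    return spark.read.table("mv1").groupBy("event_type").count()
```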
Unlike some other transformations, which get analyzed lazily, `groupBy` can trigger an `AnalyzePlan` Spark Connect request immediately. If the query function for `mv2` gets executed before `mv1`, then it will hit an error, because `mv1` doesn't exist yet.

`groupBy` isn't the only example here. Other examples of these kinds of situations:
- When `spark.sql` is used.

Does this PR introduce any user-facing change?
No
How was this patch tested?
Was this patch authored or co-authored using generative AI tooling?