From d0d78cb9851e814b1b6a0585f8d89fb7c649a15e Mon Sep 17 00:00:00 2001 From: Nicholas Chammas Date: Fri, 27 Jun 2025 11:32:21 -0400 Subject: [PATCH 1/2] reorg spark connect guide --- docs/_data/menu-spark-connect.yaml | 4 + .../nav-left-wrapper-spark-connect.html | 6 + docs/_layouts/global.html | 9 +- docs/spark-connect-overview.md | 284 +-------------- ...onnect.md => spark-connect-server-libs.md} | 134 ++------ docs/spark-connect-setup.md | 322 ++++++++++++++++++ docs/sql-performance-tuning.md | 4 +- 7 files changed, 381 insertions(+), 382 deletions(-) create mode 100644 docs/_data/menu-spark-connect.yaml create mode 100644 docs/_includes/nav-left-wrapper-spark-connect.html rename docs/{app-dev-spark-connect.md => spark-connect-server-libs.md} (60%) create mode 100644 docs/spark-connect-setup.md diff --git a/docs/_data/menu-spark-connect.yaml b/docs/_data/menu-spark-connect.yaml new file mode 100644 index 0000000000000..7c07e13a4cd5b --- /dev/null +++ b/docs/_data/menu-spark-connect.yaml @@ -0,0 +1,4 @@ +- text: Setting up Spark Connect + url: spark-connect-setup.html +- text: Extending Spark with Spark Server Libraries + url: spark-connect-server-libs.html diff --git a/docs/_includes/nav-left-wrapper-spark-connect.html b/docs/_includes/nav-left-wrapper-spark-connect.html new file mode 100644 index 0000000000000..bea84110da4ce --- /dev/null +++ b/docs/_includes/nav-left-wrapper-spark-connect.html @@ -0,0 +1,6 @@ +
+
+

Spark Connect Guide

+ {% include nav-left.html nav=include.nav-spark-connect %} +
+
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html index 4d6ddfc2c74a1..00957319c07bf 100755 --- a/docs/_layouts/global.html +++ b/docs/_layouts/global.html @@ -75,6 +75,7 @@ RDDs, Accumulators, Broadcasts Vars SQL, DataFrames, and Datasets Structured Streaming + Spark Connect Spark Streaming (DStreams) MLlib (Machine Learning) GraphX (Graph Processing) @@ -155,15 +156,17 @@

Apache Spark - A Unified engine for large-scale da {% endif %}
- {% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "/streaming/" or page.url contains "migration-guide.html" %} + {% if page.url contains "/ml" or page.url contains "/sql" or page.url contains "/streaming/" or page.url contains "migration-guide.html" or page.url contains "/spark-connect" %} {% if page.url contains "migration-guide.html" %} {% include nav-left-wrapper-migration.html nav-migration=site.data.menu-migration %} {% elsif page.url contains "/ml" %} {% include nav-left-wrapper-ml.html nav-mllib=site.data.menu-mllib nav-ml=site.data.menu-ml %} {% elsif page.url contains "/streaming/" %} {% include nav-left-wrapper-streaming.html nav-streaming=site.data.menu-streaming %} - {% else %} + {% elsif page.url contains "/sql" %} {% include nav-left-wrapper-sql.html nav-sql=site.data.menu-sql %} + {% elsif page.url contains "/spark-connect" %} + {% include nav-left-wrapper-spark-connect.html nav-spark-connect=site.data.menu-spark-connect %} {% endif %} @@ -173,9 +176,7 @@

{{ page.displayTitle }}

{% else %}

{{ page.title }}

{% endif %} - {{ content }} -
{% else %}
diff --git a/docs/spark-connect-overview.md b/docs/spark-connect-overview.md index f01ebf1b54f76..534161a1728be 100644 --- a/docs/spark-connect-overview.md +++ b/docs/spark-connect-overview.md @@ -17,7 +17,6 @@ license: | See the License for the specific language governing permissions and limitations under the License. --- -**Building client-side Spark applications** In Apache Spark 3.4, Spark Connect introduced a decoupled client-server architecture that allows remote connectivity to Spark clusters using the @@ -28,6 +27,9 @@ in IDEs, Notebooks and programming languages. To get started, see [Quickstart: Spark Connect](api/python/getting_started/quickstart_connect.html). +* This will become a table of contents (this text will be scraped). +{:toc} +

Spark Connect API Diagram

@@ -91,287 +93,13 @@ of applications, for example to benefit from performance improvements and securi This means applications can be forward-compatible, as long as the server-side RPC definitions are designed to be backwards compatible. +**Remote connectivity**: The decoupled architecture allows remote connectivity to Spark beyond SQL +and JDBC: any application can now interactively use Spark “as a service”. + **Debuggability and observability**: Spark Connect enables interactive debugging during development directly from your favorite IDE. Similarly, applications can be monitored using the application's framework native metrics and logging libraries. -# How to use Spark Connect - -Spark Connect is available and supports PySpark and Scala -applications. We will walk through how to run an Apache Spark server with Spark -Connect and connect to it from a client application using the Spark Connect client -library. - -## Download and start Spark server with Spark Connect - -First, download Spark from the -[Download Apache Spark](https://spark.apache.org/downloads.html) page. Choose the -latest release in the release drop down at the top of the page. Then choose your package type, typically -“Pre-built for Apache Hadoop 3.3 and later”, and click the link to download. - -Now extract the Spark package you just downloaded on your computer, for example: - -{% highlight bash %} -tar -xvf spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz -{% endhighlight %} - -In a terminal window, go to the `spark` folder in the location where you extracted -Spark before and run the `start-connect-server.sh` script to start Spark server with -Spark Connect, like in this example: - -{% highlight bash %} -./sbin/start-connect-server.sh -{% endhighlight %} - -Make sure to use the same version of the package as the Spark version you -downloaded previously. In this example, Spark {{site.SPARK_VERSION_SHORT}} with Scala 2.13. - -Now Spark server is running and ready to accept Spark Connect sessions from client -applications. In the next section we will walk through how to use Spark Connect -when writing client applications. - -## Use Spark Connect for interactive analysis -
- -
-When creating a Spark session, you can specify that you want to use Spark Connect -and there are a few ways to do that outlined as follows. - -If you do not use one of the mechanisms outlined here, your Spark session will -work just like before, without leveraging Spark Connect. - -### Set SPARK_REMOTE environment variable - -If you set the `SPARK_REMOTE` environment variable on the client machine where your -Spark client application is running and create a new Spark Session as in the following -example, the session will be a Spark Connect session. With this approach, there is no -code change needed to start using Spark Connect. - -In a terminal window, set the `SPARK_REMOTE` environment variable to point to the -local Spark server you started previously on your computer: - -{% highlight bash %} -export SPARK_REMOTE="sc://localhost" -{% endhighlight %} - -And start the Spark shell as usual: - -{% highlight bash %} -./bin/pyspark -{% endhighlight %} - -The PySpark shell is now connected to Spark using Spark Connect as indicated in the welcome message: - -{% highlight python %} -Client connected to the Spark Connect server at localhost -{% endhighlight %} - -### Specify Spark Connect when creating Spark session - -You can also specify that you want to use Spark Connect explicitly when you -create a Spark session. - -For example, you can launch the PySpark shell with Spark Connect as -illustrated here. - -To launch the PySpark shell with Spark Connect, simply include the `remote` -parameter and specify the location of your Spark server. We are using `localhost` -in this example to connect to the local Spark server we started previously: - -{% highlight bash %} -./bin/pyspark --remote "sc://localhost" -{% endhighlight %} - -And you will notice that the PySpark shell welcome message tells you that -you have connected to Spark using Spark Connect: - -{% highlight python %} -Client connected to the Spark Connect server at localhost -{% endhighlight %} - -You can also check the Spark session type. If it includes `.connect.` you -are using Spark Connect as shown in this example: - -{% highlight python %} -SparkSession available as 'spark'. ->>> type(spark) - -{% endhighlight %} - -Now you can run PySpark code in the shell to see Spark Connect in action: - -{% highlight python %} ->>> columns = ["id", "name"] ->>> data = [(1,"Sarah"), (2,"Maria")] ->>> df = spark.createDataFrame(data).toDF(*columns) ->>> df.show() -+---+-----+ -| id| name| -+---+-----+ -| 1|Sarah| -| 2|Maria| -+---+-----+ -{% endhighlight %} - -
- -
-For the Scala shell, we use an Ammonite-based REPL. Otherwise, very similar with PySpark shell. - -{% highlight bash %} -./bin/spark-shell --remote "sc://localhost" -{% endhighlight %} - -A greeting message will appear when the REPL successfully initializes: -{% highlight bash %} -Welcome to - ____ __ - / __/__ ___ _____/ /__ - _\ \/ _ \/ _ `/ __/ '_/ - /___/ .__/\_,_/_/ /_/\_\ version 4.1.0-SNAPSHOT - /_/ - -Type in expressions to have them evaluated. -Spark session available as 'spark'. -{% endhighlight %} - -By default, the REPL will attempt to connect to a local Spark Server. -Run the following Scala code in the shell to see Spark Connect in action: - -{% highlight scala %} -@ spark.range(10).count -res0: Long = 10L -{% endhighlight %} - -### Configure client-server connection - -By default, the REPL will attempt to connect to a local Spark Server on port 15002. -The connection, however, may be configured in several ways as described in this configuration -[reference](https://github.com/apache/spark/blob/master/sql/connect/docs/client-connection-string.md). - -#### Set SPARK_REMOTE environment variable - -The SPARK_REMOTE environment variable can be set on the client machine to customize the client-server -connection that is initialized at REPL startup. - -{% highlight bash %} -export SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG" -./bin/spark-shell -{% endhighlight %} - -or - -{% highlight bash %} -SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG" spark-connect-repl -{% endhighlight %} - -#### Configure programmatically with a connection string - -The connection may also be programmatically created using _SparkSession#builder_ as in this example: - -{% highlight scala %} -@ import org.apache.spark.sql.SparkSession -@ val spark = SparkSession.builder.remote("sc://localhost:443/;token=ABCDEFG").getOrCreate() -{% endhighlight %} - -
-
- -## Use Spark Connect in standalone applications - -
- - -
- -First, install PySpark with `pip install pyspark[connect]=={{site.SPARK_VERSION_SHORT}}` or if building a packaged PySpark application/library, -add it your setup.py file as: -{% highlight python %} -install_requires=[ -'pyspark[connect]=={{site.SPARK_VERSION_SHORT}}' -] -{% endhighlight %} - -When writing your own code, include the `remote` function with a reference to -your Spark server when you create a Spark session, as in this example: - -{% highlight python %} -from pyspark.sql import SparkSession -spark = SparkSession.builder.remote("sc://localhost").getOrCreate() -{% endhighlight %} - - -For illustration purposes, we’ll create a simple Spark Connect application, SimpleApp.py: -{% highlight python %} -"""SimpleApp.py""" -from pyspark.sql import SparkSession - -logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system -spark = SparkSession.builder.remote("sc://localhost").appName("SimpleApp").getOrCreate() -logData = spark.read.text(logFile).cache() - -numAs = logData.filter(logData.value.contains('a')).count() -numBs = logData.filter(logData.value.contains('b')).count() - -print("Lines with a: %i, lines with b: %i" % (numAs, numBs)) - -spark.stop() -{% endhighlight %} - -This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file. -Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed. - -We can run this application with the regular Python interpreter as follows: -{% highlight python %} -# Use the Python interpreter to run your application -$ python SimpleApp.py -... -Lines with a: 72, lines with b: 39 -{% endhighlight %} -
- - -
-To use Spark Connect as part of a Scala application/project, we first need to include the right dependencies. -Using the `sbt` build system as an example, we add the following dependencies to the `build.sbt` file: -{% highlight sbt %} -libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "{{site.SPARK_VERSION_SHORT}}" -{% endhighlight %} - -When writing your own code, include the `remote` function with a reference to -your Spark server when you create a Spark session, as in this example: - -{% highlight scala %} -import org.apache.spark.sql.SparkSession -val spark = SparkSession.builder().remote("sc://localhost").getOrCreate() -{% endhighlight %} - - -**Note**: Operations that reference User Defined Code such as UDFs, filter, map, etc require a -[ClassFinder](https://github.com/apache/spark/blob/master/sql/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ClassFinder.scala) -to be registered to pickup and upload any required classfiles. Also, any JAR dependencies must be uploaded to the server using `SparkSession#AddArtifact`. - -Example: -{% highlight scala %} -import org.apache.spark.sql.connect.client.REPLClassDirMonitor -// Register a ClassFinder to monitor and upload the classfiles from the build output. -val classFinder = new REPLClassDirMonitor() -spark.registerClassFinder(classFinder) - -// Upload JAR dependencies -spark.addArtifact() -{% endhighlight %} -Here, `ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR` is the output directory where the build system writes classfiles into -and `ABSOLUTE_PATH_JAR_DEP` is the location of the JAR on the local file system. - -The `REPLClassDirMonitor` is a provided implementation of `ClassFinder` that monitors a specific directory but -one may implement their own class extending `ClassFinder` for customized search and monitoring. - -
-
- -For more information on application development with Spark Connect as well as extending Spark Connect -with custom functionality, see [Application Development with Spark Connect](app-dev-spark-connect.html). # Client application authentication While Spark Connect does not have built-in authentication, it is designed to diff --git a/docs/app-dev-spark-connect.md b/docs/spark-connect-server-libs.md similarity index 60% rename from docs/app-dev-spark-connect.md rename to docs/spark-connect-server-libs.md index e61aa05d3f11e..583f80c3701c7 100644 --- a/docs/app-dev-spark-connect.md +++ b/docs/spark-connect-server-libs.md @@ -1,6 +1,6 @@ --- layout: global -title: Application Development with Spark Connect +title: Extending Spark with Spark Server Libraries license: | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with @@ -17,18 +17,6 @@ license: | See the License for the specific language governing permissions and limitations under the License. --- -**Spark Connect Overview** - -In Apache Spark 3.4, Spark Connect introduced a decoupled client-server -architecture that allows remote connectivity to Spark clusters using the -DataFrame API and unresolved logical plans as the protocol. The separation -between client and server allows Spark and its open ecosystem to be -leveraged from everywhere. It can be embedded in modern data applications, -in IDEs, Notebooks and programming languages. - -To learn more about Spark Connect, see [Spark Connect Overview](spark-connect-overview.html). - -# Redefining Spark Applications using Spark Connect With its decoupled client-server architecture, Spark Connect simplifies how Spark Applications are developed. @@ -47,72 +35,20 @@ Client applications connect to Spark using the Spark Connect API, which is essen DataFrame API and fully declarative.

- Extending Spark
-Connect Diagram + Extending Spark Connect Diagram +

Figure 1: Architecture

+ Spark Server Libraries extend Spark. They typically provide additional server-side logic integrated with Spark, which is exposed to client applications as part of the Spark Connect API, using Spark Connect extension points. For example, the _Spark Server Library_ consists of custom service-side logic (as indicated by the blue box labeled _Custom Library Plugin_), which is exposed to the client via the blue box as part of the Spark Connect API. The client uses this API, e.g., alongside PySpark or the Spark Scala client, making it easy for Spark client applications to work -with the custom logic/library. - -## Spark API Mode: Spark Client and Spark Classic - -Spark provides the API mode, `spark.api.mode` configuration, enabling Spark Classic applications -to seamlessly switch to Spark Connect. Depending on the value of `spark.api.mode`, the application -can run in either Spark Classic or Spark Connect mode. Here is an example: - -{% highlight python %} -from pyspark.sql import SparkSession - -SparkSession.builder.config("spark.api.mode", "connect").master("...").getOrCreate() -{% endhighlight %} - -You can also apply this configuration to both Scala and PySpark applications when submitting yours: - -{% highlight bash %} -spark-submit --master "..." --conf spark.api.mode=connect -{% endhighlight %} - -Additionally, Spark Connect offers convenient options for local testing. By setting `spark.remote` -to `local[...]` or `local-cluster[...]`, you can start a local Spark Connect server and access a Spark -Connect session. - -This is similar to using `--conf spark.api.mode=connect` with `--master ...`. However, note that -`spark.remote` and `--remote` are limited to `local*` values, while `--conf spark.api.mode=connect` -with `--master ...` supports additional cluster URLs, such as spark://, for broader compatibility with -Spark Classic. - -## Spark Client Applications - -Spark Client Applications are the _regular Spark applications_ that Spark users develop today, e.g., -ETL pipelines, data preparation, or model training or inference. These are typically built using -Sparks declarative DataFrame and DataSet APIs. With Spark Connect, the core behaviour remains the -same, but there are a few differences: -* Lower-level, non-declarative APIs (RDDs) can no longer be directly used from Spark Client -applications. Alternatives for missing RDD functionality are provided as part of the higher-level -DataFrame API. -* Client applications no longer have direct access to the Spark driver JVM; they are fully -separated from the server. - -Client applications based on Spark Connect can be submitted in the same way as any previous job. -In addition, Spark Client Applications based on Spark Connect have several benefits compared to -classic Spark applications using earlier Spark versions (3.4 and below): -* _Upgradability_: Upgrading to new Spark Server versions is seamless, as the Spark Connect API -abstracts any changes/improvements on the server side. Client- and server APIs are cleanly -separated. -* _Simplicity_: The number of APIs exposed to the user is reduced from 3 to 2. The Spark Connect API -is fully declarative and consequently easy to learn for new users familiar with SQL. -* _Stability_: When using Spark Connect, the client applications no longer run on the Spark driver -and, therefore don’t cause and are not affected by any instability on the server. 
-* _Remote connectivity_: The decoupled architecture allows remote connectivity to Spark beyond SQL -and JDBC: any application can now interactively use Spark “as a service”. -* _Backwards compatibility_: The Spark Connect API is code-compatible with earlier Spark versions, -except for the usage of RDDs, for which a list of alternative APIs is provided in Spark Connect. - -## Spark Server Libraries +with the custom logic/library. Until Spark 3.4, extensions to Spark (e.g., [Spark ML](ml-guide#:~:text=What%20is%20%E2%80%9CSpark%20ML%E2%80%9D%3F,to%20emphasize%20the%20pipeline%20concept.) or [Spark-NLP](https://github.com/JohnSnowLabs/spark-nlp)) were built and deployed like Spark @@ -122,12 +58,10 @@ exposed to a client, which differs from existing extension points in Spark such [SparkSession extensions](api/java/org/apache/spark/sql/SparkSessionExtensions.html) or [Spark Plugins](api/java/org/apache/spark/api/plugin/SparkPlugin.html). -### Getting Started: Extending Spark with Spark Server Libraries +* This will become a table of contents (this text will be scraped). +{:toc} -Spark Connect is available and supports PySpark and Scala -applications. We will walk through how to run an Apache Spark server with Spark -Connect and connect to it from a client application using the Spark Connect client -library. +# Extending Spark with Spark Server Libraries A Spark Server Library consists of the following components, illustrated in Fig. 2: @@ -137,16 +71,19 @@ A Spark Server Library consists of the following components, illustrated in Fig. 4. The client package that exposes the Spark Server Library application logic to the Spark Client Application, alongside PySpark or the Scala Spark Client.

- Extending Spark
-Connect Diagram - Labelled Steps -

+ Extending Spark Connect Diagram - Labelled Steps +
Figure 2: Labelled Architecture
+

-#### (1) Spark Connect Protocol Extension +# Spark Connect Protocol Extension To extend Spark with a new Spark Server Library, developers can extend the three main operation types in the Spark Connect protocol: _Relation_, _Expression_, and _Command_. -{% highlight protobuf %} +```protobuf message Relation { oneof rel_type { Read read = 1; @@ -169,25 +106,26 @@ message Command { // ... google.protobuf.Any extension = 999; } -} -{% endhighlight %} +} +``` Their extension fields allow serializing arbitrary protobuf messages as part of the Spark Connect protocol. These messages represent the parameters or state of the extension implementation. To build a custom expression type, the developer first defines the custom protobuf definition of the expression. -{% highlight protobuf %} +```protobuf message ExamplePluginExpression { Expression child = 1; string custom_field = 2; } -{% endhighlight %} +``` -#### (2) Spark Connect Plugin implementation with (3) custom application logic +# Spark Connect Plugin implementation with custom application logic -As a next step, the developer implements the _ExpressionPlugin_ class of Spark Connect with custom +As a next step, the developer implements the `ExpressionPlugin` class of Spark Connect with custom application logic based on the input parameters of the protobuf message. -{% highlight protobuf %} + +```scala class ExampleExpressionPlugin extends ExpressionPlugin { override def transform( relation: protobuf.Any, @@ -202,22 +140,22 @@ class ExampleExpressionPlugin extends ExpressionPlugin { exp.getChild), exp.getCustomField)(explicitMetadata = None)) } } -{% endhighlight %} +``` Once the application logic is developed, the code must be packaged as a jar and Spark must be configured to pick up the additional logic. The relevant Spark configuration options are: -* _spark.jars_ which define the location of the Jar file containing the application logic built for +* `spark.jars` which define the location of the Jar file containing the application logic built for the custom expression. -* _spark.connect.extensions.expression.classes_ specifying the full class name +* `spark.connect.extensions.expression.classes` specifying the full class name of each expression extension loaded by Spark. Based on these configuration options, Spark will load the values at startup and make them available for processing. -#### (4) Spark Server Library Client Package +# Spark Server Library Client Package Once the server component is deployed, any client can use it with the right protobuf messages. In the example above, the following message payload sent to the Spark Connect endpoint would be enough to trigger the extension mechanism. -{% highlight json %} +```json { "project": { "input": { @@ -234,14 +172,14 @@ enough to trigger the extension mechanism. } ] } -} -{% endhighlight %} +} +``` To make the example available in Python, the application developer provides a Python library that wraps the new expression and embeds it into PySpark. The easiest way to provide a function for any expression is to take a PySpark column instance as an argument and return a new Column instance with the expression applied. -{% highlight python %} +```python from pyspark.sql.connect.column import Expression import pyspark.sql.connect.proto as proto @@ -267,4 +205,4 @@ def example_expression(col: Column) -> Column: # Using the expression in the Spark Connect client code. 
df = spark.read.table("samples.nyctaxi.trips") df.select(example_expression(df["fare_amount"])).collect() -{% endhighlight %} \ No newline at end of file +``` diff --git a/docs/spark-connect-setup.md b/docs/spark-connect-setup.md new file mode 100644 index 0000000000000..a9ee8949935c1 --- /dev/null +++ b/docs/spark-connect-setup.md @@ -0,0 +1,322 @@ +--- +layout: global +title: Setting up Spark Connect +license: | + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. +--- + +Spark Connect supports PySpark and Scala applications. We will walk through how to run an +Apache Spark server with Spark Connect and connect to it from a client application using the +Spark Connect client library. + +* This will become a table of contents (this text will be scraped). +{:toc} + +## Download and start Spark server with Spark Connect + +First, download Spark from the +[Download Apache Spark](https://spark.apache.org/downloads.html) page. Choose the +latest release in the release drop down at the top of the page. Then choose your package type, typically +“Pre-built for Apache Hadoop 3.3 and later”, and click the link to download. + +Now extract the Spark package you just downloaded on your computer, for example: + +```bash +tar -xvf spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz +``` + +In a terminal window, go to the `spark` folder in the location where you extracted +Spark before and run the `start-connect-server.sh` script to start Spark server with +Spark Connect, like in this example: + +```bash +./sbin/start-connect-server.sh +``` + +Make sure to use the same version of the package as the Spark version you +downloaded previously. In this example, Spark {{site.SPARK_VERSION_SHORT}} with Scala 2.13. + +Now Spark server is running and ready to accept Spark Connect sessions from client +applications. In the next section we will walk through how to use Spark Connect +when writing client applications. + +## Use Spark Connect for interactive analysis +
+ +
+When creating a Spark session, you can specify that you want to use Spark Connect,
+and there are a few ways to do that, outlined below.
+
+If you do not use one of the mechanisms outlined here, your Spark session will
+work just like before, without leveraging Spark Connect.
+
+### Set SPARK_REMOTE environment variable
+
+If you set the `SPARK_REMOTE` environment variable on the client machine where your
+Spark client application is running and create a new Spark session as in the following
+example, the session will be a Spark Connect session. With this approach, there is no
+code change needed to start using Spark Connect.
+
+In a terminal window, set the `SPARK_REMOTE` environment variable to point to the
+local Spark server you started previously on your computer:
+
+```bash
+export SPARK_REMOTE="sc://localhost"
+```
+
+And start the Spark shell as usual:
+
+```bash
+./bin/pyspark
+```
+
+The PySpark shell is now connected to Spark using Spark Connect, as indicated in the welcome message:
+
+```python
+Client connected to the Spark Connect server at localhost
+```
+
+### Specify Spark Connect when creating Spark session
+
+You can also specify that you want to use Spark Connect explicitly when you
+create a Spark session.
+
+For example, to launch the PySpark shell with Spark Connect, include the `remote`
+parameter and specify the location of your Spark server. We are using `localhost`
+in this example to connect to the local Spark server we started previously:
+
+```bash
+./bin/pyspark --remote "sc://localhost"
+```
+
+And you will notice that the PySpark shell welcome message tells you that
+you have connected to Spark using Spark Connect:
+
+```python
+Client connected to the Spark Connect server at localhost
+```
+
+You can also check the Spark session type. If it includes `.connect.`, you
+are using Spark Connect, as shown in this example:
+
+```python
+SparkSession available as 'spark'.
+>>> type(spark)
+<class 'pyspark.sql.connect.session.SparkSession'>
+```
+
+Now you can run PySpark code in the shell to see Spark Connect in action:
+
+```python
+>>> columns = ["id", "name"]
+>>> data = [(1,"Sarah"), (2,"Maria")]
+>>> df = spark.createDataFrame(data).toDF(*columns)
+>>> df.show()
++---+-----+
+| id| name|
++---+-----+
+| 1|Sarah|
+| 2|Maria|
++---+-----+
+```
+
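+When you are done, you can end the session with `spark.stop()`, just as in classic PySpark. With
+Spark Connect this only closes the client session; the server you started with
+`start-connect-server.sh` keeps running and can accept new sessions:
+
+```python
+>>> spark.stop()
+```
+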
+ +
+For the Scala shell, we use an Ammonite-based REPL. Otherwise, it is very similar to the PySpark shell.
+
+```bash
+./bin/spark-shell --remote "sc://localhost"
+```
+
+A greeting message will appear when the REPL successfully initializes:
+
+```bash
+Welcome to
+      ____              __
+     / __/__  ___ _____/ /__
+    _\ \/ _ \/ _ `/ __/ '_/
+   /___/ .__/\_,_/_/ /_/\_\   version 4.1.0-SNAPSHOT
+      /_/
+
+Type in expressions to have them evaluated.
+Spark session available as 'spark'.
+```
+
+By default, the REPL will attempt to connect to a local Spark server.
+Run the following Scala code in the shell to see Spark Connect in action:
+
+```scala
+@ spark.range(10).count
+res0: Long = 10L
+```
+
+### Configure client-server connection
+
+By default, the REPL will attempt to connect to a local Spark server on port 15002.
+The connection, however, may be configured in several ways, as described in the
+connection string [reference](https://github.com/apache/spark/blob/master/sql/connect/docs/client-connection-string.md).
+
+#### Set SPARK_REMOTE environment variable
+
+The `SPARK_REMOTE` environment variable can be set on the client machine to customize the
+client-server connection that is initialized at REPL startup.
+
+```bash
+export SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG"
+./bin/spark-shell
+```
+
+or
+
+```bash
+SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG" spark-connect-repl
+```
+
+#### Configure programmatically with a connection string
+
+The connection may also be created programmatically using `SparkSession#builder`, as in this example:
+
+```scala
+@ import org.apache.spark.sql.SparkSession
+@ val spark = SparkSession.builder.remote("sc://localhost:443/;token=ABCDEFG").getOrCreate()
+```
+
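+For reference, the connection string can carry additional parameters beyond `token`. The parameter
+names below (`use_ssl`, `user_id`) come from the connection string reference linked above; the host,
+port, and values are illustrative placeholders:
+
+```bash
+# Illustrative values only -- see the connection string reference for the full list of parameters.
+export SPARK_REMOTE="sc://myhost.com:443/;use_ssl=true;token=ABCDEFG;user_id=my_user"
+./bin/spark-shell
+```
+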
+
+ +## Use Spark Connect in standalone applications + +
+ +
+
+First, install PySpark with `pip install pyspark[connect]=={{site.SPARK_VERSION_SHORT}}` or, if building a packaged PySpark application/library,
+add it to your `setup.py` file as:
+
+```python
+install_requires=[
+    'pyspark[connect]=={{site.SPARK_VERSION_SHORT}}'
+]
+```
+
+When writing your own code, include the `remote` function with a reference to
+your Spark server when you create a Spark session, as in this example:
+
+```python
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
+```
+
+For illustration purposes, we’ll create a simple Spark Connect application, SimpleApp.py:
+
+```python
+"""SimpleApp.py"""
+from pyspark.sql import SparkSession
+
+logFile = "YOUR_SPARK_HOME/README.md"  # Should be some file on your system
+spark = SparkSession.builder.remote("sc://localhost").appName("SimpleApp").getOrCreate()
+logData = spark.read.text(logFile).cache()
+
+numAs = logData.filter(logData.value.contains('a')).count()
+numBs = logData.filter(logData.value.contains('b')).count()
+
+print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
+
+spark.stop()
+```
+
+This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file.
+Note that you’ll need to replace `YOUR_SPARK_HOME` with the location where Spark is installed.
+
+We can run this application with the regular Python interpreter as follows:
+
+```bash
+# Use the Python interpreter to run your application
+$ python SimpleApp.py
+...
+Lines with a: 72, lines with b: 39
+```
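+
+If you would rather not hard-code the server address, the builder also honors the `SPARK_REMOTE`
+environment variable described earlier, so the same application can be pointed at different servers
+without code changes. A minimal sketch, assuming `SPARK_REMOTE` is set (for example to `sc://localhost`):
+
+```python
+from pyspark.sql import SparkSession
+
+# SPARK_REMOTE (e.g. "sc://localhost") determines which Spark Connect server to use.
+spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
+print(spark.range(5).count())
+spark.stop()
+```
+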
+ + +
+To use Spark Connect as part of a Scala application/project, we first need to include the right dependencies.
+Using the `sbt` build system as an example, we add the following dependency to the `build.sbt` file:
+
+```sbt
+libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "{{site.SPARK_VERSION_SHORT}}"
+```
+
+When writing your own code, include the `remote` function with a reference to
+your Spark server when you create a Spark session, as in this example:
+
+```scala
+import org.apache.spark.sql.SparkSession
+val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()
+```
+
+**Note**: Operations that reference user-defined code, such as UDFs, `filter`, `map`, etc., require a
+[ClassFinder](https://github.com/apache/spark/blob/master/sql/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ClassFinder.scala)
+to be registered to pick up and upload any required classfiles. Also, any JAR dependencies must be uploaded to the server using `SparkSession#addArtifact`.
+
+Example:
+
+```scala
+import org.apache.spark.sql.connect.client.REPLClassDirMonitor
+
+// Register a ClassFinder to monitor and upload the classfiles from the build output.
+val classFinder = new REPLClassDirMonitor(ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR)
+spark.registerClassFinder(classFinder)
+
+// Upload JAR dependencies
+spark.addArtifact(ABSOLUTE_PATH_JAR_DEP)
+```
+
+Here, `ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR` is the output directory where the build system writes classfiles
+and `ABSOLUTE_PATH_JAR_DEP` is the location of the JAR on the local file system.
+
+The `REPLClassDirMonitor` is a provided implementation of `ClassFinder` that monitors a specific directory,
+but you can implement your own class extending `ClassFinder` for customized search and monitoring.
+
+
+ +# Switching between Spark Connect and Spark Classic + +Spark provides the `spark.api.mode` configuration, enabling Spark Classic applications +to seamlessly switch to Spark Connect. Depending on the value of `spark.api.mode`, the application +can run in either Spark Classic or Spark Connect mode. + +Here is an example: + +```python +from pyspark.sql import SparkSession + +SparkSession.builder.config("spark.api.mode", "connect").master("...").getOrCreate() +``` + +You can also apply this configuration when submitting applications (Scala or Python) via `spark-submit`: + +```bash +spark-submit --master "..." --conf spark.api.mode=connect +``` + +Additionally, Spark Connect offers convenient options for local testing. By setting `spark.remote` +to `local[...]` or `local-cluster[...]`, you can start a local Spark Connect server and access a Spark +Connect session. + +This is similar to using `--conf spark.api.mode=connect` with `--master ...`. However, note that +`spark.remote` and `--remote` are limited to `local*` values, while `--conf spark.api.mode=connect` +with `--master ...` supports additional cluster URLs, such as spark://, for broader compatibility with +Spark Classic. diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md index 16bce4527fdab..0bc997260c0c3 100644 --- a/docs/sql-performance-tuning.md +++ b/docs/sql-performance-tuning.md @@ -499,7 +499,7 @@ The following SQL properties enable Storage Partition Join in different join que If Storage Partition Join is performed, the query plan will not contain Exchange nodes prior to the join. -The following example uses Iceberg ([https://iceberg.apache.org/docs/latest/spark-getting-started/](https://iceberg.apache.org/docs/latest/spark-getting-started/)), a Spark V2 DataSource that supports Storage Partition Join. +The following example uses [Iceberg](https://iceberg.apache.org/docs/latest/spark-getting-started/), a Spark V2 DataSource that supports Storage Partition Join. ```sql CREATE TABLE prod.db.target (id INT, salary INT, dep STRING) USING iceberg @@ -546,4 +546,4 @@ SET 'spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled' 'tru +- * Filter (7) +- * ColumnarToRow (6) +- BatchScan (5) -``` \ No newline at end of file +``` From b71dcb230772470c3648fe5a64a73413436fe74d Mon Sep 17 00:00:00 2001 From: Nicholas Chammas Date: Fri, 27 Jun 2025 14:42:21 -0400 Subject: [PATCH 2/2] show SPARK_HOME in example --- docs/spark-connect-setup.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/docs/spark-connect-setup.md b/docs/spark-connect-setup.md index a9ee8949935c1..a455e72af483f 100644 --- a/docs/spark-connect-setup.md +++ b/docs/spark-connect-setup.md @@ -40,10 +40,14 @@ tar -xvf spark-{{site.SPARK_VERSION_SHORT}}-bin-hadoop3.tgz In a terminal window, go to the `spark` folder in the location where you extracted Spark before and run the `start-connect-server.sh` script to start Spark server with -Spark Connect, like in this example: +Spark Connect. If you already have Spark installed and `SPARK_HOME` defined, you can use that too. ```bash +cd spark/ ./sbin/start-connect-server.sh + +# alternately +"$SPARK_HOME/sbin/start-connect-server.sh" ``` Make sure to use the same version of the package as the Spark version you