-
-
-
-## Use Spark Connect in standalone applications
-
-
-When creating a Spark session, you can specify that you want to use Spark Connect
-and there are a few ways to do that outlined as follows.
-
-If you do not use one of the mechanisms outlined here, your Spark session will
-work just like before, without leveraging Spark Connect.
-
-### Set SPARK_REMOTE environment variable
-
-If you set the `SPARK_REMOTE` environment variable on the client machine where your
-Spark client application is running and create a new Spark Session as in the following
-example, the session will be a Spark Connect session. With this approach, there is no
-code change needed to start using Spark Connect.
-
-In a terminal window, set the `SPARK_REMOTE` environment variable to point to the
-local Spark server you started previously on your computer:
-
-{% highlight bash %}
-export SPARK_REMOTE="sc://localhost"
-{% endhighlight %}
-
-And start the Spark shell as usual:
-
-{% highlight bash %}
-./bin/pyspark
-{% endhighlight %}
-
-The PySpark shell is now connected to Spark using Spark Connect as indicated in the welcome message:
-
-{% highlight python %}
-Client connected to the Spark Connect server at localhost
-{% endhighlight %}
-
-### Specify Spark Connect when creating Spark session
-
-You can also specify that you want to use Spark Connect explicitly when you
-create a Spark session.
-
-For example, you can launch the PySpark shell with Spark Connect as
-illustrated here.
-
-To launch the PySpark shell with Spark Connect, simply include the `remote`
-parameter and specify the location of your Spark server. We are using `localhost`
-in this example to connect to the local Spark server we started previously:
-
-{% highlight bash %}
-./bin/pyspark --remote "sc://localhost"
-{% endhighlight %}
-
-And you will notice that the PySpark shell welcome message tells you that
-you have connected to Spark using Spark Connect:
-
-{% highlight python %}
-Client connected to the Spark Connect server at localhost
-{% endhighlight %}
-
-You can also check the Spark session type. If it includes `.connect.` you
-are using Spark Connect as shown in this example:
-
-{% highlight python %}
-SparkSession available as 'spark'.
->>> type(spark)
-
-{% endhighlight %}
-
-Now you can run PySpark code in the shell to see Spark Connect in action:
-
-{% highlight python %}
->>> columns = ["id", "name"]
->>> data = [(1,"Sarah"), (2,"Maria")]
->>> df = spark.createDataFrame(data).toDF(*columns)
->>> df.show()
-+---+-----+
-| id| name|
-+---+-----+
-| 1|Sarah|
-| 2|Maria|
-+---+-----+
-{% endhighlight %}
-
-
-
-
-For the Scala shell, we use an Ammonite-based REPL. Otherwise, very similar with PySpark shell.
-
-{% highlight bash %}
-./bin/spark-shell --remote "sc://localhost"
-{% endhighlight %}
-
-A greeting message will appear when the REPL successfully initializes:
-{% highlight bash %}
-Welcome to
- ____ __
- / __/__ ___ _____/ /__
- _\ \/ _ \/ _ `/ __/ '_/
- /___/ .__/\_,_/_/ /_/\_\ version 4.1.0-SNAPSHOT
- /_/
-
-Type in expressions to have them evaluated.
-Spark session available as 'spark'.
-{% endhighlight %}
-
-By default, the REPL will attempt to connect to a local Spark Server.
-Run the following Scala code in the shell to see Spark Connect in action:
-
-{% highlight scala %}
-@ spark.range(10).count
-res0: Long = 10L
-{% endhighlight %}
-
-### Configure client-server connection
-
-By default, the REPL will attempt to connect to a local Spark Server on port 15002.
-The connection, however, may be configured in several ways as described in this configuration
-[reference](https://github.com/apache/spark/blob/master/sql/connect/docs/client-connection-string.md).
-
-#### Set SPARK_REMOTE environment variable
-
-The SPARK_REMOTE environment variable can be set on the client machine to customize the client-server
-connection that is initialized at REPL startup.
-
-{% highlight bash %}
-export SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG"
-./bin/spark-shell
-{% endhighlight %}
-
-or
-
-{% highlight bash %}
-SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG" spark-connect-repl
-{% endhighlight %}
-
-#### Configure programmatically with a connection string
-
-The connection may also be programmatically created using _SparkSession#builder_ as in this example:
-
-{% highlight scala %}
-@ import org.apache.spark.sql.SparkSession
-@ val spark = SparkSession.builder.remote("sc://localhost:443/;token=ABCDEFG").getOrCreate()
-{% endhighlight %}
-
-
-
-
-
-
-
-For more information on application development with Spark Connect as well as extending Spark Connect
-with custom functionality, see [Application Development with Spark Connect](app-dev-spark-connect.html).
# Client application authentication
While Spark Connect does not have built-in authentication, it is designed to
diff --git a/docs/app-dev-spark-connect.md b/docs/spark-connect-server-libs.md
similarity index 60%
rename from docs/app-dev-spark-connect.md
rename to docs/spark-connect-server-libs.md
index e61aa05d3f11e..583f80c3701c7 100644
--- a/docs/app-dev-spark-connect.md
+++ b/docs/spark-connect-server-libs.md
@@ -1,6 +1,6 @@
---
layout: global
-title: Application Development with Spark Connect
+title: Extending Spark with Spark Server Libraries
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
@@ -17,18 +17,6 @@ license: |
See the License for the specific language governing permissions and
limitations under the License.
---
-**Spark Connect Overview**
-
-In Apache Spark 3.4, Spark Connect introduced a decoupled client-server
-architecture that allows remote connectivity to Spark clusters using the
-DataFrame API and unresolved logical plans as the protocol. The separation
-between client and server allows Spark and its open ecosystem to be
-leveraged from everywhere. It can be embedded in modern data applications,
-in IDEs, Notebooks and programming languages.
-
-To learn more about Spark Connect, see [Spark Connect Overview](spark-connect-overview.html).
-
-# Redefining Spark Applications using Spark Connect
With its decoupled client-server architecture, Spark Connect simplifies how Spark Applications are
developed.
@@ -47,72 +35,20 @@ Client applications connect to Spark using the Spark Connect API, which is essen
DataFrame API and fully declarative.
-
-First, install PySpark with `pip install pyspark[connect]=={{site.SPARK_VERSION_SHORT}}` or if building a packaged PySpark application/library,
-add it your setup.py file as:
-{% highlight python %}
-install_requires=[
-'pyspark[connect]=={{site.SPARK_VERSION_SHORT}}'
-]
-{% endhighlight %}
-
-When writing your own code, include the `remote` function with a reference to
-your Spark server when you create a Spark session, as in this example:
-
-{% highlight python %}
-from pyspark.sql import SparkSession
-spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
-{% endhighlight %}
-
-
-For illustration purposes, we’ll create a simple Spark Connect application, SimpleApp.py:
-{% highlight python %}
-"""SimpleApp.py"""
-from pyspark.sql import SparkSession
-
-logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system
-spark = SparkSession.builder.remote("sc://localhost").appName("SimpleApp").getOrCreate()
-logData = spark.read.text(logFile).cache()
-
-numAs = logData.filter(logData.value.contains('a')).count()
-numBs = logData.filter(logData.value.contains('b')).count()
-
-print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
-
-spark.stop()
-{% endhighlight %}
-
-This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file.
-Note that you’ll need to replace YOUR_SPARK_HOME with the location where Spark is installed.
-
-We can run this application with the regular Python interpreter as follows:
-{% highlight python %}
-# Use the Python interpreter to run your application
-$ python SimpleApp.py
-...
-Lines with a: 72, lines with b: 39
-{% endhighlight %}
-
-
-
-
-To use Spark Connect as part of a Scala application/project, we first need to include the right dependencies.
-Using the `sbt` build system as an example, we add the following dependencies to the `build.sbt` file:
-{% highlight sbt %}
-libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "{{site.SPARK_VERSION_SHORT}}"
-{% endhighlight %}
-
-When writing your own code, include the `remote` function with a reference to
-your Spark server when you create a Spark session, as in this example:
-
-{% highlight scala %}
-import org.apache.spark.sql.SparkSession
-val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()
-{% endhighlight %}
-
-
-**Note**: Operations that reference User Defined Code such as UDFs, filter, map, etc require a
-[ClassFinder](https://github.com/apache/spark/blob/master/sql/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ClassFinder.scala)
-to be registered to pickup and upload any required classfiles. Also, any JAR dependencies must be uploaded to the server using `SparkSession#AddArtifact`.
-
-Example:
-{% highlight scala %}
-import org.apache.spark.sql.connect.client.REPLClassDirMonitor
-// Register a ClassFinder to monitor and upload the classfiles from the build output.
-val classFinder = new REPLClassDirMonitor()
-spark.registerClassFinder(classFinder)
-
-// Upload JAR dependencies
-spark.addArtifact()
-{% endhighlight %}
-Here, `ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR` is the output directory where the build system writes classfiles into
-and `ABSOLUTE_PATH_JAR_DEP` is the location of the JAR on the local file system.
-
-The `REPLClassDirMonitor` is a provided implementation of `ClassFinder` that monitors a specific directory but
-one may implement their own class extending `ClassFinder` for customized search and monitoring.
-
-
-
-
+
+
-
-
+
+
+
+## Use Spark Connect in standalone applications
+
+
+When creating a Spark session, you can specify that you want to use Spark Connect.
+There are a few ways to do that, outlined below.
+
+If you do not use one of the mechanisms outlined here, your Spark session will
+work just like before, without leveraging Spark Connect.
+
+### Set SPARK_REMOTE environment variable
+
+If you set the `SPARK_REMOTE` environment variable on the client machine where your
+Spark client application is running and then create a new Spark session, as in the
+following example, the session will be a Spark Connect session. With this approach,
+no code change is needed to start using Spark Connect.
+
+In a terminal window, set the `SPARK_REMOTE` environment variable to point to the
+local Spark server you started previously on your computer:
+
+```bash
+export SPARK_REMOTE="sc://localhost"
+```
+
+And start the Spark shell as usual:
+
+```bash
+./bin/pyspark
+```
+
+The PySpark shell is now connected to Spark using Spark Connect as indicated in the welcome message:
+
+```python
+Client connected to the Spark Connect server at localhost
+```
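+
+The same applies to a standalone Python script: with `SPARK_REMOTE` set in the
+environment, a plain `getOrCreate()` returns a Spark Connect session without any
+code changes. A minimal sketch:
+
+```python
+from pyspark.sql import SparkSession
+
+# SPARK_REMOTE (e.g. "sc://localhost") is read from the environment,
+# so no connection details are hard-coded here.
+spark = SparkSession.builder.getOrCreate()
+print(type(spark))  # includes `.connect.` when Spark Connect is in use
+```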
+
+### Specify Spark Connect when creating Spark session
+
+You can also specify that you want to use Spark Connect explicitly when you
+create a Spark session.
+
+For example, you can launch the PySpark shell with Spark Connect by including
+the `--remote` option and specifying the location of your Spark server. We are
+using `localhost` in this example to connect to the local Spark server we
+started previously:
+
+```bash
+./bin/pyspark --remote "sc://localhost"
+```
+
+And you will notice that the PySpark shell welcome message tells you that
+you have connected to Spark using Spark Connect:
+
+```python
+Client connected to the Spark Connect server at localhost
+```
+
+You can also check the Spark session type. If it includes `.connect.`, you are
+using Spark Connect, as shown in this example:
+
+```python
+SparkSession available as 'spark'.
+>>> type(spark)
+<class 'pyspark.sql.connect.session.SparkSession'>
+```
+
+Now you can run PySpark code in the shell to see Spark Connect in action:
+
+```python
+>>> columns = ["id", "name"]
+>>> data = [(1,"Sarah"), (2,"Maria")]
+>>> df = spark.createDataFrame(data).toDF(*columns)
+>>> df.show()
++---+-----+
+| id| name|
++---+-----+
+| 1|Sarah|
+| 2|Maria|
++---+-----+
+```
+
+
+
+
+For the Scala shell, we use an Ammonite-based REPL; otherwise, it is very similar to the PySpark shell.
+
+```bash
+./bin/spark-shell --remote "sc://localhost"
+```
+
+A greeting message will appear when the REPL successfully initializes:
+```bash
+Welcome to
+ ____ __
+ / __/__ ___ _____/ /__
+ _\ \/ _ \/ _ `/ __/ '_/
+ /___/ .__/\_,_/_/ /_/\_\ version 4.1.0-SNAPSHOT
+ /_/
+
+Type in expressions to have them evaluated.
+Spark session available as 'spark'.
+```
+
+By default, the REPL will attempt to connect to a local Spark server.
+Run the following Scala code in the shell to see Spark Connect in action:
+
+```scala
+@ spark.range(10).count
+res0: Long = 10L
+```
+
+### Configure client-server connection
+
+By default, the REPL will attempt to connect to a local Spark server on port 15002.
+The connection, however, may be configured in several ways as described in this configuration
+[reference](https://github.com/apache/spark/blob/master/sql/connect/docs/client-connection-string.md).
+
+#### Set SPARK_REMOTE environment variable
+
+The `SPARK_REMOTE` environment variable can be set on the client machine to customize the
+client-server connection that is initialized at REPL startup.
+
+```bash
+export SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG"
+./bin/spark-shell
+```
+
+or
+
+```bash
+SPARK_REMOTE="sc://myhost.com:443/;token=ABCDEFG" spark-connect-repl
+```
+
+#### Configure programmatically with a connection string
+
+The connection may also be created programmatically using `SparkSession#builder`, as in this example:
+
+```scala
+@ import org.apache.spark.sql.SparkSession
+@ val spark = SparkSession.builder.remote("sc://localhost:443/;token=ABCDEFG").getOrCreate()
+```
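+
+A PySpark equivalent is a sketch along these lines (the host, port, and token are
+placeholders, as above):
+
+```python
+from pyspark.sql import SparkSession
+
+# Placeholder connection string; see the connection string reference above
+# for the supported options.
+spark = (
+    SparkSession.builder
+    .remote("sc://localhost:443/;token=ABCDEFG")
+    .getOrCreate()
+)
+```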
+
+
+
+
+
+
+# Switching between Spark Connect and Spark Classic
+
+Spark provides the `spark.api.mode` configuration, enabling Spark Classic applications
+to seamlessly switch to Spark Connect. Depending on the value of `spark.api.mode`, the application
+can run in either Spark Classic or Spark Connect mode.
+
+Here is an example:
+
+```python
+from pyspark.sql import SparkSession
+
+spark = SparkSession.builder.config("spark.api.mode", "connect").master("...").getOrCreate()
+```
+
+You can also apply this configuration when submitting applications (Scala or Python) via `spark-submit`:
+
+```bash
+spark-submit --master "..." --conf spark.api.mode=connect
+```
+
+Additionally, Spark Connect offers convenient options for local testing. By setting `spark.remote`
+to `local[...]` or `local-cluster[...]`, you can start a local Spark Connect server and access a Spark
+Connect session.
+
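+For instance, a minimal sketch of a local test session (assuming a local Spark
+installation; `local[2]` is an arbitrary thread count):
+
+```python
+from pyspark.sql import SparkSession
+
+# Starts a local Spark Connect server and returns a Spark Connect session.
+spark = SparkSession.builder.remote("local[2]").getOrCreate()
+spark.range(5).show()
+```
+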
+This is similar to using `--conf spark.api.mode=connect` with `--master ...`. However, note that
+`spark.remote` and `--remote` are limited to `local*` values, while `--conf spark.api.mode=connect`
+with `--master ...` supports additional cluster URLs, such as `spark://`, for broader compatibility with
+Spark Classic.
diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 16bce4527fdab..0bc997260c0c3 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -499,7 +499,7 @@ The following SQL properties enable Storage Partition Join in different join que
If Storage Partition Join is performed, the query plan will not contain Exchange nodes prior to the join.
-The following example uses Iceberg ([https://iceberg.apache.org/docs/latest/spark-getting-started/](https://iceberg.apache.org/docs/latest/spark-getting-started/)), a Spark V2 DataSource that supports Storage Partition Join.
+The following example uses [Iceberg](https://iceberg.apache.org/docs/latest/spark-getting-started/), a Spark V2 DataSource that supports Storage Partition Join.
```sql
CREATE TABLE prod.db.target (id INT, salary INT, dep STRING)
USING iceberg
@@ -546,4 +546,4 @@ SET 'spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled' 'tru
+- * Filter (7)
+- * ColumnarToRow (6)
+- BatchScan (5)
-```
\ No newline at end of file
+```
+
+First, install PySpark with `pip install pyspark[connect]=={{site.SPARK_VERSION_SHORT}}` or, if building
+a packaged PySpark application/library, add it to your `setup.py` file as:
+```python
+install_requires=[
+ 'pyspark[connect]=={{site.SPARK_VERSION_SHORT}}'
+]
+```
+
+When writing your own code, include the `remote` function with a reference to
+your Spark server when you create a Spark session, as in this example:
+
+```python
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.remote("sc://localhost").getOrCreate()
+```
+
+
+For illustration purposes, we’ll create a simple Spark Connect application, SimpleApp.py:
+```python
+"""SimpleApp.py"""
+from pyspark.sql import SparkSession
+
+logFile = "YOUR_SPARK_HOME/README.md" # Should be some file on your system
+spark = SparkSession.builder.remote("sc://localhost").appName("SimpleApp").getOrCreate()
+logData = spark.read.text(logFile).cache()
+
+numAs = logData.filter(logData.value.contains('a')).count()
+numBs = logData.filter(logData.value.contains('b')).count()
+
+print("Lines with a: %i, lines with b: %i" % (numAs, numBs))
+
+spark.stop()
+```
+
+This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file.
+Note that you’ll need to replace `YOUR_SPARK_HOME` with the location where Spark is installed.
+
+We can run this application with the regular Python interpreter as follows:
+```bash
+# Use the Python interpreter to run your application
+$ python SimpleApp.py
+...
+Lines with a: 72, lines with b: 39
+```
+
+
+
+
+To use Spark Connect as part of a Scala application/project, we first need to include the right dependencies.
+Using the `sbt` build system as an example, we add the following dependency to the `build.sbt` file:
+```sbt
+libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "{{site.SPARK_VERSION_SHORT}}"
+```
+
+When writing your own code, include the `remote` function with a reference to
+your Spark server when you create a Spark session, as in this example:
+
+```scala
+import org.apache.spark.sql.SparkSession
+val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()
+```
+
+
+**Note**: Operations that reference user-defined code such as UDFs, `filter`, `map`, etc. require a
+[ClassFinder](https://github.com/apache/spark/blob/master/sql/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ClassFinder.scala)
+to be registered to pick up and upload any required classfiles. Also, any JAR dependencies must be uploaded to the server using `SparkSession#addArtifact`.
+
+Example:
+```scala
+import org.apache.spark.sql.connect.client.REPLClassDirMonitor
+// Register a ClassFinder to monitor and upload the classfiles from the build output.
+val classFinder = new REPLClassDirMonitor(ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR)
+spark.registerClassFinder(classFinder)
+
+// Upload JAR dependencies
+spark.addArtifact(ABSOLUTE_PATH_JAR_DEP)
+```
+Here, `ABSOLUTE_PATH_TO_BUILD_OUTPUT_DIR` is the output directory where the build system writes classfiles,
+and `ABSOLUTE_PATH_JAR_DEP` is the location of the JAR on the local file system.
+
+`REPLClassDirMonitor` is a provided implementation of `ClassFinder` that monitors a specific directory,
+but you can implement your own class extending `ClassFinder` for customized search and monitoring.
+
+
+