Spark on azure_PV to sign off #2247

Merged
---
title: Run Spark applications on Microsoft Azure Cobalt 100 processors

minutes_to_complete: 60

who_is_this_for: This is an advanced topic that introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm.

learning_objectives:
- Provision an Azure Arm64 virtual machine using Azure console
- Learn how to create an Azure Linux 3.0 Docker container
- Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container or an Azure Linux 3.0 custom-image based Azure virtual machine
- Run a suite of Spark benchmarks to understand and evaluate performance on the Azure Cobalt 100 virtual machine

prerequisites:
- A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6)
- A machine with [Docker](/install-guides/docker/) installed
- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/)

author: Pareena Verma

tools_software_languages:
- Python
- Docker


operatingsystems:
- Linux

further_reading:
link: https://hadoop.apache.org/
type: website


### FIXED, DO NOT MODIFY
# ================================================================================
weight: 1 # _index.md always has weight of 1 to order correctly
---

---
title: "Overview"

title: Getting started with Microsoft Azure Cobalt 100, Azure Linux 3.0, and Apache Spark
weight: 2

### FIXED, DO NOT MODIFY
layout: "learningpathall"
---

## Key technologies for running Apache Spark on Azure Cobalt 100

This section introduces the key technologies you will use when running Spark applications on Microsoft Azure Cobalt 100 processors. You will learn about the Azure Cobalt 100 Arm-based processor, Azure Linux 3.0, and Apache Spark.

## Azure Cobalt 100 processor

Azure Cobalt 100 is Microsoft’s first-generation Arm-based processor, designed for cloud-native, scale-out Linux workloads. Based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency. Running at 3.4 GHz, it provides a dedicated physical core for each vCPU, ensuring consistent and predictable performance.

Typical workloads include web and application servers, data analytics, open-source databases, and caching systems.

To learn more, see the Microsoft blog [Announcing the preview of new Azure virtual machines based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).

## Azure Linux 3.0

Azure Linux 3.0 is Microsoft’s lightweight Linux distribution optimized for cloud-native workloads on Azure. It is designed for performance, security, and reliability. Azure Linux 3.0 is tailored for containers, microservices, and Kubernetes.

With native support for Arm64 (AArch64) architecture, Azure Linux 3.0 enables efficient execution of workloads on Arm-based infrastructure, making it a scalable and cost-effective choice for cloud deployments.

## Apache Spark

Apache Spark is an open-source, distributed computing system for fast, general-purpose big data processing. It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for improved performance.

Spark is widely used for large-scale data analytics, machine learning, and real-time data processing.
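As a minimal illustration of the DataFrame API and in-memory computation, consider the sketch below. It is not part of the Learning Path steps, and the session and application names are illustrative:

```python
from pyspark.sql import SparkSession

# Minimal sketch: cache a DataFrame so repeated actions are served from memory
spark = SparkSession.builder.appName("OverviewSketch").getOrCreate()

df = spark.range(1_000_000)               # single-column DataFrame of ids
df.cache()                                # keep the data in memory after first use
print(df.count())                         # first action materializes the cache
print(df.filter(df.id % 2 == 0).count())  # reuses the cached data
spark.stop()
```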

Learn more at the [Apache Spark official website](https://spark.apache.org/) and in the [official documentation](https://spark.apache.org/docs/latest/).
---
title: Validate Apache Spark on Azure Cobalt 100 Arm64 VMs
weight: 6

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Run a functional test of Apache Spark on Azure Cobalt 100

After installing Apache Spark on your Arm64 virtual machine, you can perform simple baseline testing to validate that Spark runs correctly and produces the expected output.

## Create a test Spark application

Use a text editor of your choice to create a file named `test_spark.py` with the following content:

```python
from pyspark.sql import SparkSession
# Start a local Spark session (the application name here is illustrative)
spark = SparkSession.builder.appName("SparkValidation").getOrCreate()

# Create a small DataFrame and display it
df = spark.createDataFrame([(1, "ARM64"), (2, "Azure")], ["id", "name"])
df.show()
spark.stop()
```

## Run the Spark application

Execute the test script with:

```console
spark-submit test_spark.py
```

## Example output

You should see output similar to:

```output
25/07/22 05:16:00 INFO CodeGenerator: Code generated in 10.545923 ms
+---+-----+
| id| name|
+---+-----+
|  1|ARM64|
|  2|Azure|
+---+-----+
```
## Output summary

- Spark successfully generated code (10.5 ms) and executed a simple DataFrame operation.
- The test data **[1, "ARM64"]** and **[2, "Azure"]** were displayed, and the job shut down cleanly (exitCode 0).
- This confirms a working Spark deployment on Arm64.
---
title: Benchmark Apache Spark
weight: 7

### FIXED, DO NOT MODIFY
layout: learningpathall
---

## Benchmark Apache Spark on Azure Cobalt 100 Arm-based instances and x86_64 instances

Apache Spark includes internal micro-benchmarks to evaluate the performance of core components such as SQL execution, aggregation, joins, and data source reads. These benchmarks are useful for comparing performance on Arm64 and x86_64 platforms in Azure.

This section shows you how to run Spark’s built-in SQL benchmarks using the SBT-based framework.

## Steps to run Spark benchmarks

1. Clone the Apache Spark source code

```console
git clone https://github.com/apache/spark.git
```
This downloads the full Spark source code, including test suites and benchmarking tools.

2. Check out the desired Spark version

```console
cd spark/ && git checkout v4.0.0
```
Switch to the stable Spark 4.0.0 release, which supports the latest benchmarking APIs.

3. Build Spark with the benchmarking profile

```console
./build/sbt -Pbenchmarks clean package
```
This compiles Spark and its dependencies, enabling the benchmarking build profile.

4. Run a built-in benchmark suite

```console
./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"
```
This runs the `JoinBenchmark`, which measures the performance of SQL join operations such as `SortMergeJoin` and `BroadcastHashJoin`. It evaluates how Spark SQL optimizes join strategies, especially with and without WholeStageCodegen, a technique that compiles entire query stages into efficient bytecode for faster execution. A quick interactive spot-check of this setting is shown below.
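You can also observe the effect of WholeStageCodegen interactively, without the SBT harness. The following is a minimal sketch, not part of the benchmark suite: it assumes a working PySpark installation, uses a synthetic self-join, and toggles `spark.sql.codegen.wholeStage`, the standard configuration flag controlling this optimization.

```python
import time
from pyspark.sql import SparkSession

# Illustrative sketch: time a simple join with codegen on and off
spark = SparkSession.builder.appName("CodegenSpotCheck").getOrCreate()

for flag in ("true", "false"):
    spark.conf.set("spark.sql.codegen.wholeStage", flag)
    a = spark.range(10_000_000)
    b = spark.range(10_000_000).withColumnRenamed("id", "id2")
    start = time.time()
    a.join(b, a["id"] == b["id2"]).count()   # count() forces execution
    print(f"wholeStage={flag}: {time.time() - start:.2f}s")

spark.stop()
```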

## Example Apache Spark benchmark output

You should see output similar to the following:

```output
[info] Running benchmark: Join w long
[info] Running case: Join w long wholestage off
...
[info] broadcast nested loop join wholestage on 18857 18928 84 1.1 899.2 1.4X
[success] Total time: 1644 s (27:24), completed Jul 25, 2025, 6:27:46 AM
```
## Benchmark results table explained

- **Best Time (ms):** Fastest execution time observed (in milliseconds).
- **Avg Time (ms):** Average time across all iterations.
- **Stdev (ms):** Standard deviation of the timing across iterations.
- **Rate (M/s):** Throughput in millions of rows per second.
- **Per Row (ns):** Average processing time per row, in nanoseconds.
- **Relative:** Speed relative to the baseline (1.0X), which is the slower version.
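As a sanity check, the derived columns can be reproduced from the timing figures. The short sketch below uses the broadcast nested loop join row (wholestage on) from the Arm64 summary table; the row count is inferred from the table's Per Row figure, not taken from the benchmark source.

```python
# Worked check of the Rate and Per Row columns
rows = 20_971_520          # assumed row count, inferred from Per Row (ns)
best_time_ms = 18857       # Best Time (ms) from the Arm64 summary table

rate_m_per_s = rows / (best_time_ms / 1000) / 1e6   # millions of rows per second
per_row_ns = best_time_ms * 1e6 / rows              # nanoseconds per row

print(f"Rate: {rate_m_per_s:.1f} M/s")    # ~1.1 M/s, matching the table
print(f"Per row: {per_row_ns:.1f} ns")    # ~899.2 ns, matching the table
```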

{{% notice Note %}}
Benchmark results on Azure Linux 3.0 were consistent across both Docker containers and virtual machines.
{{% /notice %}}


## Benchmark summary on Arm64

For easier comparison, shown here is a summary of benchmark results collected on an Arm64 `D4ps_v6` Azure virtual machine created from a custom Azure Linux 3.0 image using the AArch64 ISO.

| Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
|----------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
| ... | ... | ... | ... | ... | ... | ... | ... |
| Broadcast nested loop join | Off | 26847 | 26870 | 32 | 0.8 | 1280.2 | 1.0X |
| | On | 18857 | 18928 | 84 | 1.1 | 899.2 | 1.4X |

## Benchmark summary on x86_64

Shown here is a summary of the benchmark results collected on an `x86_64` `D4s_v4` Azure virtual machine using the Azure Linux 3.0 image published by Ntegral Inc.

| Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
|------------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
| ... | ... | ... | ... | ... | ... | ... | ... |
| | On | 31254 | 31346 | 78 | 0.7 | 1490.3 | 1.2X |


## Benchmark comparison insights

When comparing the results on Arm64 vs x86_64 virtual machines:

- Whole-stage codegen improves performance by up to 2.8× on complex joins
- Simple joins, such as integer joins, show negligible performance differences
- Broadcast and shuffle-based joins achieve 1.4× to 1.5× improvements
- Enabling whole-stage codegen consistently improves performance across most join types

You have now benchmarked Apache Spark on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64.