Commit 5f8fee9

Merge pull request #2247 from madeline-underwood/spark_on_azure
Spark on azure_PV to sign off
2 parents f4cf235 + 3d8e5af commit 5f8fee9

File tree: 7 files changed, +139 -102 lines changed

content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md

Lines changed: 8 additions & 14 deletions
@@ -1,24 +1,20 @@
 ---
-title: Run Spark applications on the Microsoft Azure Cobalt 100 processors
-
-draft: true
-cascade:
-  draft: true
+title: Run Spark applications on Microsoft Azure Cobalt 100 processors
 
 minutes_to_complete: 60
 
 who_is_this_for: This is an advanced topic that introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm.
 
 learning_objectives:
-- Provision an Azure Arm64 virtual machine using Azure console.
-- Learn how to create an Azure Linux 3.0 Docker container.
-- Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container or an Azure Linux 3.0 custom-image based Azure virtual machine.
-- Run a suite of Spark benchmarks to understand and evaluate performance on the Azure Cobalt 100 virtual machine.
+- Provision an Azure Arm64 virtual machine using Azure console
+- Learn how to create an Azure Linux 3.0 Docker container
+- Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container or an Azure Linux 3.0 custom-image based Azure virtual machine
+- Run a suite of Spark benchmarks to understand and evaluate performance on the Azure Cobalt 100 virtual machine
 
 prerequisites:
-- A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6).
-- A machine with [Docker](/install-guides/docker/) installed.
-- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).
+- A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6)
+- A machine with [Docker](/install-guides/docker/) installed
+- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/)
 
 author: Pareena Verma
 

@@ -35,7 +31,6 @@ tools_software_languages:
 - Python
 - Docker
 
-
 operatingsystems:
 - Linux
 

@@ -61,7 +56,6 @@ further_reading:
 link: https://hadoop.apache.org/
 type: website
 
-
 ### FIXED, DO NOT MODIFY
 # ================================================================================
 weight: 1 # _index.md always has weight of 1 to order correctly
Lines changed: 19 additions & 11 deletions
@@ -1,25 +1,33 @@
 ---
-title: "Overview"
-
+title: Getting started with Microsoft Azure Cobalt 100, Azure Linux 3.0, and Apache Spark
 weight: 2
 
+### FIXED, DO NOT MODIFY
 layout: "learningpathall"
 ---
 
-## What is the Azure Cobalt 100 processor?
+## Key technologies for running Apache Spark on Azure Cobalt 100
+
+This section introduces the key technologies you will use when running Spark applications on Microsoft Azure Cobalt 100 processors. You will learn about the Azure Cobalt 100 Arm-based processor, Azure Linux 3.0, and Apache Spark.
+
+## Azure Cobalt 100 processor
+
+Azure Cobalt 100 is Microsoft’s first-generation Arm-based processor, designed for cloud-native, scale-out Linux workloads. Based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency. Running at 3.4 GHz, it provides a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+
+Typical workloads include web and application servers, data analytics, open-source databases, and caching systems.
 
-Azure’s Cobalt 100 is built on Microsoft's first-generation Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+To learn more, see the Microsoft blog [Announcing the preview of new Azure virtual machines based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
 
-To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
+## Azure Linux 3.0
 
-## Introduction to Azure Linux 3.0
+Azure Linux 3.0 is Microsoft’s lightweight Linux distribution optimized for cloud-native workloads on Azure. It is designed for performance, security, and reliability. Azure Linux 3.0 is tailored for containers, microservices, and Kubernetes.
 
-Azure Linux 3.0 is Microsoft's in-house, lightweight Linux distribution optimized for running cloud-native workloads on Azure. Designed with performance, security, and reliability in mind, it is fully supported by Microsoft and tailored for containers, microservices, and Kubernetes. With native support for Arm64 (AArch64) architecture, Azure Linux 3.0 enables efficient execution of workloads on energy-efficient Arm-based infrastructure, making it a powerful choice for scalable and cost-effective cloud deployments.
+With native support for Arm64 (AArch64) architecture, Azure Linux 3.0 enables efficient execution of workloads on Arm-based infrastructure, making it a scalable and cost-effective choice for cloud deployments.
 
-## Apache Spark
+## Apache Spark
 
-Apache Spark is an open-source, distributed computing system designed for fast and general-purpose big data processing.
+Apache Spark is an open-source, distributed computing system for fast, general-purpose big data processing. It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for improved performance.
 
-It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for increased performance.
+Spark is widely used for large-scale data analytics, machine learning, and real-time data processing.
 
-Spark is widely used for large-scale data analytics, machine learning, and real-time data processing. Learn more from the [Apache Spark official website](https://spark.apache.org/) and its [detailed official documentation](https://spark.apache.org/docs/latest/).
+Learn more at the [Apache Spark official website](https://spark.apache.org/) and in the [official documentation](https://spark.apache.org/docs/latest/).
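
For readers new to Spark, the snippet below is a minimal PySpark sketch (not part of the files changed in this commit; the application name and data are illustrative) of the high-level DataFrame API and in-memory caching that the overview text describes:

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster, point master() at your cluster manager instead
spark = SparkSession.builder.appName("overview-example").master("local[*]").getOrCreate()

# Build a small DataFrame using the high-level Python API
df = spark.createDataFrame([("spark", 3), ("hadoop", 1), ("spark", 2)], ["word", "count"])

# cache() keeps the aggregated result in memory, illustrating Spark's in-memory computation
totals = df.groupBy("word").sum("count").cache()
totals.show()

spark.stop()
```

The same script runs unchanged on Arm64 and x86_64, which is what makes the benchmark comparison later in this commit meaningful.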
Lines changed: 20 additions & 9 deletions
@@ -1,16 +1,18 @@
 ---
-title: Functional Validation
+title: Validate Apache Spark on Azure Cobalt 100 Arm64 VMs
 weight: 6
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
+## Run a functional test of Apache Spark on Azure Cobalt 100
 
-## Functional Validation
-Since Apache Spark is installed successfully on your Arm virtual machine, let's now perform simple baseline testing to validate that Spark runs correctly and gives expected output.
+After installing Apache Spark on your Arm64 virtual machine, you can perform simple baseline testing to validate that Spark runs correctly and produces the expected output.
 
-Using a file editor of your choice, create a file named `test_spark.py`, and add the below content to it:
+## Create a test Spark application
+
+Use a text editor of your choice to create a file named `test_spark.py` with the following content:
 
 ```python
 from pyspark.sql import SparkSession

@@ -19,11 +21,18 @@ df = spark.createDataFrame([(1, "ARM64"), (2, "Azure")], ["id", "name"])
 df.show()
 spark.stop()
 ```
-Execute with:
+
+## Run the Spark application
+
+Execute the test script with:
+
 ```console
 spark-submit test_spark.py
 ```
-You should see an output similar to:
+
+## Example output
+
+You should see output similar to:
 
 ```output
 25/07/22 05:16:00 INFO CodeGenerator: Code generated in 10.545923 ms

@@ -35,7 +44,9 @@ You should see an output similar to:
 | 2|Azure|
 +---+-----+
 ```
-Output summary:
 
-- The output shows Spark successfully generated code **(10.5ms)** and executed a simple DataFrame operation.
-- Displaying the test data **[1, "ARM64"]** and **[2, "Azure"]** before cleanly shutting down **(exitCode 0)**. This confirms a working Spark deployment on Arm64.
+## Output summary
+
+- Spark successfully generated code (10.5 ms) and executed a simple DataFrame operation.
+- The test data **[1, "ARM64"]** and **[2, "Azure"]** was displayed before cleanly shutting down (exitCode 0).
+- This confirms a working Spark deployment on Arm64.
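
The hunks above omit the line that creates the SparkSession, so the full `test_spark.py` is not reproduced in this diff. A minimal script consistent with the visible lines might look like the following sketch (the application name is an assumption, not taken from the commit):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the appName value here is illustrative only
spark = SparkSession.builder.appName("ARM64-Test").getOrCreate()

# Two-row DataFrame matching the table shown in the example output
df = spark.createDataFrame([(1, "ARM64"), (2, "Azure")], ["id", "name"])
df.show()

# Stop the session so spark-submit exits cleanly (exitCode 0)
spark.stop()
```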

content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md

Lines changed: 47 additions & 36 deletions
@@ -1,40 +1,51 @@
 ---
-title: Benchmark Spark
+title: Benchmark Apache Spark
 weight: 7
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-## Apache Spark Internal Benchmarking
-Apache Spark includes internal micro-benchmarks to evaluate the performance of core components like SQL execution, aggregation, joins, and data source reads. These benchmarks are helpful for comparing platforms such as x86_64 vs Arm64.
-Below are the steps to run Spark’s built-in SQL benchmarks using the SBT-based framework.
+## Benchmark Apache Spark on Azure Cobalt 100 Arm-based instances and x86_64 instances
 
-1. Clone the Apache Spark source code
-```console
-git clone https://github.com/apache/spark.git
-```
-This downloads the full Spark source including internal test suites and the benchmarking tools.
+Apache Spark includes internal micro-benchmarks to evaluate the performance of core components such as SQL execution, aggregation, joins, and data source reads. These benchmarks are useful for comparing performance on Arm64 and x86_64 platforms in Azure.
 
-2. Checkout the desired Spark version
-```console
-cd spark/ && git checkout v4.0.0
-```
-Switch to the stable Spark 4.0.0 release, which supports the latest internal benchmarking APIs.
+This section shows you how to run Spark’s built-in SQL benchmarks using the SBT-based framework.
 
-3. Build Spark with benchmarking profile enabled
-```console
-./build/sbt -Pbenchmarks clean package
-```
-This compiles Spark and its dependencies, enabling the benchmarks build profile for performance testing.
+## Steps to run Spark benchmarks
 
-4. Run a built-in benchmark suite
-```console
-./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"
-```
-This executes the `JoinBenchmark`, which measures the performance of various SQL join operations (e.g., SortMergeJoin, BroadcastHashJoin) under different query plans. It helps evaluate how Spark SQL optimizes and executes join strategies, especially with and without WholeStageCodegen, a technique that compiles entire query stages into efficient bytecode for faster execution.
+1. Clone the Apache Spark source code
+
+```console
+git clone https://github.com/apache/spark.git
+```
+This downloads the full Spark source code, including test suites and benchmarking tools
+
+2. Checkout the desired Spark version
+
+```console
+cd spark/ && git checkout v4.0.0
+```
+Switch to the stable Spark 4.0.0 release, which supports the latest benchmarking APIs
+
+3. Build Spark with the benchmarking profile
+
+```console
+./build/sbt -Pbenchmarks clean package
+```
+This compiles Spark and its dependencies, enabling the benchmarking build profile
+
+4. Run a built-in benchmark suite
+
+```console
+./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"
+```
+This runs the `JoinBenchmark`, which measures the performance of SQL join operations such as `SortMergeJoin` and `BroadcastHashJoin`. It evaluates how Spark SQL optimizes join strategies, especially with and without WholeStageCodegen
+
+## Example Apache Spark benchmark output
+
+You should see output similar to the following:
 
-The output should look similar to:
 ```output
 [info] Running benchmark: Join w long
 [info] Running case: Join w long wholestage off

@@ -170,7 +181,7 @@ The output should look similar to:
 [info] broadcast nested loop join wholestage on 18857 18928 84 1.1 899.2 1.4X
 [success] Total time: 1644 s (27:24), completed Jul 25, 2025, 6:27:46 AM
 ```
-### Benchmark Results Table Explained:
+## Benchmark Results Table Explained:
 
 - **Best Time (ms):** Fastest execution time observed (in milliseconds).
 - **Avg Time (ms):** Average time across all iterations.

@@ -180,11 +191,11 @@ The output should look similar to:
 - **Relative Speed comparison:** baseline (1.0X) is the slower version.
 
 {{% notice Note %}}
-Benchmarking was performed in both an Azure Linux 3.0 Docker container and an Azure Linux 3.0 virtual machine. The benchmark results were found to be comparable.
+Benchmark results on Azure Linux 3.0 were consistent across both Docker containers and virtual machines.
 {{% /notice %}}
 
 
-### Benchmark summary on Arm64:
+## Benchmark summary on Arm64:
 For easier comparison, shown here is a summary of benchmark results collected on an Arm64 `D4ps_v6` Azure virtual machine created from a custom Azure Linux 3.0 image using the AArch64 ISO.
 | Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
 |----------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|

@@ -211,7 +222,7 @@ For easier comparison, shown here is a summary of benchmark results collected on
 | Broadcast nested loop join | Off | 26847 | 26870 | 32 | 0.8 | 1280.2 | 1.0X |
 | | On | 18857 | 18928 | 84 | 1.1 | 899.2 | 1.4X |
 
-### Benchmark summary on x86_64:
+## Benchmark summary on x86_64:
 Shown here is a summary of the benchmark results collected on an `x86_64` `D4s_v4` Azure virtual machine using the Azure Linux 3.0 image published by Ntegral Inc.
 | Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
 |------------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|

@@ -239,13 +250,13 @@ Shown here is a summary of the benchmark results collected on an `x86_64` `D4s_v
 | | On | 31254 | 31346 | 78 | 0.7 | 1490.3 | 1.2X |
 
 
-### Benchmark comparison insights
+## Benchmark comparison insights
 
-When you compare the benchmark results you will notice that on the Azure Linux Arm64 virtual machine:
+When comparing the results on Arm64 vs x86_64 virtual machines:
 
-- Whole-stage codegen improves performance by up to 2.8× on complex joins (e.g., with long columns).
-- Simple joins (e.g., on integers) show negligible performance gain, remains comparable to performance on `x86_64`.
-- Broadcast and shuffle-based joins benefit with 1.4× to 1.5× improvements.
-- Overall enabling whole-stage codegen consistently improves performance across most join types.
+- Whole-stage codegen improves performance by up to 2.8× on complex joins
+- Simple joins, such as integer joins, show negligible performance differences
+- Broadcast and shuffle-based joins achieve 1.4× to 1.5× improvements
+- Enabling whole-stage codegen consistently improves performance across most join types
 
-You have successfully learnt how to deploy Apache Spark on an Azure Cobalt 100 virtual machine and measure the performance uplift.
+You have now benchmarked Apache Spark on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64.
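
As a quick sanity check on how the columns in these benchmark tables relate, Per Row (ns) is roughly the inverse of Rate (M/s), and Relative is the ratio of the baseline (1.0X) best time to the row's best time. The short Python sketch below (illustrative, not part of the commit) reproduces the broadcast nested loop join row from the Arm64 summary:

```python
# Illustrative check of the relationships between the benchmark table columns,
# using the "Broadcast nested loop join" rows from the Arm64 summary above.

def per_row_ns(rate_millions_per_sec: float) -> float:
    """Approximate nanoseconds per row from a throughput in millions of rows per second."""
    return 1e9 / (rate_millions_per_sec * 1e6)

def relative_speed(baseline_best_ms: float, best_ms: float) -> float:
    """Speedup relative to the slower (1.0X) baseline."""
    return baseline_best_ms / best_ms

print(per_row_ns(1.1))               # ~909 ns; the table reports 899.2 ns (the rate is rounded to 1.1)
print(relative_speed(26847, 18857))  # ~1.42, reported as 1.4X
```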
