content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md
Lines changed: 8 additions & 14 deletions
@@ -1,24 +1,20 @@
1
1
---
2
-
title: Run Spark applications on the Microsoft Azure Cobalt 100 processors
3
-
4
-
draft: true
5
-
cascade:
6
-
draft: true
2
+
title: Run Spark applications on Microsoft Azure Cobalt 100 processors
7
3
8
4
minutes_to_complete: 60
9
5
10
6
who_is_this_for: This is an advanced topic that introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm.
11
7
12
8
learning_objectives:
13
-
- Provision an Azure Arm64 virtual machine using Azure console.
14
-
- Learn how to create an Azure Linux 3.0 Docker container.
15
-
- Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container or an Azure Linux 3.0 custom-image based Azure virtual machine.
16
-
- Run a suite of Spark benchmarks to understand and evaluate performance on the Azure Cobalt 100 virtual machine.
9
+
- Provision an Azure Arm64 virtual machine using the Azure console
10
+
- Learn how to create an Azure Linux 3.0 Docker container
11
+
- Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container or an Azure Linux 3.0 custom-image based Azure virtual machine
12
+
- Run a suite of Spark benchmarks to understand and evaluate performance on the Azure Cobalt 100 virtual machine
17
13
18
14
prerequisites:
19
-
- A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6).
20
-
- A machine with [Docker](/install-guides/docker/) installed.
21
-
- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).
15
+
- A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6)
16
+
- A machine with [Docker](/install-guides/docker/) installed
17
+
- Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/)
title: Getting started with Microsoft Azure Cobalt 100, Azure Linux 3.0, and Apache Spark
4
3
weight: 2
5
4
5
+
### FIXED, DO NOT MODIFY
6
6
layout: "learningpathall"
7
7
---
8
8
9
-
## What is the Azure Cobalt 100 processor?
9
+
## Key technologies for running Apache Spark on Azure Cobalt 100
10
+
11
+
This section introduces the key technologies you will use when running Spark applications on Microsoft Azure Cobalt 100 processors. You will learn about the Azure Cobalt 100 Arm-based processor, Azure Linux 3.0, and Apache Spark.
12
+
13
+
## Azure Cobalt 100 processor
14
+
15
+
Azure Cobalt 100 is Microsoft’s first-generation Arm-based processor, designed for cloud-native, scale-out Linux workloads. Based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency. Running at 3.4 GHz, it provides a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
16
+
17
+
Typical workloads include web and application servers, data analytics, open-source databases, and caching systems.
10
18
11
-
Azure’s Cobalt 100 is built on Microsoft's first-generation Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
19
+
To learn more, see the Microsoft blog [Announcing the preview of new Azure virtual machines based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
12
20
13
-
To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
21
+
## Azure Linux 3.0
14
22
15
-
## Introduction to AzureLinux 3.0
23
+
Azure Linux 3.0 is Microsoft’s lightweight Linux distribution optimized for cloud-native workloads on Azure. It is designed for performance, security, and reliability. Azure Linux 3.0 is tailored for containers, microservices, and Kubernetes.
16
24
17
-
Azure Linux 3.0 is Microsoft's in-house, lightweight Linux distribution optimized for running cloud-native workloads on Azure. Designed with performance, security, and reliability in mind, it is fully supported by Microsoft and tailored for containers, microservices, and Kubernetes. With native support for Arm64 (AArch64) architecture, Azure Linux 3.0 enables efficient execution of workloads on energy-efficient Arm-based infrastructure, making it a powerful choice for scalable and cost-effective cloud deployments.
25
+
With native support for Arm64 (AArch64) architecture, Azure Linux 3.0 enables efficient execution of workloads on Arm-based infrastructure, making it a scalable and cost-effective choice for cloud deployments.
18
26
19
-
## Apache Spark
27
+
## Apache Spark
20
28
21
-
Apache Spark is an open-source, distributed computing system designed for fast and general-purpose big data processing.
29
+
Apache Spark is an open-source, distributed computing system for fast, general-purpose big data processing. It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for improved performance.
22
30
23
-
It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for increased performance.
31
+
Spark is widely used for large-scale data analytics, machine learning, and real-time data processing.
24
32
25
-
Spark is widely used for large-scale data analytics, machine learning, and real-time data processing. Learn more from the [Apache Spark official website](https://spark.apache.org/) and its [detailed official documentation](https://spark.apache.org/docs/latest/).
33
+
Learn more at the [Apache Spark official website](https://spark.apache.org/) and in the [official documentation](https://spark.apache.org/docs/latest/).
title: Validate Apache Spark on Azure Cobalt 100 Arm64 VMs
3
3
weight: 6
4
4
5
5
### FIXED, DO NOT MODIFY
6
6
layout: learningpathall
7
7
---
8
8
9
+
## Run a functional test of Apache Spark on Azure Cobalt 100
9
10
10
-
## Functional Validation
11
-
Since Apache Spark is installed successfully on your Arm virtual machine, let's now perform simple baseline testing to validate that Spark runs correctly and gives expected output.
11
+
After installing Apache Spark on your Arm64 virtual machine, you can perform simple baseline testing to validate that Spark runs correctly and produces the expected output.
12
12
13
-
Using a file editor of your choice, create a file named `test_spark.py`, and add the below content to it:
13
+
## Create a test Spark application
14
+
15
+
Use a text editor of your choice to create a file named `test_spark.py` with the following content:
25/07/22 05:16:00 INFO CodeGenerator: Code generated in 10.545923 ms
@@ -35,7 +44,9 @@ You should see an output similar to:
35
44
| 2|Azure|
36
45
+---+-----+
37
46
```
38
-
Output summary:
39
47
40
-
- The output shows Spark successfully generated code **(10.5ms)** and executed a simple DataFrame operation.
41
-
- Displaying the test data **[1, "ARM64"]** and **[2, "Azure"]** before cleanly shutting down **(exitCode 0)**. This confirms a working Spark deployment on Arm64.
48
+
## Output summary
49
+
50
+
- Spark successfully generated code (10.5 ms) and executed a simple DataFrame operation.
51
+
- The test data **[1, "ARM64"]** and **[2, "Azure"]** was displayed before Spark shut down cleanly (exitCode 0).
52
+
- This confirms a working Spark deployment on Arm64.
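To make this check repeatable, the table printed by the test can also be verified programmatically. The following is a minimal sketch, not part of the learning path itself: it parses text in the layout printed by `DataFrame.show()` and confirms that the two expected rows are present. The function name `parse_show_rows` is hypothetical, and the header names `id` and `name` in the sample are assumptions, because the header row is not shown in the excerpt above.

```python
def parse_show_rows(text):
    """Parse a two-column table printed by DataFrame.show() into (int, str) tuples.

    Border lines (+---+), log messages, and the header row are skipped:
    a line counts as a data row only if its first cell is an integer.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border line or log message
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) == 2 and cells[0].isdigit():
            rows.append((int(cells[0]), cells[1]))
    return rows


# Sample output; the header names "id" and "name" are hypothetical.
sample_output = """
+---+-----+
| id| name|
+---+-----+
|  1|ARM64|
|  2|Azure|
+---+-----+
"""

rows = parse_show_rows(sample_output)
assert rows == [(1, "ARM64"), (2, "Azure")], rows
print("Spark output check passed:", rows)
```

In practice you might redirect the output of the test run to a file and pass its contents to the function instead of the inline sample.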
content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md
Lines changed: 47 additions & 36 deletions
@@ -1,40 +1,51 @@
1
1
---
2
-
title: Benchmark Spark
2
+
title: Benchmark Apache Spark
3
3
weight: 7
4
4
5
5
### FIXED, DO NOT MODIFY
6
6
layout: learningpathall
7
7
---
8
8
9
-
## Apache Spark Internal Benchmarking
10
-
Apache Spark includes internal micro-benchmarks to evaluate the performance of core components like SQL execution, aggregation, joins, and data source reads. These benchmarks are helpful for comparing platforms such as x86_64 vs Arm64.
11
-
Below are the steps to run Spark’s built-in SQL benchmarks using the SBT-based framework.
9
+
## Benchmark Apache Spark on Azure Cobalt 100 Arm-based instances and x86_64 instances
12
10
13
-
1. Clone the Apache Spark source code
14
-
```console
15
-
git clone https://github.com/apache/spark.git
16
-
```
17
-
This downloads the full Spark source including internal test suites and the benchmarking tools.
11
+
Apache Spark includes internal micro-benchmarks to evaluate the performance of core components such as SQL execution, aggregation, joins, and data source reads. These benchmarks are useful for comparing performance on Arm64 and x86_64 platforms in Azure.
18
12
19
-
2. Checkout the desired Spark version
20
-
```console
21
-
cd spark/ && git checkout v4.0.0
22
-
```
23
-
Switch to the stable Spark 4.0.0 release, which supports the latest internal benchmarking APIs.
13
+
This section shows you how to run Spark’s built-in SQL benchmarks using the SBT-based framework.
24
14
25
-
3. Build Spark with benchmarking profile enabled
26
-
```console
27
-
./build/sbt -Pbenchmarks clean package
28
-
```
29
-
This compiles Spark and its dependencies, enabling the benchmarks build profile for performance testing.
This executes the `JoinBenchmark`, which measures the performance of various SQL join operations (e.g., SortMergeJoin, BroadcastHashJoin) under different query plans. It helps evaluate how Spark SQL optimizes and executes join strategies, especially with and without WholeStageCodegen, a technique that compiles entire query stages into efficient bytecode for faster execution.
17
+
1. Clone the Apache Spark source code
18
+
19
+
```console
20
+
git clone https://github.com/apache/spark.git
21
+
```
22
+
This downloads the full Spark source code, including test suites and benchmarking tools.
23
+
24
+
2. Check out the desired Spark version
25
+
26
+
```console
27
+
cd spark/ && git checkout v4.0.0
28
+
```
29
+
Switch to the stable Spark 4.0.0 release, which supports the latest benchmarking APIs.
30
+
31
+
3. Build Spark with the benchmarking profile
32
+
33
+
```console
34
+
./build/sbt -Pbenchmarks clean package
35
+
```
36
+
This compiles Spark and its dependencies, enabling the benchmarking build profile.
This runs the `JoinBenchmark`, which measures the performance of SQL join operations such as `SortMergeJoin` and `BroadcastHashJoin`. It evaluates how Spark SQL optimizes join strategies, especially with and without WholeStageCodegen.
44
+
45
+
## Example Apache Spark benchmark output
46
+
47
+
You should see output similar to the following:
36
48
37
-
The output should look similar to:
38
49
```output
39
50
[info] Running benchmark: Join w long
40
51
[info] Running case: Join w long wholestage off
@@ -170,7 +181,7 @@ The output should look similar to:
[success] Total time: 1644 s (27:24), completed Jul 25, 2025, 6:27:46 AM
172
183
```
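Because the full log is long, it can help to list which benchmarks and cases actually ran. The sketch below is an illustration under stated assumptions, not a tool shipped with Spark: it scans saved log text for the `Running benchmark:` and `Running case:` lines shown in the output above and groups the cases by benchmark. The function name `collect_cases` is hypothetical, and the `wholestage on` line in the sample is assumed rather than copied from the excerpt.

```python
import re

# Patterns for the two kinds of progress lines printed in the sbt log.
BENCHMARK_RE = re.compile(r"^\[info\]\s+Running benchmark:\s*(.+?)\s*$")
CASE_RE = re.compile(r"^\[info\]\s+Running case:\s*(.+?)\s*$")


def collect_cases(log_text):
    """Map each benchmark name to the list of cases that ran under it."""
    cases = {}
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        match = BENCHMARK_RE.match(line)
        if match:
            current = match.group(1)
            cases.setdefault(current, [])
            continue
        match = CASE_RE.match(line)
        if match and current is not None:
            cases[current].append(match.group(1))
    return cases


# Sample log; the "wholestage on" line is an assumed continuation.
sample_log = """\
[info] Running benchmark: Join w long
[info]   Running case: Join w long wholestage off
[info]   Running case: Join w long wholestage on
"""

for benchmark, case_names in collect_cases(sample_log).items():
    print(benchmark)
    for name in case_names:
        print("  -", name)
```

Lines that match neither pattern, such as timing tables and the final `[success]` line, are ignored.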
173
-
###Benchmark Results Table Explained:
184
+
## Benchmark results table explained
174
185
175
186
- **Best Time (ms):** Fastest execution time observed (in milliseconds).
176
187
- **Avg Time (ms):** Average time across all iterations.
@@ -180,11 +191,11 @@ The output should look similar to:
180
191
- **Relative Speed comparison:** baseline (1.0X) is the slower version.
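The columns in the results table are related to one another. As a rough sketch, and not Spark's own implementation, the following Python shows how a rate, a per-row time, and a relative speed can be derived from a best time and a row count. It assumes the relative column compares best times against the baseline case. The function names are hypothetical, and the row count and timings are made-up illustrative values, not measured results.

```python
def derived_metrics(best_time_ms, num_rows):
    """Return (rate in million rows per second, time per row in ns)."""
    best_time_s = best_time_ms / 1000.0
    rate_m_per_s = num_rows / best_time_s / 1e6
    per_row_ns = best_time_ms * 1e6 / num_rows
    return rate_m_per_s, per_row_ns


def relative_speed(baseline_best_ms, case_best_ms):
    """Speed of a case relative to the baseline, where the baseline is 1.0X."""
    return baseline_best_ms / case_best_ms


# Illustrative values only; these are NOT results from the benchmark run.
NUM_ROWS = 20_000_000
baseline_ms = 2000.0  # hypothetical case with wholestage off
codegen_ms = 1000.0   # hypothetical case with wholestage on

rate, per_row = derived_metrics(codegen_ms, NUM_ROWS)
print(f"Rate: {rate:.1f} M/s, Per Row: {per_row:.1f} ns")
print(f"Relative: {relative_speed(baseline_ms, codegen_ms):.1f}X")
```

With these illustrative inputs, a case that finishes in half the baseline time is reported as 2.0X.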
181
192
182
193
{{% notice Note %}}
183
-
Benchmarking was performed in both an Azure Linux 3.0 Docker container and an Azure Linux 3.0 virtual machine. The benchmark results were found to be comparable.
194
+
Benchmark results on Azure Linux 3.0 were consistent across both Docker containers and virtual machines.
184
195
{{% /notice %}}
185
196
186
197
187
-
###Benchmark summary on Arm64:
198
+
## Benchmark summary on Arm64
188
199
For easier comparison, the following table summarizes the benchmark results collected on an Arm64 `D4ps_v6` Azure virtual machine created from a custom Azure Linux 3.0 image using the AArch64 ISO.
189
200
| Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
The following table summarizes the benchmark results collected on an `x86_64` `D4s_v4` Azure virtual machine using the Azure Linux 3.0 image published by Ntegral Inc.
216
227
| Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |