diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md
index 5ae561b76..740f5557d 100644
--- a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/_index.md
@@ -1,24 +1,20 @@
 ---
-title: Run Spark applications on the Microsoft Azure Cobalt 100 processors
-
-draft: true
-cascade:
-  draft: true
+title: Run Spark applications on Microsoft Azure Cobalt 100 processors

 minutes_to_complete: 60

 who_is_this_for: This is an advanced topic that introduces Spark deployment on Microsoft Azure Cobalt 100 (Arm-based) virtual machines. It is designed for developers migrating Spark applications from x86_64 to Arm.

 learning_objectives:
-  - Provision an Azure Arm64 virtual machine using Azure console.
-  - Learn how to create an Azure Linux 3.0 Docker container.
-  - Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container or an Azure Linux 3.0 custom-image based Azure virtual machine.
-  - Run a suite of Spark benchmarks to understand and evaluate performance on the Azure Cobalt 100 virtual machine.
+  - Provision an Azure Arm64 virtual machine using Azure console
+  - Learn how to create an Azure Linux 3.0 Docker container
+  - Deploy a Spark application inside an Azure Linux 3.0 Arm64-based Docker container or an Azure Linux 3.0 custom-image based Azure virtual machine
+  - Run a suite of Spark benchmarks to understand and evaluate performance on the Azure Cobalt 100 virtual machine

 prerequisites:
-  - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6).
-  - A machine with [Docker](/install-guides/docker/) installed.
-  - Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/).
+  - A [Microsoft Azure](https://azure.microsoft.com/) account with access to Cobalt 100 based instances (Dpsv6)
+  - A machine with [Docker](/install-guides/docker/) installed
+  - Familiarity with distributed computing concepts and the [Apache Spark architecture](https://spark.apache.org/docs/latest/)

 author: Pareena Verma

@@ -35,7 +31,6 @@ tools_software_languages:
   - Python
   - Docker

-
 operatingsystems:
   - Linux

@@ -61,7 +56,6 @@ further_reading:
     link: https://hadoop.apache.org/
     type: website

-
 ### FIXED, DO NOT MODIFY
 # ================================================================================
 weight: 1 # _index.md always has weight of 1 to order correctly
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md
index 788135913..d2d1e37be 100644
--- a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/background.md
@@ -1,25 +1,33 @@
 ---
-title: "Overview"
-
+title: Getting started with Microsoft Azure Cobalt 100, Azure Linux 3.0, and Apache Spark

 weight: 2

+### FIXED, DO NOT MODIFY
 layout: "learningpathall"
 ---

-## What is the Azure Cobalt 100 processor?
+## Key technologies for running Apache Spark on Azure Cobalt 100
+
+This section introduces the key technologies you will use when running Spark applications on Microsoft Azure Cobalt 100 processors. You will learn about the Azure Cobalt 100 Arm-based processor, Azure Linux 3.0, and Apache Spark.
+
+## Azure Cobalt 100 processor
+
+Azure Cobalt 100 is Microsoft’s first-generation Arm-based processor, designed for cloud-native, scale-out Linux workloads. Based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency. Running at 3.4 GHz, it provides a dedicated physical core for each vCPU, ensuring consistent and predictable performance.
+
+Typical workloads include web and application servers, data analytics, open-source databases, and caching systems.

-Azure’s Cobalt 100 is built on Microsoft's first-generation Arm-based processor: the Cobalt 100. Designed entirely by Microsoft and based on Arm’s Neoverse-N2 architecture, this 64-bit CPU delivers improved performance and energy efficiency across a broad spectrum of cloud-native, scale-out Linux workloads. These include web and application servers, data analytics, open-source databases, caching systems, and more. Running at 3.4 GHz, the Cobalt 100 processor allocates a dedicated physical core for each vCPU, ensuring consistent and predictable performance.

+To learn more, see the Microsoft blog [Announcing the preview of new Azure virtual machines based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).
-To learn more about Cobalt 100, refer to the blog [Announcing the preview of new Azure virtual machine based on the Azure Cobalt 100 processor](https://techcommunity.microsoft.com/blog/azurecompute/announcing-the-preview-of-new-azure-vms-based-on-the-azure-cobalt-100-processor/4146353).

+## Azure Linux 3.0
-## Introduction to Azure Linux 3.0

+Azure Linux 3.0 is Microsoft’s lightweight Linux distribution optimized for cloud-native workloads on Azure. It is designed for performance, security, and reliability. Azure Linux 3.0 is tailored for containers, microservices, and Kubernetes.
-Azure Linux 3.0 is Microsoft's in-house, lightweight Linux distribution optimized for running cloud-native workloads on Azure. Designed with performance, security, and reliability in mind, it is fully supported by Microsoft and tailored for containers, microservices, and Kubernetes. With native support for Arm64 (AArch64) architecture, Azure Linux 3.0 enables efficient execution of workloads on energy-efficient Arm-based infrastructure, making it a powerful choice for scalable and cost-effective cloud deployments.

+With native support for Arm64 (AArch64) architecture, Azure Linux 3.0 enables efficient execution of workloads on Arm-based infrastructure, making it a scalable and cost-effective choice for cloud deployments.

-## Apache Spark 
+## Apache Spark

-Apache Spark is an open-source, distributed computing system designed for fast and general-purpose big data processing.
+Apache Spark is an open-source, distributed computing system for fast, general-purpose big data processing. It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for improved performance.

-It provides high-level APIs in Java, Scala, Python, and R, and supports in-memory computation for increased performance.
+Spark is widely used for large-scale data analytics, machine learning, and real-time data processing.

-Spark is widely used for large-scale data analytics, machine learning, and real-time data processing. Learn more from the [Apache Spark official website](https://spark.apache.org/) and its [detailed official documentation](https://spark.apache.org/docs/latest/).
+Learn more at the [Apache Spark official website](https://spark.apache.org/) and in the [official documentation](https://spark.apache.org/docs/latest/).
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md
index 9fb6633fd..17c19de88 100644
--- a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/baseline.md
@@ -1,16 +1,18 @@
 ---
-title: Functional Validation
+title: Validate Apache Spark on Azure Cobalt 100 Arm64 VMs

 weight: 6

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

+## Run a functional test of Apache Spark on Azure Cobalt 100

-## Functional Validation
-Since Apache Spark is installed successfully on your Arm virtual machine, let's now perform simple baseline testing to validate that Spark runs correctly and gives expected output.
+After installing Apache Spark on your Arm64 virtual machine, you can perform simple baseline testing to validate that Spark runs correctly and produces the expected output.

-Using a file editor of your choice, create a file named `test_spark.py`, and add the below content to it:
+## Create a test Spark application
+
+Use a text editor of your choice to create a file named `test_spark.py` with the following content:

 ```python
 from pyspark.sql import SparkSession
@@ -19,11 +21,18 @@ df = spark.createDataFrame([(1, "ARM64"), (2, "Azure")], ["id", "name"])
 df.show()
 spark.stop()
 ```
-Execute with:
+
+## Run the Spark application
+
+Execute the test script with:
+
 ```console
 spark-submit test_spark.py
 ```
-You should see an output similar to:
+
+## Example output
+
+You should see output similar to:

 ```output
 25/07/22 05:16:00 INFO CodeGenerator: Code generated in 10.545923 ms
@@ -35,7 +44,9 @@ You should see an output similar to:
 | 2|Azure|
 +---+-----+
 ```
-Output summary:
-- The output shows Spark successfully generated code **(10.5ms)** and executed a simple DataFrame operation.
-- Displaying the test data **[1, "ARM64"]** and **[2, "Azure"]** before cleanly shutting down **(exitCode 0)**. This confirms a working Spark deployment on Arm64.
+## Output summary
+
+- Spark successfully generated code (10.5 ms) and executed a simple DataFrame operation.
+- The test data **[1, "ARM64"]** and **[2, "Azure"]** were displayed before cleanly shutting down (exitCode 0).
+- This confirms a working Spark deployment on Arm64.
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md
index c7b97d5b7..563f91239 100644
--- a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/benchmarking.md
@@ -1,40 +1,51 @@
 ---
-title: Benchmark Spark
+title: Benchmark Apache Spark

 weight: 7

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Apache Spark Internal Benchmarking
-Apache Spark includes internal micro-benchmarks to evaluate the performance of core components like SQL execution, aggregation, joins, and data source reads. These benchmarks are helpful for comparing platforms such as x86_64 vs Arm64.
-Below are the steps to run Spark’s built-in SQL benchmarks using the SBT-based framework.
+## Benchmark Apache Spark on Azure Cobalt 100 Arm-based instances and x86_64 instances

-1. Clone the Apache Spark source code
-```console
-git clone https://github.com/apache/spark.git
-```
-This downloads the full Spark source including internal test suites and the benchmarking tools.
+Apache Spark includes internal micro-benchmarks to evaluate the performance of core components such as SQL execution, aggregation, joins, and data source reads. These benchmarks are useful for comparing performance on Arm64 and x86_64 platforms in Azure.

-2. Checkout the desired Spark version
-```console
-cd spark/ && git checkout v4.0.0
-```
-Switch to the stable Spark 4.0.0 release, which supports the latest internal benchmarking APIs.
+This section shows you how to run Spark’s built-in SQL benchmarks using the SBT-based framework.

-3. Build Spark with benchmarking profile enabled
-```console
-./build/sbt -Pbenchmarks clean package
-```
-This compiles Spark and its dependencies, enabling the benchmarks build profile for performance testing.
+## Steps to run Spark benchmarks

-4. Run a built-in benchmark suite
-```console
-./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"
-```
-This executes the `JoinBenchmark`, which measures the performance of various SQL join operations (e.g., SortMergeJoin, BroadcastHashJoin) under different query plans. It helps evaluate how Spark SQL optimizes and executes join strategies, especially with and without WholeStageCodegen, a technique that compiles entire query stages into efficient bytecode for faster execution.
+1. Clone the Apache Spark source code
+
+    ```console
+    git clone https://github.com/apache/spark.git
+    ```
+    This downloads the full Spark source code, including test suites and benchmarking tools.
+
+2. Check out the desired Spark version
+
+    ```console
+    cd spark/ && git checkout v4.0.0
+    ```
+    Switch to the stable Spark 4.0.0 release, which supports the latest benchmarking APIs.
+
+3. Build Spark with the benchmarking profile
+
+    ```console
+    ./build/sbt -Pbenchmarks clean package
+    ```
+    This compiles Spark and its dependencies, enabling the benchmarking build profile.
+
+4. Run a built-in benchmark suite
+
+    ```console
+    ./build/sbt -Pbenchmarks "sql/test:runMain org.apache.spark.sql.execution.benchmark.JoinBenchmark"
+    ```
+    This runs the `JoinBenchmark`, which measures the performance of SQL join operations such as `SortMergeJoin` and `BroadcastHashJoin`. It evaluates how Spark SQL optimizes join strategies, especially with and without WholeStageCodegen.
+
+## Example Apache Spark benchmark output
+
+You should see output similar to the following:
-The output should look similar to:

 ```output
 [info] Running benchmark: Join w long
 [info] Running case: Join w long wholestage off
@@ -170,7 +181,7 @@ The output should look similar to:
 [info] broadcast nested loop join wholestage on 18857 18928 84 1.1 899.2 1.4X
 [success] Total time: 1644 s (27:24), completed Jul 25, 2025, 6:27:46 AM
 ```
-### Benchmark Results Table Explained:
+## Benchmark results table explained

 - **Best Time (ms):** Fastest execution time observed (in milliseconds).
 - **Avg Time (ms):** Average time across all iterations.
 - **Stdev (ms):** Standard deviation of timings (lower means more consistent).
 - **Rate (M/s):** Millions of rows processed per second.
 - **Per Row (ns):** Average time to process a single row (in nanoseconds).
 - **Relative Speed comparison:** baseline (1.0X) is the slower version.

 {{% notice Note %}}
-Benchmarking was performed in both an Azure Linux 3.0 Docker container and an Azure Linux 3.0 virtual machine. The benchmark results were found to be comparable.
+Benchmark results on Azure Linux 3.0 were consistent across both Docker containers and virtual machines.
 {{% /notice %}}
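+
+In the summaries below, the Wholestage column refers to Spark's whole-stage code generation, which is controlled by the `spark.sql.codegen.wholeStage` setting and is enabled by default. If you want to see its effect outside the SBT harness, a minimal sketch using the Spark SQL CLI from your Spark installation is shown here; the query is an illustrative join, not the exact benchmark workload:
+
+```console
+# Run the same illustrative join with whole-stage codegen disabled, then enabled
+spark-sql --conf spark.sql.codegen.wholeStage=false \
+  -e "SELECT COUNT(*) FROM range(5000000) a JOIN range(5000000) b ON a.id = b.id"
+
+spark-sql --conf spark.sql.codegen.wholeStage=true \
+  -e "SELECT COUNT(*) FROM range(5000000) a JOIN range(5000000) b ON a.id = b.id"
+```
+
+Timings from an ad-hoc query like this will not match the table values, but the difference between the two runs illustrates what the Relative column measures.
+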
-### Benchmark summary on Arm64:
+## Benchmark summary on Arm64

 For easier comparison, shown here is a summary of benchmark results collected on an Arm64 `D4ps_v6` Azure virtual machine created from a custom Azure Linux 3.0 image using the AArch64 ISO.

 | Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
 |----------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
@@ -211,7 +222,7 @@ For easier comparison, shown here is a summary of benchmark results collected on
 | Broadcast nested loop join | Off | 26847 | 26870 | 32 | 0.8 | 1280.2 | 1.0X |
 | | On | 18857 | 18928 | 84 | 1.1 | 899.2 | 1.4X |

-### Benchmark summary on x86_64:
+## Benchmark summary on x86_64

 Shown here is a summary of the benchmark results collected on an `x86_64` `D4s_v4` Azure virtual machine using the Azure Linux 3.0 image published by Ntegral Inc.

 | Benchmark | Wholestage | Best Time (ms) | Avg Time (ms) | Stdev (ms) | Rate (M/s) | Per Row (ns) | Relative |
 |------------------------------------------|------------|----------------|----------------|------------|-------------|----------------|----------|
@@ -239,13 +250,13 @@ Shown here is a summary of the benchmark results collected on an `x86_64` `D4s_v
 | | On | 31254 | 31346 | 78 | 0.7 | 1490.3 | 1.2X |

-### Benchmark comparison insights
+## Benchmark comparison insights

-When you compare the benchmark results you will notice that on the Azure Linux Arm64 virtual machine:
+When comparing the results on Arm64 vs x86_64 virtual machines:

-- Whole-stage codegen improves performance by up to 2.8× on complex joins (e.g., with long columns).
-- Simple joins (e.g., on integers) show negligible performance gain, remains comparable to performance on `x86_64`.
-- Broadcast and shuffle-based joins benefit with 1.4× to 1.5× improvements.
-- Overall enabling whole-stage codegen consistently improves performance across most join types.
+- Whole-stage codegen improves performance by up to 2.8× on complex joins
+- Simple joins, such as integer joins, show negligible performance differences
+- Broadcast and shuffle-based joins achieve 1.4× to 1.5× improvements
+- Enabling whole-stage codegen consistently improves performance across most join types

-You have successfully learnt how to deploy Apache Spark on an Azure Cobalt 100 virtual machine and measure the performance uplift.
+You have now benchmarked Apache Spark on an Azure Cobalt 100 Arm64 virtual machine and compared results with x86_64.
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md
index 4721928d7..bd45fa59b 100644
--- a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/container-setup.md
@@ -1,34 +1,45 @@
 ---
-title: Setup Azure Linux 3.0 Environment
+title: Set up an Azure Linux 3.0 environment

 weight: 4

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

+## Set up an Azure Linux 3.0 environment

-You can choose between deploying your Spark workload either in an Azure Linux 3.0 Docker container or on a virtual machine created from a custom Azure Linux 3.0 image.
+You can deploy your Spark workload either in an Azure Linux 3.0 Docker container or on a virtual machine created from a custom Azure Linux 3.0 image.
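+
+Whichever option you choose, you can confirm that the environment is Arm64 before continuing. For example, from a shell inside the container or virtual machine:
+
+```console
+uname -m
+```
+
+On a Cobalt 100-based instance this prints `aarch64`.
+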
-### Working inside Azure Linux 3.0 Docker container
-The Azure Linux Container Host is an operating system image that's optimized for running container workloads on Azure Kubernetes Service (AKS). Microsoft maintains the Azure Linux Container Host and based it on CBL-Mariner, an open-source Linux distribution created by Microsoft. To know more about Azure Linux 3.0, refer to [What is Azure Linux Container Host for AKS](https://learn.microsoft.com/en-us/azure/azure-linux/intro-azure-linux).
-
-Azure Linux 3.0 offers support for AArch64. However, the standalone virtual machine image for Azure Linux 3.0 or CBL Mariner 3.0 is not available for Arm. To use the default software stack provided by the Microsoft, you can run a docker container with Azure Linux 3.0 as a base image, and run the Spark application inside the container.
+## Work inside an Azure Linux 3.0 Docker container

-#### Option 1: Run an Azure Linux 3.0 Docker Container
-The [Microsoft Artifact Registry](https://mcr.microsoft.com/en-us/artifact/mar/azurelinux/base/core/about) offers updated docker image for the Azure Linux 3.0.
+The Azure Linux Container Host is an operating system image optimized for running container workloads on Azure Kubernetes Service (AKS). Microsoft maintains the Azure Linux Container Host, which is based on CBL-Mariner, an open-source Linux distribution created by Microsoft.

-To run a docker container with Azure Linux 3.0, install [docker](/install-guides/docker/docker-engine/), and then run the command:
+To learn more, see [What is Azure Linux Container Host for AKS?](https://learn.microsoft.com/en-us/azure/azure-linux/intro-azure-linux)
+
+Azure Linux 3.0 supports AArch64. However, a standalone virtual machine image for Azure Linux 3.0 or CBL Mariner 3.0 is not yet available for Arm. To use the default Microsoft software stack, you can run a Docker container with Azure Linux 3.0 as the base image and run your Spark application inside the container.
+
+### Option 1: Run an Azure Linux 3.0 Docker container
+
+The [Microsoft Artifact Registry](https://mcr.microsoft.com/en-us/artifact/mar/azurelinux/base/core/about) offers updated Docker images for Azure Linux 3.0.
+
+To run a Docker container with Azure Linux 3.0, install [Docker](/install-guides/docker/docker-engine/) and run:

 ```console
 sudo docker run -it --rm mcr.microsoft.com/azurelinux/base/core:3.0
 ```
-The default container starts up with a bash shell. `tdnf` and `dnf` are the default package managers available to use on the container.
-### Option 2: Create a virtual machine instance with Azure Linux 3.0 OS image
-As of now, the Azure Marketplace offers official virtual machine images of Azure Linux 3.0 only for `x86_64` based architectures, published by Ntegral Inc. While native Arm64 (AArch64) images are not yet officially available, you can create your own custom Azure Linux 3.0 virtual machine image for AArch64 using the [AArch64 ISO for Azure Linux 3.0](https://github.com/microsoft/azurelinux#iso).
+The default container starts with a Bash shell. Both `tdnf` and `dnf` are available as package managers inside the container.
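+
+For example, you can use `tdnf` inside the container to install a few basic utilities. The package names below are illustrative only; the packages Spark actually needs are installed in a later section:
+
+```console
+tdnf install -y wget tar gzip
+```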
+
+### Option 2: Create a virtual machine with an Azure Linux 3.0 image
+
+Currently, the Azure Marketplace offers official virtual machine images of Azure Linux 3.0 only for `x86_64` architectures, published by Ntegral Inc. While native Arm64 (AArch64) images are not yet available, you can create your own custom Azure Linux 3.0 virtual machine image for AArch64 using the [AArch64 ISO for Azure Linux 3.0](https://github.com/microsoft/azurelinux#iso).

+For detailed steps, see [Create an Azure Linux 3.0 virtual machine with Cobalt 100 processors](/learning-paths/servers-and-cloud-computing/azure-vm).
-Refer to [Create an Azure Linux 3.0 virtual machine with Cobalt 100 processors](/learning-paths/servers-and-cloud-computing/azure-vm) for the detailed steps.

+## Next steps
-Whether you choose to use an Azure Linux 3.0 Docker container, or a virtual machine created from a custom Azure Linux 3.0 image, the Spark deployment and benchmarking steps in the following sections will remain the same.

+{{% notice Note %}}
+Whether you use an Azure Linux 3.0 Docker container or a virtual machine created from a custom image, the Spark deployment and benchmarking steps in the following sections remain the same.
+{{% /notice %}}

-Once the setup is complete, you can proceed with installing and running Spark in the next section.
+Once the setup is complete, continue to the next section to install and run Spark.
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/create-instance.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/create-instance.md
index 6b7829d4e..a8dc18c4f 100644
--- a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/create-instance.md
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/create-instance.md
@@ -1,30 +1,31 @@
 ---
-title: Create an Arm based cloud virtual machine using Azure Cobalt 100
+title: Create an Azure Cobalt 100 Arm64 virtual machine

 weight: 3

 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---

-## Introduction
+## Create an Azure Cobalt 100 Arm64 VM using the Azure portal

-There are several ways you can create an Azure Cobalt 100 Arm-basedvirtual machine : the Microsoft Azure console, the Azure CLI tool, or using your choice of IaC (Infrastructure as Code). In this Learning Path you will use the Azure console to create a virtual machine with Arm-based Azure Cobalt 100 Processor.
+You can create an Azure Cobalt 100 Arm64 virtual machine in several ways, including the Azure portal, the Azure CLI, or an Infrastructure as Code (IaC) tool.
+In this Learning Path, you’ll use the Azure portal to create a VM with the Cobalt 100 processor, following a process similar to creating any other virtual machine in Azure.

-#### Create an Arm-based Azure Virtual Machine
+## Step-by-step: create the virtual machine

-Creating a virtual machine based on Azure Cobalt 100 is no different from creating any other virtual machine in Azure. To create an Azure virtual machine, launch the Azure portal and navigate to Virtual Machines.
+1. In the Azure portal, go to **Virtual Machines** and select **Create**.
+2. Enter details such as **Name** and **Region**.
+3. Choose the image for your virtual machine (for example, Ubuntu 24.04) and select **Arm64** as the architecture.
+4. In the **Size** field, select **See all sizes**, then choose the D-Series v6 family of virtual machines.
+5. Select **D4ps_v6** from the list and create the virtual machine.

-Select “Create”, and fill in the details such as Name, and Region. Choose the image for your virtual machine (for example – Ubuntu 24.04) and select “Arm64” as the virtual machine architecture.
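+
+If you prefer the Azure CLI mentioned above, an equivalent VM can be created from the command line. The sketch below is illustrative only: the resource group, VM name, and admin username are placeholders, and you should confirm the Arm64 Ubuntu 24.04 image URN for your region (for example with `az vm image list`):
+
+```console
+az vm create \
+  --resource-group <your-resource-group> \
+  --name cobalt-spark-vm \
+  --image Canonical:ubuntu-24_04-lts:server-arm64:latest \
+  --size Standard_D4ps_v6 \
+  --admin-username azureuser \
+  --generate-ssh-keys
+```
+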
+![Azure portal VM creation — Azure Cobalt 100 Arm64 virtual machine (D4ps_v6) alt-text#center](./instance-new.png "Figure 1: Create an Azure Cobalt 100 Arm64 VM in the Azure portal")
-In the “Size” field, click on “See all sizes” and select the D-Series v6 family of Virtual machine. Select “D4ps_v6” from the list and create the virtual machine.
-
-![Instance Screenshot](./instance-new.png)

-The virtual machine should be ready and running; You can then SSH into the virtual machine using the generated PEM key, along with the Public IP address of the running instance.
+Once the Arm64 virtual machine is running, you can SSH into it using the generated PEM key and the public IP address of the instance.

 {{% notice Note %}}
-To learn more about Arm-based virtual machine in Azure, refer to “Getting Started with Microsoft Azure” in [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure).
+To learn more about Arm-based virtual machines in Azure, see the Learning Path [Get started with Arm-based cloud instances](/learning-paths/servers-and-cloud-computing/csp/azure).
 {{% /notice %}}
diff --git a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/deploy.md b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/deploy.md
index f7cbe6502..685b1b4d1 100644
--- a/content/learning-paths/servers-and-cloud-computing/spark-on-azure/deploy.md
+++ b/content/learning-paths/servers-and-cloud-computing/spark-on-azure/deploy.md
@@ -6,11 +6,12 @@ weight: 5
 layout: learningpathall
 ---

-## Install Apache Spark
+## Install Apache Spark on Azure Cobalt 100

-Within your running docker container image or your custom Azure Linux VM, follow the instructions to install Spark. Start by installing Java, Python, and other essential tools:
+Within your running Docker container or your custom Azure Linux virtual machine, follow these instructions to install Spark. Start by installing Java, Python, and other essential tools:

-### Install Required Packages
+
+## Install Java, Python, and tools for Apache Spark

 ```console
 sudo tdnf update -y
@@ -38,7 +39,7 @@ The output will look like:
 Python 3.12.9
 ```

-### Install Apache Spark on Arm
+## Download and install Apache Spark on Azure Cobalt 100

 You can now download and configure Apache Spark on your Arm-based machine:

 ```console
 wget https://downloads.apache.org/spark/spark-3.5.6/spark-3.5.6-bin-hadoop3.tgz
 tar -xzf spark-3.5.6-bin-hadoop3.tgz
 sudo mv spark-3.5.6-bin-hadoop3 /opt/spark
 ```
-### Set Environment Variables
+## Configure environment variables for Apache Spark

 Add this line to ~/.bashrc or ~/.zshrc to make the change persistent across terminal sessions.
-```cosole
+```console
@@ -61,7 +62,7 @@ Apply changes immediately in your running shell:
 source ~/.bashrc
 ```

-### Verify Spark Installation
+## Verify Apache Spark installation on Azure Cobalt 100

 ```console
 spark-submit --version