diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/01.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/01.png
new file mode 100644
index 0000000000..98a272f84b
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/01.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/02.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/02.png
new file mode 100644
index 0000000000..d0b8df7cb0
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/02.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/03.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/03.png
new file mode 100644
index 0000000000..80e41973f2
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/03.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/04.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/04.png
new file mode 100644
index 0000000000..d098da4e1a
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/04.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/05.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/05.png
new file mode 100644
index 0000000000..8fa7609f69
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/05.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/06.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/06.png
new file mode 100644
index 0000000000..a78e5ee6f7
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/06.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/07.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/07.png
new file mode 100644
index 0000000000..5993f29b22
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/07.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/08.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/08.png
new file mode 100644
index 0000000000..a01e883efc
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/08.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/09.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/09.png
new file mode 100644
index 0000000000..64d714c262
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/09.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/10.png b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/10.png
new file mode 100644
index 0000000000..571783c51e
Binary files /dev/null and b/content/learning-paths/mobile-graphics-and-gaming/android_halide/Figures/10.png differ
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
new file mode 100644
index 0000000000..71c32dd2e8
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_index.md
@@ -0,0 +1,52 @@
+---
+title: "Halide Essentials: From Basics to Android Integration"
+minutes_to_complete: 180
+
+who_is_this_for: This is an introductory topic for software developers interested in learning how to use Halide for image processing.
+
+learning_objectives:
+ - Understand foundational concepts of Halide and set up your development environment.
+ - Create a basic real-time image processing pipeline using Halide.
+ - Optimize image processing workflows by applying operation fusion in Halide.
+ - Integrate Halide pipelines into Android applications developed with Kotlin.
+
+prerequisites:
+ - Basic C++ knowledge
+ - Basic programming knowledge
+ - Android Studio with Android Emulator
+
+author: Dawid Borycki
+
+### Tags
+skilllevels: Introductory
+subjects: Performance and Architecture
+armips:
+ - Cortex-A
+ - Cortex-X
+operatingsystems:
+ - Android
+tools_software_languages:
+ - Android Studio
+ - Coding
+
+further_reading:
+ - resource:
+ title: Halide 19.0.0
+ link: https://halide-lang.org/docs/index.html
+ type: website
+ - resource:
+ title: Halide GitHub
+ link: https://github.com/halide/Halide
+ type: repository
+ - resource:
+ title: Halide Tutorials
+ link: https://halide-lang.org/tutorials/
+ type: website
+
+
+### FIXED, DO NOT MODIFY
+# ================================================================================
+weight: 1 # _index.md always has weight of 1 to order correctly
+layout: "learningpathall" # All files under learning paths have this same wrapper
+learning_path_main_page: "yes" # This should be surfaced when looking for related content. Only set for _index.md of learning path content.
+---
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/_next-steps.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_next-steps.md
new file mode 100644
index 0000000000..c3db0de5a2
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/_next-steps.md
@@ -0,0 +1,8 @@
+---
+# ================================================================================
+# FIXED, DO NOT MODIFY THIS FILE
+# ================================================================================
+weight: 21 # Set to always be larger than the content in this path to be at the end of the navigation.
+title: "Next Steps" # Always the same, html page title.
+layout: "learningpathall" # All files under learning paths have this same wrapper for Hugo processing.
+---
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md
new file mode 100644
index 0000000000..5e78c19811
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/android.md
@@ -0,0 +1,454 @@
+---
+# User change
+title: "Integrating Halide into an Android (Kotlin) Project"
+
+weight: 6
+
+layout: "learningpathall"
+---
+
+## Objective
+In this lesson, we’ll learn how to integrate a high-performance Halide image-processing pipeline into an Android application using Kotlin.
+
+## Overview of mobile integration with Halide
+Android is the world’s most widely used mobile operating system, powering billions of devices across diverse markets. This vast user base makes Android an ideal target platform for developers aiming to reach a broad audience, particularly in applications requiring sophisticated image and signal processing, such as augmented reality, photography, video editing, and real-time analytics.
+
+Kotlin, now the preferred programming language for Android development, combines concise syntax with robust language features, enabling developers to write maintainable, expressive, and safe code. It offers seamless interoperability with existing Java codebases and straightforward integration with native code via JNI, simplifying the development of performant mobile applications.
+
+## Benefits of using Halide on mobile
+Integrating Halide into Android applications brings several key advantages:
+1. Performance. Halide enables significant acceleration of complex image processing algorithms, often surpassing the speed of traditional Java or Kotlin implementations by leveraging optimized code generation. By generating highly optimized native code tailored for ARM CPUs or GPUs, Halide can dramatically increase frame rates and responsiveness, essential for real-time or interactive applications.
+2. Efficiency. On mobile devices, resource efficiency translates directly to improved battery life and reduced thermal output. Halide’s scheduling strategies (such as operation fusion, tiling, parallelization, and vectorization) minimize unnecessary memory transfers, CPU usage, and GPU overhead. This optimization substantially reduces overall power consumption, extending battery life and enhancing the user experience by preventing overheating.
+3. Portability. Halide abstracts hardware-specific details, allowing developers to write a single high-level pipeline that easily targets different processor architectures and hardware configurations. Pipelines can seamlessly run on various ARM-based CPUs and GPUs commonly found in Android smartphones and tablets, enabling developers to support a wide range of devices with minimal platform-specific modifications.
+4. Custom Algorithm Integration. Halide allows developers to easily integrate their bespoke image-processing algorithms that may not be readily available or optimized in common libraries, providing full flexibility and control over application-specific performance and functionality.
+
+In short, Halide delivers high-performance image processing without sacrificing portability or efficiency, a balance particularly valuable on resource-constrained mobile devices.
+
+### Android development ecosystem and challenges
+While Android presents abundant opportunities for developers, the mobile development ecosystem brings its own set of challenges, especially for performance-intensive applications:
+1. Limited Hardware Resources. Unlike desktop or server environments, mobile devices have significant constraints on processing power, memory capacity, and battery life. Developers must optimize software meticulously to deliver smooth performance while carefully managing hardware resource consumption. Leveraging tools like Halide allows developers to overcome these constraints by optimizing computational workloads, making resource-intensive tasks feasible on constrained hardware.
+2. Cross-Compilation Complexities. Developing native code for Android requires handling multiple hardware architectures (such as armeabi-v7a, arm64-v8a, and sometimes x86/x86_64). Cross-compilation introduces complexities due to different instruction sets, CPU features, and performance characteristics. Managing this complexity involves careful use of the Android NDK, understanding toolchains, and correctly configuring build systems (e.g., Gradle, CMake). Halide helps mitigate these issues by abstracting away many platform-specific optimizations, automatically generating code optimized for target architectures.
+3. Image-Format Conversions (Bitmap ↔ Halide Buffer). Android typically handles images through the Bitmap class or similar platform-specific constructs, whereas Halide expects image data in raw, contiguous buffer formats. Developers must bridge the gap between Android-specific image representations (Bitmaps, YUV images from camera APIs, etc.) and Halide’s native buffer format. Proper management of these conversions, including pixel formats, stride alignment, and memory-copying overhead, can significantly impact performance and correctness, necessitating careful design and efficient implementation of buffer-handling routines; the sketch below illustrates the basic wrapping step.
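+
+As a minimal sketch of that wrapping step (with illustrative dimensions; the full Android integration later in this Learning Path builds on the same pattern), a raw grayscale byte array can be described to Halide without copying the pixels:
+```cpp
+#include "HalideBuffer.h"
+#include <cstdint>
+#include <vector>
+
+int main() {
+    const int width = 640, height = 480;
+    // Stand-in for pixel data copied out of an Android Bitmap.
+    std::vector<uint8_t> gray(width * height, 128);
+
+    // The buffer describes the existing memory; no pixel data is copied.
+    Halide::Runtime::Buffer<uint8_t> input(gray.data(), width, height);
+    return 0;
+}
+```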
+
+## Project requirements
+Before integrating Halide into your Android application, ensure you have the necessary tools and libraries.
+
+### Tools and prerequisites
+1. Android Studio. [Download link](https://developer.android.com/studio).
+2. Android NDK (Native Development Kit). Can be easily installed from Android Studio (Tools → SDK Manager → SDK Tools → Android NDK).
+
+## Setting up the Android project
+### Creating the project
+1. Open Android Studio.
+2. Select New Project > Native C++.
+
+
+### Configure the project
+1. Set the project Name to Arm.Halide.AndroidDemo.
+2. Choose Kotlin as the language.
+3. Set Minimum SDK to API 24.
+4. Click Next.
+
+5. Select C++17 from the C++ Standard dropdown list.
+
+6. Click Finish.
+
+## Configuring the Android project
+Next, configure your Android project to use the files generated in the previous step. First, copy blur_threshold_android.a and blur_threshold_android.h into ArmHalideAndroidDemo/app/src/main/cpp. Ensure your cpp directory contains the following files:
+* native-lib.cpp
+* blur_threshold_android.a
+* blur_threshold_android.h
+* CMakeLists.txt
+
+Open CMakeLists.txt and modify it as follows (replace /path/to/halide with your Halide installation directory):
+```cmake
+cmake_minimum_required(VERSION 3.22.1)
+
+project("armhalideandroiddemo")
+include_directories(
+ /path/to/halide/include
+)
+
+add_library(blur_threshold_android STATIC IMPORTED)
+set_target_properties(blur_threshold_android PROPERTIES IMPORTED_LOCATION
+ ${CMAKE_CURRENT_SOURCE_DIR}/blur_threshold_android.a
+)
+
+add_library(${CMAKE_PROJECT_NAME} SHARED native-lib.cpp)
+
+target_link_libraries(${CMAKE_PROJECT_NAME}
+ blur_threshold_android
+ android
+ log)
+```
+
+Open build.gradle.kts and modify it as follows:
+
+```kotlin
+plugins {
+ alias(libs.plugins.android.application)
+ alias(libs.plugins.kotlin.android)
+}
+
+android {
+ namespace = "com.arm.armhalideandroiddemo"
+ compileSdk = 35
+
+ defaultConfig {
+ applicationId = "com.arm.armhalideandroiddemo"
+ minSdk = 24
+ targetSdk = 34
+ versionCode = 1
+ versionName = "1.0"
+ ndk {
+ abiFilters += "arm64-v8a"
+ }
+ testInstrumentationRunner = "androidx.test.runner.AndroidJUnitRunner"
+ externalNativeBuild {
+ cmake {
+ cppFlags += "-std=c++17"
+ }
+ }
+ }
+
+ buildTypes {
+ release {
+ isMinifyEnabled = false
+ proguardFiles(
+ getDefaultProguardFile("proguard-android-optimize.txt"),
+ "proguard-rules.pro"
+ )
+ }
+ }
+ compileOptions {
+ sourceCompatibility = JavaVersion.VERSION_11
+ targetCompatibility = JavaVersion.VERSION_11
+ }
+ kotlinOptions {
+ jvmTarget = "11"
+ }
+ externalNativeBuild {
+ cmake {
+ path = file("src/main/cpp/CMakeLists.txt")
+ version = "3.22.1"
+ }
+ }
+ buildFeatures {
+ viewBinding = true
+ }
+}
+
+dependencies {
+
+ implementation(libs.androidx.core.ktx)
+ implementation(libs.androidx.appcompat)
+ implementation(libs.material)
+ implementation(libs.androidx.constraintlayout)
+ testImplementation(libs.junit)
+ androidTestImplementation(libs.androidx.junit)
+ androidTestImplementation(libs.androidx.espresso.core)
+}
+```
+
+Click the Sync Now button at the top. To verify that everything is configured correctly, click Build > Make Project in Android Studio.
+
+## UI
+Now, you'll define the application's User Interface, consisting of two buttons and an ImageView. One button loads the image, the other processes it, and the ImageView displays both the original and processed images.
+1. Open the res/layout/activity_main.xml file, and modify it as follows:
+```XML
+<?xml version="1.0" encoding="utf-8"?>
+<!-- Minimal layout assumed from the IDs referenced in MainActivity.kt:
+     two buttons stacked above an ImageView. -->
+<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
+    android:layout_width="match_parent"
+    android:layout_height="match_parent"
+    android:orientation="vertical"
+    android:gravity="center_horizontal">
+
+    <Button
+        android:id="@+id/btnLoadImage"
+        android:layout_width="wrap_content"
+        android:layout_height="wrap_content"
+        android:text="Load Image" />
+
+    <Button
+        android:id="@+id/btnProcessImage"
+        android:layout_width="wrap_content"
+        android:layout_height="wrap_content"
+        android:text="Process Image"
+        android:enabled="false" />
+
+    <ImageView
+        android:id="@+id/imageView"
+        android:layout_width="match_parent"
+        android:layout_height="0dp"
+        android:layout_weight="1" />
+</LinearLayout>
+```
+
+2. In MainActivity.kt, comment out the following line:
+
+```kotlin
+//binding.sampleText.text = stringFromJNI()
+```
+
+Now you can run the app to view the UI:
+
+
+
+## Processing
+You will now implement the image processing code. First, pick an image you want to process; this example uses the classic cameraman test image. Then create an assets folder under Arm.Halide.AndroidDemo/app/src/main, and save the image in that folder as img.png.
+
+Now, open MainActivity.kt and modify it as follows:
+```kotlin
+package com.arm.armhalideandroiddemo
+
+import android.graphics.Bitmap
+import android.graphics.BitmapFactory
+import androidx.appcompat.app.AppCompatActivity
+import android.os.Bundle
+import android.widget.Button
+import android.widget.ImageView
+import com.arm.armhalideandroiddemo.databinding.ActivityMainBinding
+import kotlinx.coroutines.CoroutineScope
+import kotlinx.coroutines.Dispatchers
+import kotlinx.coroutines.launch
+import kotlinx.coroutines.withContext
+import java.io.InputStream
+
+class MainActivity : AppCompatActivity() {
+
+ private lateinit var binding: ActivityMainBinding
+
+ private var originalBitmap: Bitmap? = null
+ private lateinit var btnLoadImage: Button
+ private lateinit var btnProcessImage: Button
+ private lateinit var imageView: ImageView
+
+ override fun onCreate(savedInstanceState: Bundle?) {
+ super.onCreate(savedInstanceState)
+
+ binding = ActivityMainBinding.inflate(layoutInflater)
+ setContentView(binding.root)
+
+ btnLoadImage = findViewById(R.id.btnLoadImage)
+ btnProcessImage = findViewById(R.id.btnProcessImage)
+ imageView = findViewById(R.id.imageView)
+
+ // Load the image from assets when the user clicks "Load Image"
+ btnLoadImage.setOnClickListener {
+ originalBitmap = loadImageFromAssets("img.png")
+ originalBitmap?.let {
+ imageView.setImageBitmap(it)
+ // Enable the process button only if the image is loaded.
+ btnProcessImage.isEnabled = true
+ }
+ }
+
+ // Process the image using Halide when the user clicks "Process Image"
+ btnProcessImage.setOnClickListener {
+ originalBitmap?.let { bmp ->
+ // Run the processing on a background thread using coroutines.
+ CoroutineScope(Dispatchers.IO).launch {
+ // Convert Bitmap to grayscale byte array.
+ val grayBytes = extractGrayScaleBytes(bmp)
+
+ // Call your native function via JNI.
+ val processedBytes = blurThresholdImage(grayBytes, bmp.width, bmp.height)
+
+ // Convert processed bytes back to a Bitmap.
+ val processedBitmap = createBitmapFromGrayBytes(processedBytes, bmp.width, bmp.height)
+
+ // Update UI on the main thread.
+ withContext(Dispatchers.Main) {
+ imageView.setImageBitmap(processedBitmap)
+ }
+ }
+ }
+ }
+ }
+
+ // Utility to load an image from the assets folder.
+ private fun loadImageFromAssets(fileName: String): Bitmap? {
+ return try {
+ val assetManager = assets
+ val istr: InputStream = assetManager.open(fileName)
+ BitmapFactory.decodeStream(istr)
+ } catch (e: Exception) {
+ e.printStackTrace()
+ null
+ }
+ }
+
+ // Convert Bitmap to a grayscale ByteArray.
+ private fun extractGrayScaleBytes(bitmap: Bitmap): ByteArray {
+ val width = bitmap.width
+ val height = bitmap.height
+ val pixels = IntArray(width * height)
+ bitmap.getPixels(pixels, 0, width, 0, 0, width, height)
+ val grayBytes = ByteArray(width * height)
+ var index = 0
+ for (pixel in pixels) {
+ val r = (pixel shr 16 and 0xFF)
+ val g = (pixel shr 8 and 0xFF)
+ val b = (pixel and 0xFF)
+ val gray = ((r + g + b) / 3).toByte()
+ grayBytes[index++] = gray
+ }
+ return grayBytes
+ }
+
+ // Convert a grayscale byte array back to a Bitmap.
+ private fun createBitmapFromGrayBytes(grayBytes: ByteArray, width: Int, height: Int): Bitmap {
+ val bitmap = Bitmap.createBitmap(width, height, Bitmap.Config.ARGB_8888)
+ val pixels = IntArray(width * height)
+ var idx = 0
+ for (i in 0 until width * height) {
+ val gray = grayBytes[idx++].toInt() and 0xFF
+ pixels[i] = (0xFF shl 24) or (gray shl 16) or (gray shl 8) or gray
+ }
+ bitmap.setPixels(pixels, 0, width, 0, 0, width, height)
+ return bitmap
+ }
+
+ external fun blurThresholdImage(inputBytes: ByteArray, width: Int, height: Int): ByteArray
+
+ companion object {
+ // Used to load the 'armhalideandroiddemo' library on application startup.
+ init {
+ System.loadLibrary("armhalideandroiddemo")
+ }
+ }
+}
+```
+
+This Kotlin Android application demonstrates integrating a Halide-generated image-processing pipeline within an Android app. The main activity (MainActivity) manages loading and processing an image stored in the application’s asset folder.
+
+When the app launches, the Process Image button is disabled. When a user taps Load Image, the app retrieves img.png from its assets directory and displays it within the ImageView, simultaneously enabling the Process Image button for further interaction.
+
+Upon pressing the Process Image button, the following sequence occurs:
+1. Background Processing. A Kotlin coroutine initiates processing on a background thread, ensuring the application’s UI remains responsive.
+2. Conversion to Grayscale. The loaded bitmap image is converted into a grayscale byte array using a simple RGB-average method, preparing it for processing by the native (JNI) layer.
+3. Native Function Invocation. This grayscale byte array, along with image dimensions, is passed to a native function (blurThresholdImage) defined via JNI. This native function is implemented using the Halide pipeline, performing operations such as blurring and thresholding directly on the image data.
+4. Post-processing. After the native function completes, the resulting processed grayscale byte array is converted back into a Bitmap image.
+5. UI Update. The coroutine then updates the displayed image (on the main UI thread) with this newly processed bitmap, providing the user immediate visual feedback.
+
+The code defines three utility methods:
+1. loadImageFromAssets, which retrieves an image from the assets folder and decodes it into a Bitmap.
+2. extractGrayScaleBytes - converts a Bitmap into a grayscale byte array suitable for native processing.
+3. createBitmapFromGrayBytes - converts a grayscale byte array back into a Bitmap for display purposes.
+
+Note that performing the grayscale conversion in Halide allows us to exploit operator fusion, further improving performance by avoiding intermediate memory accesses. This could be done as follows:
+```cpp
+// Halide variables
+Halide::Var x("x"), y("y"), c("c");
+
+// Original RGB input buffer (interleaved RGB)
+// make_interleaved describes the interleaved strides to Halide
+// (inputRgbData is assumed to point at the raw RGB pixels).
+Halide::Buffer<uint8_t> inputBuffer =
+    Halide::Buffer<uint8_t>::make_interleaved(inputRgbData, width, height, 3);
+
+// Convert RGB to grayscale directly in Halide pipeline
+Halide::Func grayscale("grayscale");
+grayscale(x, y) = Halide::cast<uint8_t>(
+ 0.299f * inputBuffer(x, y, 0) +
+ 0.587f * inputBuffer(x, y, 1) +
+ 0.114f * inputBuffer(x, y, 2)
+);
+
+// Continue pipeline: 3x3 Gaussian blur (example)
+Halide::Func blur("blur");
+int kernel[3][3] = {
+    {1, 2, 1},
+    {2, 4, 2},
+    {1, 2, 1}
+};
+
+// Accumulate the weighted neighborhood with compile-time offsets.
+// (A production pipeline would also add a boundary condition such as repeat_edge.)
+Halide::Expr blurSum = Halide::cast<uint16_t>(0);
+for (int i = -1; i <= 1; ++i) {
+    for (int j = -1; j <= 1; ++j) {
+        blurSum += Halide::cast<uint16_t>(grayscale(x + i, y + j)) * kernel[i + 1][j + 1];
+    }
+}
+blur(x, y) = Halide::cast<uint8_t>(blurSum / 16);
+
+// Fuse grayscale and blur operations
+grayscale.compute_at(blur, x);
+```
+
+The JNI integration occurs through an external method declaration, blurThresholdImage, loaded via the companion object at app startup. The native library (armhalideandroiddemo) containing this function is compiled separately and integrated into the application (native-lib.cpp).
+
+You will now need to create the blurThresholdImage JNI function. In Android Studio, place the cursor on the blurThresholdImage declaration, and then click Create JNI function for blurThresholdImage:
+
+
+This will generate a new function in the native-lib.cpp:
+```cpp
+extern "C"
+JNIEXPORT jbyteArray JNICALL
+Java_com_arm_armhalideandroiddemo_MainActivity_blurThresholdImage(JNIEnv *env, jobject thiz,
+ jbyteArray input_bytes,
+ jint width, jint height) {
+ // TODO: implement blurThresholdImage()
+}
+```
+
+Implement this function as follows:
+```cpp
+extern "C"
+JNIEXPORT jbyteArray JNICALL
+Java_com_arm_armhalideandroiddemo_MainActivity_blurThresholdImage(JNIEnv *env, jobject thiz,
+ jbyteArray input_bytes,
+ jint width, jint height) {
+ // Get the input byte array
+ jbyte* inBytes = env->GetByteArrayElements(input_bytes, nullptr);
+ if (inBytes == nullptr) return nullptr;
+
+ // Wrap the grayscale image in a Halide::Runtime::Buffer.
+ Halide::Runtime::Buffer<uint8_t> inputBuffer(reinterpret_cast<uint8_t*>(inBytes), width, height);
+
+ // Prepare an output buffer of the same size.
+ Halide::Runtime::Buffer<uint8_t> outputBuffer(width, height);
+
+ // Call the Halide AOT-compiled pipeline.
+ blur_threshold(inputBuffer, outputBuffer);
+
+ // Allocate a jbyteArray for the output.
+ jbyteArray outputArray = env->NewByteArray(width * height);
+ // Copy the data from Halide's output buffer to the jbyteArray.
+ env->SetByteArrayRegion(outputArray, 0, width * height, reinterpret_cast<const jbyte*>(outputBuffer.data()));
+
+ env->ReleaseByteArrayElements(input_bytes, inBytes, JNI_ABORT);
+ return outputArray;
+}
+```
+Then add the following includes at the top of native-lib.cpp:
+```cpp
+#include "HalideBuffer.h"
+#include "blur_threshold_android.h"
+```
+Only HalideBuffer.h (the lightweight header defining Halide::Runtime::Buffer) is needed here; the full Halide.h compiler header is not required when calling an AOT-compiled pipeline.
+
+This C++ function acts as a bridge between Java (Kotlin) and native code. Specifically, the function blurThresholdImage is implemented using JNI, allowing it to be directly called from Kotlin. When invoked from Kotlin (through the external fun blurThresholdImage declaration), the function receives a grayscale image represented as a Java byte array (jbyteArray) along with its width and height.
+
+The flow through the function is as follows:
+1. The input Java byte array (input_bytes) is accessed and pinned into native memory via GetByteArrayElements, providing a direct pointer (inBytes) to the grayscale data sent from Kotlin.
+2. The raw grayscale bytes are wrapped in a Halide::Runtime::Buffer (inputBuffer), the structure the Halide pipeline requires, and an output buffer (outputBuffer) with the same dimensions is created to receive the result.
+3. The native function invokes the Halide-generated AOT function blur_threshold, passing both the input and output buffers.
+4. After processing, a new Java byte array (outputArray) is allocated, and the processed grayscale data is copied into it using SetByteArrayRegion.
+5. The native input buffer (inBytes) is released with ReleaseByteArrayElements, specifying JNI_ABORT because the input array was not modified, and the processed byte array (outputArray) is returned to Kotlin.
+
+Through this JNI bridge, Kotlin can invoke high-performance native code. You can now re-run the application. Click the Load Image button, and then Process Image. You will see the following results:
+
+
+
+
+In the code above, we allocated a new jbyteArray and copied the data explicitly, which adds overhead. To avoid the extra copy, you can expose the native output memory to Java directly as a ByteBuffer:
+```cpp
+// Instead of allocating a new jbyteArray, create a direct ByteBuffer over the output data.
+jobject outputByteBuffer = env->NewDirectByteBuffer(outputBuffer.data(), width * height);
+```
+Note that NewDirectByteBuffer neither copies nor takes ownership of the memory: the allocation backing outputBuffer must remain valid for as long as the ByteBuffer is used on the Kotlin side (for example, by keeping the Halide buffer alive beyond the function call rather than letting a stack-local buffer be destroyed).
+
+## Summary
+In this lesson, we’ve successfully integrated a Halide image-processing pipeline into an Android application using Kotlin. We started by setting up an Android project configured for native development with the Android NDK, employing Kotlin as the primary language. We then integrated Halide-generated static libraries and demonstrated their usage through Java Native Interface (JNI), bridging Kotlin and native code. This equips developers with the skills needed to harness Halide’s capabilities for building sophisticated, performant mobile applications on Android.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md
new file mode 100644
index 0000000000..b08db87b31
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/aot-and-cross-compilation.md
@@ -0,0 +1,162 @@
+---
+# User change
+title: "Ahead-of-time and cross-compilation"
+
+weight: 5
+
+layout: "learningpathall"
+---
+
+## Ahead-of-time and cross-compilation
+One of Halide’s standout features is the ability to compile image processing pipelines ahead-of-time (AOT), enabling developers to generate optimized binary code on their host machines rather than compiling directly on target devices. This AOT compilation process allows developers to create highly efficient libraries that run effectively across diverse hardware without incurring the runtime overhead associated with just-in-time (JIT) compilation.
+
+Halide also supports robust cross-compilation capabilities. Cross-compilation means using the host version of Halide (typically running on a desktop Linux or macOS system) to target different architectures, such as ARM for Android devices. Developers can thus optimize Halide pipelines on their host machine, produce libraries specifically optimized for Android, and integrate them seamlessly into Android applications. The generated pipeline code includes essential optimizations and can embed minimal runtime support, further reducing workload on the target device and ensuring responsiveness and efficiency.
+
+## Objective
+In this section, we leverage the host version of Halide to perform AOT compilation of an image processing pipeline via cross-compilation. The resulting pipeline library is specifically tailored to Android devices (targeting, for instance, arm64-v8a ABI), while the compilation itself occurs entirely on the host system. This approach significantly accelerates development by eliminating the need to build Halide or perform JIT compilation on Android devices. It also guarantees that the resulting binaries are optimized for the intended hardware, streamlining the deployment of high-performance image processing applications on mobile platforms.
+
+## Prepare Pipeline for Android
+The procedure implemented in the following code demonstrates how Halide’s AOT compilation and cross-compilation features can be utilized to create an optimized image processing pipeline for Android. We will run Halide on our host machine (in this example, macOS) to generate a static library containing the pipeline function, which will later be invoked from an Android device. Below is a step-by-step explanation of this process.
+
+Create a new file named blur-android.cpp with the following contents:
+
+```cpp
+#include "Halide.h"
+#include <iostream>
+#include <string>   // for std::string
+#include <cstdint>  // for fixed-width integer types (e.g., uint8_t)
+using namespace Halide;
+
+int main(int argc, char** argv) {
+ if (argc < 2) {
+ std::cerr << "Usage: " << argv[0] << " <output_basename>\n";
+ return 1;
+ }
+
+ std::string output_basename = argv[1];
+
+ // Configure Halide Target for Android
+ Halide::Target target;
+ target.os = Halide::Target::OS::Android;
+ target.arch = Halide::Target::Arch::ARM;
+ target.bits = 64;
+ target.set_feature(Target::NoRuntime, false);
+
+ // --- Define the pipeline ---
+ // Define variables
+ Var x("x"), y("y");
+
+ // Define input parameter
+ ImageParam input(UInt(8), 2, "input");
+
+ // Create a clamped function that limits the access to within the image bounds
+ Func clamped = Halide::BoundaryConditions::repeat_edge(input);
+
+ // Now use the clamped function in processing
+ RDom r(0, 3, 0, 3);
+ Func blur("blur");
+
+ // Initialize blur accumulation
+ blur(x, y) = cast<uint16_t>(0);
+ blur(x, y) += cast<uint16_t>(clamped(x + r.x - 1, y + r.y - 1));
+
+ // Then continue with pipeline
+ Func blur_div("blur_div");
+ blur_div(x, y) = cast<uint8_t>(blur(x, y) / 9);
+
+ // Thresholding
+ Func thresholded("thresholded");
+ Expr t = cast<uint8_t>(128);
+ thresholded(x, y) = select(blur_div(x, y) > t, cast<uint8_t>(255), cast<uint8_t>(0));
+
+ // Simple scheduling
+ blur_div.compute_root();
+ thresholded.compute_root();
+
+ // --- AOT compile to a file ---
+ thresholded.compile_to_static_library(
+ output_basename, // base filename
+ { input }, // list of inputs
+ "blur_threshold", // name of the generated function
+ target
+ );
+
+ return 0;
+}
+```
+
+In the original implementation, the constants 128, 255, and 0 were implicitly treated as integers. Here, the threshold value (128) and the output values (255, 0) are explicitly cast to uint8_t. This removes ambiguity and clearly specifies the types used, ensuring compatibility and clarity. Both approaches produce identical results, but explicit casting emphasizes type correctness and may avoid subtle issues during cross-compilation or in certain environments.
+
+The program takes at least one command-line argument, the output base name used to generate the files (e.g., “blur_threshold_android”). Here, the target architecture is explicitly set within the code to Android ARM64:
+
+```cpp
+// Configure Halide Target for Android
+Halide::Target target;
+target.os = Halide::Target::OS::Android;
+target.arch = Halide::Target::Arch::ARM;
+target.bits = 64;
+
+// Enable Halide runtime inclusion in the generated library (needed if not linking Halide runtime separately).
+target.set_feature(Target::NoRuntime, false);
+
+// Optionally, enable hardware-specific optimizations to improve performance on ARM devices:
+// - DotProd: Optimizes matrix multiplication and convolution-like operations on ARM.
+// - ARMFp16 (half-precision floating-point operations).
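+// For example (commented out; enable only on devices that support these features):
+// target.set_feature(Halide::Target::ARMDotProd);
+// target.set_feature(Halide::Target::ARMFp16);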
+```
+
+Notes:
+* NoRuntime feature. When set to true, Halide excludes its runtime from the generated code, requiring you to link the runtime manually during the linking step. Setting it to false includes the Halide runtime within the generated library, simplifying deployment.
+* ARMFp16. Leverages ARM’s hardware support for half-precision (16-bit) floating-point computations, significantly accelerating workloads where reduced precision is acceptable, such as neural networks and image processing.
+
+We declare spatial variables (x, y) and an ImageParam named “input” representing the input image data. We use the repeat_edge boundary condition to safely handle edge pixels. Then, we apply a 3x3 blur with a reduction domain (RDom). The accumulated sum is divided by 9 (the number of pixels in the neighborhood), producing an average blurred image. Lastly, thresholding is applied, producing a binary output: pixels above a certain brightness threshold (128) become white (255), while others become black (0).
+
+This section intentionally reinforces previous concepts, focusing now primarily on explicitly clarifying integration details, such as type correctness and the handling of runtime features within Halide.
+
+Simple scheduling directives (compute_root) instruct Halide to compute intermediate functions at the pipeline’s root, simplifying debugging and potentially enhancing runtime efficiency.
+
+This strategy can simplify debugging by clearly isolating computational steps and may enhance runtime efficiency by explicitly controlling intermediate storage locations.
+
+By clearly separating algorithm logic from scheduling, developers can easily test and compare different scheduling strategies, such as compute_inline, compute_root, compute_at, and more, without modifying their fundamental algorithmic code. This separation significantly accelerates iterative optimization and debugging processes, ultimately yielding better-performing code with minimal overhead.
+
+We invoke Halide’s AOT compilation function compile_to_static_library, which generates a static library (.a) containing the optimized pipeline and a corresponding header file (.h).
+
+```cpp
+thresholded.compile_to_static_library(
+ output_basename, // base filename for output files (e.g., "blur_threshold_android")
+ { input }, // list of input parameters to the pipeline
+ "blur_threshold", // the generated function name
+ target // our target configuration for Android
+);
+```
+
+This will produce:
+* A static library (blur_threshold_android.a) containing the compiled pipeline. This static library also includes Halide’s runtime functions tailored specifically for the targeted architecture (arm-64-android). Thus, no separate Halide runtime needs to be provided on the Android device when linking against this library.
+* A header file (blur_threshold_android.h) declaring the pipeline function for use in other C++/JNI code. A sketch of its shape appears below.
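+
+For reference, the generated header exposes the pipeline as a plain C function operating on halide_buffer_t pointers. The following is only a sketch of its shape, not the verbatim generated file:
+```cpp
+// blur_threshold_android.h (illustrative sketch)
+extern "C" int blur_threshold(struct halide_buffer_t *input,
+                              struct halide_buffer_t *output);
+```
+Halide::Runtime::Buffer objects convert implicitly to halide_buffer_t*, which is why the JNI code in the next section can pass them to blur_threshold directly.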
+
+These generated files are then ready to integrate directly into an Android project via JNI, allowing efficient execution of the optimized pipeline on Android devices. The integration process is covered in the next section.
+
+Note: JNI (Java Native Interface) is a framework that allows Java (or Kotlin) code running in a Java Virtual Machine (JVM), such as on Android, to interact with native applications and libraries written in languages like C or C++. JNI bridges the managed Java/Kotlin environment and the native, platform-specific implementations.
+
+## Compilation instructions
+To compile the pipeline-generation program on your host system, use the following commands (replace /path/to/halide with your Halide installation directory):
+```console
+export DYLD_LIBRARY_PATH=/path/to/halide/lib
+g++ -std=c++17 blur-android.cpp -o blur-android \
+    -I/path/to/halide/include -L/path/to/halide/lib -lHalide \
+    -lpthread -ldl \
+    -Wl,-rpath,/path/to/halide/lib
+```
+
+Then execute the binary:
+```console
+./blur-android blur_threshold_android
+```
+
+This will produce two files:
+* blur_threshold_android.a: The static library containing your Halide pipeline.
+* blur_threshold_android.h: The header file needed to invoke the generated pipeline.
+
+We will integrate these files into our Android project in the following section.
+
+## Summary
+In this section, we’ve explored Halide’s powerful ahead-of-time (AOT) and cross-compilation capabilities, preparing an optimized image processing pipeline tailored specifically for Android devices. By using the host-based Halide compiler, we’ve generated a static library optimized for ARM64 Android architecture, incorporating safe boundary conditions, neighborhood-based blurring, and thresholding operations. This streamlined process allows seamless integration of highly optimized native code into Android applications, ensuring both development efficiency and runtime performance on mobile platforms.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
new file mode 100644
index 0000000000..a31a3b4907
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/fusion.md
@@ -0,0 +1,246 @@
+---
+# User change
+title: "Demonstrating Operation Fusion"
+
+weight: 4
+
+layout: "learningpathall"
+---
+
+## Objective
+In the previous section, we explored parallelization and tiling. Here, we focus specifically on loop fusion using Halide’s fuse directive. Loop fusion merges multiple loop indices into a single loop variable, enhancing cache locality and simplifying parallel execution.
+
+## What is operation fusion?
+Operation fusion (also known as operator fusion or kernel fusion) is a technique used in high-performance computing, especially in image and signal processing pipelines, where multiple computational steps (operations) are combined into a single processing stage. Instead of computing and storing intermediate results separately, fused operations perform calculations in one continuous pass, reducing redundant memory operations and improving efficiency.
+
+## How fusion reduces memory bandwidth and scheduling overhead
+Loop fusion combines two or more nested loops into a single loop. This technique is distinct from operation fusion (compute_at), which places the computation of one function inside another’s loop nest. While operation fusion reduces intermediate storage, loop fusion simplifies loop structure, improving cache performance and parallel efficiency.
+
+Every individual stage in a processing pipeline typically reads input data, computes intermediate results, writes these results back to memory, and then the next stage again reads this intermediate data. This repeated read-write cycle introduces significant overhead, particularly in memory-intensive applications like image processing. Operation fusion reduces this overhead by:
+1. Reducing memory accesses. Intermediate results stay in CPU registers or caches rather than being repeatedly written to and read from main memory.
+2. Improving cache utilization. Data is accessed in a contiguous manner, improving CPU cache efficiency.
+3. Reducing scheduling overhead. By executing multiple operations in a single pass, scheduling complexity and overhead are minimized. The sketch below shows the fuse directive in isolation before we apply it to the full pipeline.
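+
+The following is a minimal, self-contained sketch of the fuse directive (function names are illustrative, not part of the camera pipeline below):
+```cpp
+#include "Halide.h"
+
+int main() {
+    Halide::Var x("x"), y("y"), xy("xy");
+    Halide::Func producer("producer"), consumer("consumer");
+    producer(x, y) = x + y;
+    consumer(x, y) = producer(x, y) * 2;
+
+    // Collapse consumer's x and y loops into a single loop xy and
+    // parallelize it; compute producer inside each fused iteration.
+    consumer.fuse(x, y, xy).parallel(xy);
+    producer.compute_at(consumer, xy);
+
+    Halide::Buffer<int> out = consumer.realize({256, 256});
+    return 0;
+}
+```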
+
+## Loop fusion in practice
+Let’s explicitly apply Halide’s loop fusion to our previously demonstrated Gaussian blur and threshold pipeline. Create a new file camera-capture-fusion.cpp and paste in the following code:
+
+```cpp
+#include "Halide.h"
+#include <opencv2/opencv.hpp>
+#include <iostream>
+#include <string>
+#include <vector>
+#include <cstdint>
+
+using namespace cv;
+using namespace std;
+
+static inline Halide::Expr clampCoord(Halide::Expr coord, int maxCoord) {
+ return Halide::clamp(coord, 0, maxCoord - 1);
+}
+
+int main() {
+ VideoCapture cap(0);
+ if (!cap.isOpened()) {
+ cerr << "Error: Unable to open camera." << endl;
+ return -1;
+ }
+
+ while (true) {
+ Mat frame;
+ cap >> frame;
+ if (frame.empty()) {
+ cerr << "Error: Received empty frame." << endl;
+ break;
+ }
+
+ Mat gray;
+ cvtColor(frame, gray, COLOR_BGR2GRAY);
+ if (!gray.isContinuous()) {
+ gray = gray.clone();
+ }
+
+ int width = gray.cols;
+ int height = gray.rows;
+
+ Halide::Buffer<uint8_t> inputBuffer(gray.data, width, height);
+ Halide::ImageParam input(Halide::UInt(8), 2, "input");
+ input.set(inputBuffer);
+
+ int kernel_vals[3][3] = {
+ {1, 2, 1},
+ {2, 4, 2},
+ {1, 2, 1}
+ };
+ Halide::Buffer<int> kernelBuf(&kernel_vals[0][0], 3, 3);
+
+ Halide::Var x("x"), y("y"), xy("xy");
+ Halide::RDom r(0, 3, 0, 3);
+
+ Halide::Func blur("blur");
+ Halide::Expr val = Halide::cast<int>(
+ input(clampCoord(x + r.x - 1, width),
+ clampCoord(y + r.y - 1, height))
+ ) * kernelBuf(r.x, r.y);
+ blur(x, y) = Halide::cast<uint8_t>(Halide::sum(val) / 16);
+
+ Halide::Func thresholded("thresholded");
+ thresholded(x, y) = Halide::cast<uint8_t>(Halide::select(blur(x, y) > 128, 255, 0));
+
+ // Fuse
+ thresholded.fuse(x, y, xy);
+ blur.compute_at(thresholded, xy);
+
+ Halide::Buffer<uint8_t> outputBuffer;
+ try {
+ outputBuffer = thresholded.realize({width, height}); // 2D output as usual
+ } catch (const Halide::CompileError &e) {
+ cerr << "Halide compile error: " << e.what() << endl;
+ break;
+ } catch (const std::exception &e) {
+ cerr << "Halide pipeline error: " << e.what() << endl;
+ break;
+ }
+
+ Mat blurredThresholded(height, width, CV_8UC1, outputBuffer.data());
+ imshow("Processed Image (Fused)", blurredThresholded);
+
+ if (waitKey(30) >= 0) {
+ break;
+ }
+ }
+
+ cap.release();
+ destroyAllWindows();
+ return 0;
+}
+```
+
+The code presented here closely follows the structure from the previous examples, focusing on the Gaussian blur and thresholding pipeline implemented with Halide and OpenCV. In this particular instance, we introduce the explicit use of loop fusion using Halide’s fuse scheduling directive, complemented by operation fusion with compute_at, to showcase how both techniques can synergize to optimize performance further.
+
+In addition to these scheduling optimizations, we’ve also enhanced the exception handling within our pipeline. Specifically, we’ve included a separate catch block to detect and report Halide compilation errors explicitly. This ensures that if there’s a mistake in pipeline definition or scheduling directives that prevent the Halide pipeline from compiling, it is promptly caught and reported with clear feedback, simplifying debugging and improving robustness.
+
+The critical addition to this pipeline is the explicit loop fusion achieved with the following directive:
+```cpp
+thresholded.fuse(x, y, xy);
+```
+
+Here, the two spatial dimensions—horizontal (x) and vertical (y)—are combined into a single, linearized dimension named xy. Loop fusion significantly enhances memory access patterns by promoting data locality. This linearization of loop indices means pixels are accessed sequentially and contiguously, which aligns perfectly with how data is stored in memory. Consequently, cache efficiency is improved, as fewer cache misses occur, resulting in faster data processing. Additionally, fusing loops simplifies the task of parallelizing the computation, as there is now only one unified dimension to distribute across multiple processor cores, making it straightforward and effective.
+
+Alongside loop fusion, we apply operation fusion (compute_at) as previously discussed. This is demonstrated with the line:
+```cpp
+blur.compute_at(thresholded, xy);
+```
+
+Here, the computation of the blurred image (blur) is performed directly within the newly fused loop (xy) of the thresholding operation (thresholded). By placing the blur computation immediately before thresholding, intermediate blurred values do not need to be stored extensively in memory. Instead, they’re calculated as needed and promptly consumed, effectively eliminating redundant intermediate storage and further reducing memory bandwidth requirements.
+
+Combining these two powerful scheduling directives—loop fusion (fuse) and operation fusion (compute_at)—results in a highly optimized pipeline. The main benefits are:
+* Enhanced cache locality. Loop fusion ensures contiguous memory access patterns, greatly reducing cache misses.
+* Simplified parallelization. With a single fused loop (xy), the computational workload is easier to distribute evenly across CPU cores, maximizing parallel efficiency.
+* Reduced loop overhead: Fewer loop iterations and simpler loop structures result in reduced computational overhead, further accelerating real-time processing tasks.
+
+Together, these improvements are crucial for real-time image processing applications where high frame rates and low latency are required.
+
+Though complementary, loop fusion (fuse) and operation fusion (compute_at) target slightly different aspects of pipeline optimization:
+* Operation Fusion (compute_at). Focuses primarily on minimizing intermediate memory usage by integrating the computation of dependent operations into the loop structure of their consumers.
+* Loop Fusion (fuse). Primarily targets enhancing memory access efficiency and simplifying parallelization by merging loop dimensions.
+
+The explicit combined use of these techniques in the final code snippet represents a comprehensive optimization strategy: the two directives complement each other, enabling Halide to deliver maximal real-time performance.
+
+## When to use operation fusion
+Operation fusion is especially beneficial for pipelines involving multiple sequential element-wise operations. These operations perform independent transformations on individual pixels without requiring neighboring data. Element-wise operations benefit greatly from fusion since they avoid the overhead associated with repeatedly storing and loading intermediate results, significantly reducing memory bandwidth usage.
+
+Ideal use cases for fusion include:
+* Pixel intensity normalization (scaling and shifting)
+* Color-space transformations (e.g., RGB to grayscale conversion)
+* Simple arithmetic or logical operations applied pixel-by-pixel (see the sketch below)
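+
+As a minimal sketch of this case (with illustrative values and sizes), two element-wise stages are fused automatically because the intermediate stage is left unscheduled:
+```cpp
+#include "Halide.h"
+#include <cstdint>
+
+int main() {
+    Halide::Buffer<uint8_t> in(640, 480);
+    in.fill(100);
+
+    Halide::Var x("x"), y("y");
+    Halide::Func norm("norm"), out("out");
+    // Stage 1: scale pixels to [0, 1]; Stage 2: apply a gamma of 2.
+    norm(x, y) = Halide::cast<float>(in(x, y)) / 255.0f;
+    out(x, y) = Halide::cast<uint8_t>(norm(x, y) * norm(x, y) * 255.0f);
+
+    // With no schedule on norm, Halide inlines (fuses) it into out:
+    // both stages run in one pass with no intermediate buffer.
+    Halide::Buffer<uint8_t> result = out.realize({640, 480});
+    return 0;
+}
+```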
+
+However, fusion can introduce redundancy and inefficiencies when dealing with spatial operations such as blurs or convolutions, especially if:
+* The intermediate results are used multiple times.
+* The spatial filter has a large kernel (e.g., large Gaussian blur).
+* There are multiple sequential layers of spatial filters (e.g., multiple convolution layers).
+
+Operation fusion generally improves performance by reducing memory usage, eliminating intermediate storage, and enhancing cache locality. However, fusion may be less beneficial (or even detrimental) under certain circumstances:
+* Repeated reuse of intermediate results. If the same intermediate computation is heavily reused across multiple subsequent stages, explicitly storing this intermediate result (using compute_root()) can be more efficient than recomputing it multiple times through fusion.
+* Reduced parallelism or vectorization opportunities. Aggressive fusion can complicate or even restrict parallelization and vectorization opportunities, potentially hurting performance. In such scenarios, explicitly scheduling computations separately might yield better overall efficiency.
+
+For example, consider a scenario where a computationally expensive intermediate result is reused multiple times:
+```cpp
+Halide::Var x("x"), y("y");
+Halide::Func expensive_intermediate("expensive_intermediate");
+Halide::Func stage1("stage1"), stage2("stage2"), final_stage("final_stage");
+
+// Expensive intermediate computation
+expensive_intermediate(x, y) = ...;
+
+// Multiple stages reusing the intermediate
+stage1(x, y) = expensive_intermediate(x, y) + 1;
+stage2(x, y) = expensive_intermediate(x, y) * 2;
+
+// Final stage using results from previous stages
+final_stage(x, y) = stage1(x, y) + stage2(x, y);
+```
+
+In this case, explicitly storing the intermediate computation at the root level is beneficial:
+```cpp
+expensive_intermediate.compute_root();
+```
+
+This prevents redundant recomputation, resulting in higher efficiency compared to aggressively fusing these stages. In short, fusion is particularly effective in pipelines where intermediate results are not heavily reused or where recomputation costs are minimal compared to memory overhead. Being aware of these considerations helps achieve optimal scheduling decisions tailored to your specific pipeline.
+
+
+
+### Profiling
+To profile a pipeline, you can use Halide’s built-in profiler. For details on how to enable and interpret it, refer to the official [Halide profiling tutorial](https://halide-lang.org/tutorials/tutorial_lesson_21_auto_scheduler_generate.html#profiling).
+
+## Summary
+In this lesson, we learned about operation fusion in Halide, a powerful technique to reduce memory bandwidth and improve computational efficiency. We explored why fusion matters, identified scenarios where fusion is most effective, and demonstrated how Halide’s scheduling constructs (compute_at, store_at, fuse) enable you to apply fusion easily and effectively. By fusing the Gaussian blur and thresholding stages, we improved the performance of our real-time image processing pipeline.
\ No newline at end of file
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md
new file mode 100644
index 0000000000..af3187aa91
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/intro.md
@@ -0,0 +1,241 @@
+---
+# User change
+title: "Background and Installation"
+
+weight: 2
+
+layout: "learningpathall"
+---
+
+## Introduction
+Halide is a powerful, open-source programming language specifically designed to simplify and optimize high-performance image and signal processing pipelines. Initially developed by researchers at MIT and Adobe in 2012, Halide addresses a critical challenge in computational imaging: efficiently mapping image-processing algorithms onto diverse hardware architectures without extensive manual tuning. It accomplishes this by clearly separating the description of an algorithm (specifying the mathematical or logical transformations applied to images or signals) from its schedule (detailing how and where those computations execute). This design enables rapid experimentation and effective optimization for various processing platforms, including CPUs, GPUs, and mobile hardware.
+
+A key advantage of Halide lies in its innovative programming model. By clearly distinguishing between algorithmic logic and scheduling decisions—such as parallelism, vectorization, memory management, and hardware-specific optimizations—developers can first focus on ensuring the correctness of their algorithms. Performance tuning can then be handled independently, significantly accelerating development cycles. This approach often yields performance that matches or even surpasses manually optimized code. As a result, Halide has seen widespread adoption across industry and academia, powering image processing systems at technology giants such as Google, Adobe, and Facebook, and enabling advanced computational photography features used by millions daily.
+
+In this learning path, you will explore Halide’s foundational concepts, set up your development environment, and create your first functional Halide application. By the end, you will understand what makes Halide uniquely suited to efficient image processing, particularly on mobile and Arm-based hardware, and be ready to build your own optimized pipelines.
+
+For broader or more general use cases, please refer to the official Halide documentation and tutorials available at halide-lang.org.
+
+The example code for this Learning Path is available in two repositories: [Arm.Halide.Hello-World](https://github.com/dawidborycki/Arm.Halide.Hello-World.git) and [Arm.Halide.AndroidDemo](https://github.com/dawidborycki/Arm.Halide.AndroidDemo.git).
+
+## Key concepts in Halide
+### Separation of algorithm and schedule
+At the core of Halide’s design philosophy is the principle of clearly separating algorithms from schedules. Traditional image-processing programming tightly couples algorithmic logic with execution strategy, complicating optimization and portability. In contrast, Halide explicitly distinguishes these two components:
+* Algorithm. Defines what computations are performed—for example, image filters, pixel transformations, or other mathematical operations on image data.
+* Schedule. Specifies how and where these computations are executed, addressing critical details such as parallel execution, memory usage, caching strategies, and hardware-specific optimizations.
+
+This separation allows developers to rapidly experiment and optimize their code for different hardware architectures or performance requirements without altering the core algorithmic logic.
+
+Halide provides three key building blocks to simplify and structure image processing algorithms: Functions, Vars, and Pipelines. Consider the following illustrative example:
+
+```cpp
+Halide::Var x("x"), y("y"), c("c");
+Halide::Func brighter("brighter");
+
+// Define a function to increase image brightness by 50
+brighter(x, y, c) = Halide::cast<uint8_t>(Halide::min(input(x, y, c) + 50, 255));
+```
+
+Functions (Func) represent individual computational steps or image operations; each Func encapsulates an expression applied to pixels, allowing concise definition of complex image processing tasks. Vars symbolically represent spatial coordinates or dimensions (for example, horizontal x, vertical y, color channel c); they specify where computations are applied in the image data. Pipelines are formed by interconnecting multiple Func objects, structuring a clear workflow where the output of one stage feeds into subsequent stages, enabling modular and structured image processing.
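+
+A minimal sketch of such a pipeline (with illustrative functions, not tied to any particular image) chains two Funcs so the output of one feeds the next:
+```cpp
+#include "Halide.h"
+
+int main() {
+    Halide::Var x("x"), y("y");
+
+    // Stage 1: a simple gradient image.
+    Halide::Func gradient("gradient");
+    gradient(x, y) = x + y;
+
+    // Stage 2: consumes stage 1, forming a two-stage pipeline.
+    Halide::Func doubled("doubled");
+    doubled(x, y) = gradient(x, y) * 2;
+
+    Halide::Buffer<int> result = doubled.realize({8, 8});
+    return 0;
+}
+```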
+
+Halide is a domain-specific language (DSL) tailored explicitly for image and signal processing tasks. It provides a concise set of predefined operations and building blocks optimized for expressing complex image processing pipelines. By abstracting common computational patterns into simple yet powerful operators, Halide allows developers to succinctly define their processing logic, facilitating readability, maintainability, and easy optimization for various hardware targets.
+
+### Scheduling strategies (parallelism, vectorization, tiling)
+Halide offers several powerful scheduling strategies designed for maximum performance:
+* Parallelism. Executes computations concurrently across multiple CPU cores, significantly reducing execution time for large datasets.
+* Vectorization. Enables simultaneous processing of multiple data elements using SIMD (Single Instruction, Multiple Data) instructions available on CPUs and GPUs, greatly enhancing performance.
+* Tiling. Divides computations into smaller blocks (tiles) optimized for cache efficiency, thus improving memory locality and reducing overhead due to memory transfers.
+
+By combining these scheduling techniques, developers can achieve optimal performance tailored specifically to their target hardware architecture.
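+
+A minimal sketch combining all three strategies (with an illustrative function and sizes):
+```cpp
+#include "Halide.h"
+
+int main() {
+    Halide::Var x("x"), y("y"), xo("xo"), yo("yo"), xi("xi"), yi("yi");
+    Halide::Func f("f");
+    f(x, y) = x + y;
+
+    // Tile into 64x64 blocks, parallelize across tile rows,
+    // and vectorize the innermost loop 8 lanes wide.
+    f.tile(x, y, xo, yo, xi, yi, 64, 64)
+     .parallel(yo)
+     .vectorize(xi, 8);
+
+    Halide::Buffer<int> out = f.realize({1024, 1024});
+    return 0;
+}
+```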
+
+Beyond manual scheduling strategies, Halide also provides an Autoscheduler, a powerful tool that automatically generates optimized schedules tailored to specific hardware architectures, further simplifying performance optimization.
+
+## System requirements and environment setup
+To start developing with Halide, your system must meet several requirements and dependencies.
+
+### Installation options
+Halide can be set up using one of two main approaches:
+* Installing pre-built binaries. Pre-built binaries are convenient, quick to install, and suitable for most beginners and standard platforms (Windows, Linux, macOS). This approach is recommended for typical use cases.
+* Building Halide from source. This is required when pre-built binaries are unavailable for your environment, or when you want to experiment with the latest Halide features or LLVM versions still under active development. It demands greater familiarity with build systems and is better suited to advanced users.
+
+Here, we’ll use pre-built binaries:
+1. Visit the official Halide releases [page](https://github.com/halide/Halide/releases). As of this writing, the latest Halide version is v19.0.0.
+2. Download and unzip the binaries to a convenient location (e.g., /usr/local/halide on Linux/macOS or C:\halide on Windows).
+3. Optionally set environment variables to simplify further usage:
+```console
+export HALIDE_DIR=/path/to/halide
+export PATH=$HALIDE_DIR/bin:$PATH
+```
+
+To proceed further, install the following components:
+1. LLVM (Halide requires LLVM to compile and execute pipelines):
+* Linux (Ubuntu):
+```console
+sudo apt-get install llvm-19-dev libclang-19-dev clang-19
+```
+* macOS (Homebrew):
+```console
+brew install llvm
+```
+2. OpenCV (for image handling in later lessons):
+* Linux (Ubuntu):
+```console
+sudo apt-get install libopencv-dev pkg-config
+```
+* macOS (Homebrew):
+```console
+brew install opencv pkg-config
+```
+
+The Halide examples in this learning path were tested with OpenCV 4.11.0.
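+
+You can quickly confirm that both dependencies are visible to your toolchain (version numbers will differ; on Ubuntu the LLVM binary may be versioned, for example llvm-config-19):
+
+```console
+llvm-config --version
+pkg-config --modversion opencv4
+```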
+
+## Your first Halide program
+Now you’re ready to build your first Halide-based application. Save the following as hello-world.cpp:
+```cpp
+#include "Halide.h"
+#include <opencv2/opencv.hpp>
+#include <iostream>
+#include <string>
+#include <cstdint>
+
+using namespace Halide;
+using namespace cv;
+
+int main() {
+ // Static path for the input image.
+ std::string imagePath = "img.png";
+
+ // Load the input image using OpenCV (BGR by default).
+ Mat input = imread(imagePath, IMREAD_COLOR);
+ // Alternative: Halide ships image IO helpers that can load an image directly into a Halide::Buffer.
+ // Example: Halide::Buffer<uint8_t> inputBuffer = Halide::Tools::load_image(imagePath);
+ if (input.empty()) {
+ std::cerr << "Error: Unable to load image from " << imagePath << std::endl;
+ return -1;
+ }
+
+ // Convert from OpenCV's default BGR ordering to RGB so the pipeline operates on RGB data.
+ cvtColor(input, input, COLOR_BGR2RGB);
+
+ // Wrap the OpenCV Mat data in a Halide::Buffer.
+ Buffer<uint8_t> inputBuffer(input.data, input.cols, input.rows, input.channels());
+
+ // Example Halide pipeline definition directly using inputBuffer
+ // Define Halide pipeline variables:
+ // x, y - spatial coordinates (width, height)
+ // c - channel coordinate (R, G, B)
+ Var x("x"), y("y"), c("c");
+ Func invert("inverted");
+ invert(x, y, c) = 255 - inputBuffer(x, y, c);
+
+ // Schedule the pipeline so that the channel dimension is the innermost loop,
+ // computing each pixel's channels together (see reorder vs. reorder_storage below).
+ invert.reorder(c, x, y);
+
+ // Realize the output buffer with the same dimensions as the input.
+ Buffer<uint8_t> outputBuffer = invert.realize({input.cols, input.rows, input.channels()});
+
+ // Wrap the Halide output buffer directly into an OpenCV Mat header.
+ // CV_8UC3 indicates an 8-bit unsigned integer image (CV_8U) with 3 color channels (C3), typically representing RGB or BGR images.
+ // This does not copy data; it creates a header that refers to the same memory.
+ Mat output(input.rows, input.cols, CV_8UC3, outputBuffer.data());
+
+ // Convert from BGR to RGB for consistency (optional, but recommended if your pipeline expects RGB).
+ cvtColor(output, output, COLOR_RGB2BGR);
+
+ // Display the input and processed image.
+ imshow("Original Image", input);
+ imshow("Inverted Image", output);
+
+ // Wait indefinitely until a key is pressed.
+ waitKey(0); // Wait for a key press before closing the window.
+
+ return 0;
+}
+```
+
+This program demonstrates how to combine Halide’s image processing capabilities with OpenCV’s image I/O and display functionality. It begins by loading an image from disk using OpenCV, specifically reading from a static file named img.png (here we use a Cameraman image). Since OpenCV loads images in BGR format by default, the code immediately converts the image to RGB format so that it is compatible with Halide’s expectations.
+
+Once the image is loaded and converted, the program wraps the raw image data into a Halide buffer, capturing the image’s dimensions (width, height, and color channels). Next, the Halide pipeline is defined through a function named invert, which specifies the computations to perform on each pixel—in this case, subtracting the original pixel value from 255 to invert the colors. The pipeline definition alone does not perform any actual computation; it only describes what computations should occur and how to schedule them.
+
+The actual computation occurs when the pipeline is executed with the call to invert.realize(...). This is the step that processes the input image according to the defined pipeline and produces an output Halide buffer. The scheduling directive (invert.reorder(c, x, y)) ensures that pixel data is computed in an interleaved manner (channel-by-channel per pixel), aligning the resulting data with OpenCV’s expected memory layout for images.
+
+Finally, the processed Halide output buffer is efficiently wrapped in an OpenCV Mat header without copying pixel data. For proper display in OpenCV, which uses BGR channel ordering by default, the code converts the processed image back from RGB to BGR. The program then displays the original and inverted images in separate windows, waiting for a key press before exiting. This approach demonstrates a streamlined integration between Halide for high-performance image processing and OpenCV for convenient input and output operations.
+
+By default, Halide orders loops based on the order of variable declaration. In this example, the original ordering (x, y, c) implies processing the image pixel-by-pixel across all horizontal positions (x), then vertical positions (y), and finally channels (c). This ordering naturally produces a planar memory layout (e.g., processing all red pixels first, then green, then blue).
+
+However, the optimal loop order depends on your intended memory layout and compatibility with external libraries:
+1. Interleaved Layout (RGBRGBRGB…):
+* Commonly used by libraries such as OpenCV.
+* To achieve this, the color channel (c) should be the innermost loop, followed by the horizontal (x) and then the vertical (y) loop.
+
+Specifically, calling:
+```cpp
+invert.reorder(c, x, y);
+```
+
+changes the loop nesting to process each pixel’s channels together (R, G, B for the first pixel, then R, G, B for the second pixel, and so on), resulting in:
+* Better memory locality and cache performance when interfacing with interleaved libraries like OpenCV.
+* Reduced overhead for subsequent image-handling operations (display, saving, or further processing).
+
+By default, OpenCV stores images in interleaved memory layout, using the HWC (Height, Width, Channel) ordering. To correctly represent this data layout in a Halide buffer, you can also explicitly use the Buffer::make_interleaved() method, which ensures the data layout is properly specified. The code snippet would look like this:
+
+```cpp
+// Wrap the OpenCV Mat data in a Halide buffer with interleaved HWC layout.
+Buffer<uint8_t> inputBuffer = Buffer<uint8_t>::make_interleaved(
+ input.data, input.cols, input.rows, input.channels()
+);
+```
+
+2. Planar Layout (RRR...GGG...BBB...):
+* Preferred by certain image-processing routines or hardware accelerators (e.g., some GPU kernels or certain ML frameworks).
+* Achieved naturally by Halide’s default loop ordering (x, y, c).
+
+Thus, it is essential to select loop ordering based on your specific data format requirements and integration scenario. Halide provides full flexibility, allowing you to explicitly reorder loops to match the desired memory layout efficiently.
+
+In Halide, two related but distinct concepts must be kept separate:
+1. Loop execution order (controlled by reorder). Defines the nesting order of loops during computation. For example, to make the channel dimension (c) innermost during computation:
+
+```cpp
+invert.reorder(c, x, y);
+```
+2. Memory storage layout (controlled by reorder_storage). Defines the actual order in which data is stored in memory, such as interleaved or planar:
+
+```cpp
+invert.reorder_storage(c, x, y);
+```
+
+Using only reorder(c, x, y) affects the computational loop order but not necessarily the memory layout. The computed data could still be stored in planar order by default. Using reorder_storage(c, x, y) explicitly defines the memory layout as interleaved.
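+
+Putting the two together for an OpenCV-style interleaved result, a minimal sketch (reusing the names from the example above) might be:
+
+```cpp
+// Compute channels innermost and store them interleaved, so the
+// realized data matches OpenCV's HWC layout directly.
+invert.reorder(c, x, y)
+      .reorder_storage(c, x, y);
+
+// Realize into a pre-allocated interleaved buffer.
+Buffer<uint8_t> outputBuffer =
+    Buffer<uint8_t>::make_interleaved(input.cols, input.rows, input.channels());
+invert.realize(outputBuffer);
+```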
+
+
+## Compilation instructions
+Compile the program as follows (replace /path/to/halide accordingly). On macOS, first point DYLD_LIBRARY_PATH at Halide's lib directory (the directory, not the dylib itself):
+```console
+export DYLD_LIBRARY_PATH=/path/to/halide/lib
+g++ -std=c++17 hello-world.cpp -o hello-world \
+ -I/path/to/halide/include -L/path/to/halide/lib -lHalide \
+ $(pkg-config --cflags --libs opencv4) -lpthread -ldl \
+ -Wl,-rpath,/path/to/halide/lib
+```
+
+Note that, on Linux, you would set LD_LIBRARY_PATH instead:
+```console
+export LD_LIBRARY_PATH=/path/to/halide/lib/
+```
+
+Run the executable:
+```console
+./hello-world
+```
+
+You will see two windows displaying the original and inverted images:
+
+
+
+## Summary
+In this lesson, you’ve learned Halide’s foundational concepts, explored the benefits of separating algorithms and schedules, set up your development environment, and created your first functional Halide application integrated with OpenCV.
+
+While the example introduces the core concepts of Halide pipelines (such as defining computations symbolically and realizing them), it does not yet showcase the substantial benefits of explicitly separating algorithm definition from scheduling strategies.
+
+In subsequent lessons, you’ll explore advanced Halide scheduling techniques, including parallelism, vectorization, tiling, and loop fusion, which will clearly demonstrate the practical advantages of separating algorithm logic from scheduling. These techniques enable fine-grained performance optimization tailored to specific hardware without modifying algorithmic correctness.
+
diff --git a/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
new file mode 100644
index 0000000000..3e561da393
--- /dev/null
+++ b/content/learning-paths/mobile-graphics-and-gaming/android_halide/processing-workflow.md
@@ -0,0 +1,327 @@
+---
+# User change
+title: "Building a Simple Camera Image Processing Workflow"
+
+weight: 3
+
+layout: "learningpathall"
+---
+
+## Objective
+In this section, we will build a real-time camera processing pipeline using Halide. First, we capture video frames from a webcam using OpenCV, then implement a Gaussian blur to smooth the captured images, followed by thresholding to create a clear binary output highlighting prominent image features. After establishing this pipeline, we will optimize performance further by applying Halide’s tiling strategy, a technique that enhances cache efficiency and execution speed, particularly beneficial for high-resolution or real-time applications.
+
+## Gaussian blur and thresholding
+Create a new file named camera-capture.cpp with the following contents:
+```cpp
+#include "Halide.h"
+#include <opencv2/opencv.hpp>
+#include <iostream>
+#include <string>    // For std::string
+#include <cstdint>   // For uint8_t, etc.
+#include <exception> // For std::exception
+
+using namespace cv;
+using namespace std;
+
+// Clamp coordinate within [0, maxCoord - 1].
+static inline Halide::Expr clampCoord(Halide::Expr coord, int maxCoord) {
+ return Halide::clamp(coord, 0, maxCoord - 1);
+}
+
+int main() {
+ // Open the default camera.
+ VideoCapture cap(0);
+ if (!cap.isOpened()) {
+ cerr << "Error: Unable to open camera." << endl;
+ return -1;
+ }
+
+ while (true) {
+ // Capture frame.
+ Mat frame;
+ cap >> frame;
+ if (frame.empty()) {
+ cerr << "Error: Received empty frame." << endl;
+ break;
+ }
+
+ // Convert to grayscale.
+ Mat gray;
+ cvtColor(frame, gray, COLOR_BGR2GRAY);
+ if (!gray.isContinuous()) {
+ gray = gray.clone();
+ }
+
+ int width = gray.cols;
+ int height = gray.rows;
+
+ // Wrap grayscale image into Halide buffer.
+ Halide::Buffer<uint8_t> inputBuffer(gray.data, width, height);
+
+ // Define ImageParam (symbolic representation of input image).
+ Halide::ImageParam input(Halide::UInt(8), 2, "input");
+ input.set(inputBuffer);
+
+ // Define variables representing image spatial coordinates.
+ // "x" for horizontal dimension (width), "y" for vertical dimension (height).
+ // In Halide, it’s a common convention to use short variable names such as
+ // x and y for image coordinates, following standard mathematical practice:
+ // x typically refers to the horizontal spatial dimension (width).
+ // y typically refers to the vertical spatial dimension (height).
+ Halide::Var x("x"), y("y");
+
+ // Kernel layout: [1 2 1; 2 4 2; 1 2 1], sum = 16.
+ int kernel_vals[3][3] = {
+ {1, 2, 1},
+ {2, 4, 2},
+ {1, 2, 1}
+ };
+ Halide::Buffer<int> kernelBuf(&kernel_vals[0][0], 3, 3);
+
+ Halide::RDom r(0, 3, 0, 3);
+ Halide::Func blur("blur");
+ Halide::Expr val = Halide::cast<int32_t>(
+ input(clampCoord(x + r.x - 1, width),
+ clampCoord(y + r.y - 1, height))
+ ) * kernelBuf(r.x, r.y);
+
+ blur(x, y) = Halide::cast<uint8_t>(Halide::sum(val) / 16);
+
+ // Thresholding stage.
+ Halide::Func thresholded("thresholded");
+ thresholded(x, y) = Halide::cast<uint8_t>(
+ Halide::select(blur(x, y) > 128, 255, 0)
+ );
+
+ // Realize pipeline.
+ Halide::Buffer<uint8_t> outputBuffer;
+ try {
+ outputBuffer = thresholded.realize({ width, height });
+ } catch (const std::exception &e) {
+ cerr << "Halide pipeline error: " << e.what() << endl;
+ break;
+ }
+
+ // Wrap output in OpenCV Mat and display.
+ Mat blurredThresholded(height, width, CV_8UC1, outputBuffer.data());
+ imshow("Processed Image", blurredThresholded);
+
+ // Wait for 30 ms (~33 FPS). Exit if any key is pressed.
+ if (waitKey(30) >= 0) {
+ break;
+ }
+ }
+
+ cap.release();
+ destroyAllWindows();
+ return 0;
+}
+```
+
+This code demonstrates a real-time image processing pipeline using Halide and OpenCV. Initially, the default camera is accessed, continuously capturing color video frames. Each captured frame is immediately converted into a grayscale image via OpenCV for simplicity.
+
+Next, the grayscale image is wrapped into a Halide buffer for processing. We define symbolic variables x and y, representing horizontal (width) and vertical (height) image coordinates, respectively.
+
+The pipeline applies a Gaussian blur using a 3×3 kernel explicitly defined in a Halide buffer:
+
+```cpp
+int kernel_vals[3][3] = {
+ {1, 2, 1},
+ {2, 4, 2},
+ {1, 2, 1}
+};
+Halide::Buffer<int> kernelBuf(&kernel_vals[0][0], 3, 3);
+```
+
+Reasons for choosing this kernel:
+* It provides effective smoothing by considering the immediate neighbors of each pixel, making it computationally lightweight yet visually effective.
+* The weights approximate a Gaussian distribution, helping to maintain image details while reducing noise and small variations.
+
+The Gaussian blur calculation uses a Halide reduction domain (RDom) to iterate over the 3×3 neighborhood around each pixel. To handle boundary pixels safely, pixel coordinates are manually clamped within valid bounds:
+
+```cpp
+Halide::Expr val = Halide::cast<int32_t>(
+ input(clampCoord(x + r.x - 1, width),
+ clampCoord(y + r.y - 1, height))
+) * kernelBuf(r.x, r.y);
+
+blur(x, y) = Halide::cast<uint8_t>(Halide::sum(val) / 16);
+```
+
+
+After blurring, the pipeline applies a thresholding operation that converts the blurred grayscale image into a binary image: pixels with intensity above 128 become white (255), while all others become black (0), highlighting prominent features against the background.
+
+The final result is realized by Halide and wrapped directly into an OpenCV matrix (Mat) without extra memory copies, then displayed in real time. The main loop continues processing and displaying frames until any key is pressed, providing an interactive demonstration of Halide’s performance and its seamless integration with OpenCV.
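+
+If you want to gauge per-frame cost, you can wrap the realize call in a simple timer. This is a minimal sketch (add <chrono> to the includes), not part of the original listing:
+
+```cpp
+// Time a single realization of the pipeline.
+auto start = std::chrono::steady_clock::now();
+outputBuffer = thresholded.realize({ width, height });
+auto end = std::chrono::steady_clock::now();
+auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
+cerr << "Frame processed in " << ms << " ms" << endl;
+```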
+
+In the example above, we clamped coordinates manually. An alternative, and often recommended, approach is to use Halide’s built-in boundary-condition helper, BoundaryConditions::repeat_edge. Halide optimizes loops based on the specified boundary conditions, partitioning them so that edge pixels are handled separately, which improves vectorization, parallelization, and overall efficiency.
+
+The alternative implementation could look like this:
+
+```cpp
+// Use Halide's built-in boundary handling instead of manual clamping.
+Halide::Func inputClamped = Halide::BoundaryConditions::repeat_edge(input);
+
+// Offsets around the current pixel.
+Halide::Expr offsetX = x + (r.x - 1);
+Halide::Expr offsetY = y + (r.y - 1);
+
+// Directly use clamped function to safely access pixel values.
+Halide::Expr val = Halide::cast<int32_t>(inputClamped(offsetX, offsetY)) * kernelBuf(r.x, r.y);
+```
+
+Here, we used a fixed array for the kernel. Alternatively, you can define the 3×3 Gaussian blur kernel using the Halide select expression, clearly assigning weights based on pixel positions:
+```cpp
+// Define a reduction domain to iterate over a 3×3 neighborhood
+Halide::RDom r(0, 3, 0, 3);
+
+// Explicitly assign Gaussian kernel weights based on pixel position:
+// - 4 for the center pixel (r.x == 1 && r.y == 1)
+// - 2 for direct horizontal and vertical neighbors (either r.x or r.y is 1 but not both)
+// - 1 for corner (diagonal) neighbors
+Halide::Expr weight = Halide::select(
+ (r.x == 1 && r.y == 1), 4, // center pixel
+ (r.x == 1 || r.y == 1), 2, // direct horizontal or vertical neighbors
+ 1 // diagonal (corner) neighbors
+);
+
+// Apply the kernel weights to the neighborhood pixels
+Halide::Expr val = Halide::cast<int32_t>(
+ input(clampCoord(x + r.x - 1, width),
+ clampCoord(y + r.y - 1, height))
+) * weight;
+
+// Compute blurred pixel value
+blur(x, y) = Halide::cast<uint8_t>(Halide::sum(val) / 16);
+```
+
+This expression explicitly assigns:
+* Weight = 4 for the center pixel (r.x=1, r.y=1)
+* Weight = 2 for direct horizontal and vertical neighbors (r.x=1 or r.y=1 but not both)
+* Weight = 1 for corner pixels (diagonal neighbors)
+
+## Compilation instructions
+Compile the program as follows (replace /path/to/halide accordingly):
+```console
+g++ -std=c++17 camera-capture.cpp -o camera-capture \
+ -I/path/to/halide/include -L/path/to/halide/lib -lHalide \
+ $(pkg-config --cflags --libs opencv4) -lpthread -ldl \
+ -Wl,-rpath,/path/to/halide/lib
+```
+
+Run the executable:
+```console
+./camera-capture
+```
+
+The output should look as in the figure below:
+
+
+## Parallelization and tiling
+In this section, we will explore two complementary scheduling optimizations provided by Halide: Parallelization and Tiling. Both techniques help enhance performance but achieve it through different mechanisms—parallelization leverages multiple CPU cores, whereas tiling improves cache efficiency by optimizing data locality.
+
+Below, we’ll demonstrate each technique separately for clarity and to emphasize their distinct benefits.
+
+### Parallelization
+Parallelization is a scheduling optimization that allows computations to execute simultaneously across multiple CPU cores. By distributing the computational workload across available processing units, Halide effectively reduces the overall execution time, especially beneficial for real-time or computationally intensive image processing tasks.
+
+Let’s first apply parallelization to our existing Gaussian blur and thresholding pipeline:
+
+```cpp
+// Thresholded function (as previously defined)
+Halide::Func thresholded("thresholded");
+thresholded(x, y) = Halide::cast<uint8_t>(Halide::select(blur(x, y) > 128, 255, 0));
+
+// Parallelize the processing across multiple CPU cores
+thresholded.parallel(y);
+```
+
+Here, the parallel(y) directive instructs Halide to parallelize execution along the vertical dimension (y), distributing rows across the available cores of a multicore CPU.
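+
+The size of the thread pool Halide uses for parallel loops can be controlled with the HL_NUM_THREADS environment variable; when it is unset, Halide typically defaults to the number of available cores:
+
+```console
+export HL_NUM_THREADS=4
+./camera-capture
+```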
+
+### Tiling
+Tiling is a scheduling technique that divides computations into smaller, cache-friendly blocks or tiles. This approach significantly enhances data locality, reduces memory bandwidth usage, and leverages CPU caches more efficiently. While tiling can also use parallel execution, its primary advantage comes from optimizing intermediate data storage.
+
+We’ll demonstrate tiling in two scenarios:
+1. Tiling for cache efficiency (with explicit intermediate storage)
+2. Tiling for parallelization (without explicit intermediate storage)
+
+### Tiling for enhanced cache efficiency (explicit intermediate storage)
+When intermediate results between computation stages are temporarily stored, tiling achieves maximum performance gains. Smaller intermediate tiles comfortably fit within CPU caches, greatly improving data locality and minimizing redundant memory access.
+
+Here’s how to explicitly tile the Gaussian blur computation to store intermediate results in tiles:
+
+```cpp
+// Define variables
+Halide::Var x("x"), y("y"), x_outer, y_outer, x_inner, y_inner;
+
+// Define functions
+Halide::Func blur("blur"), thresholded("thresholded");
+
+// Thresholded function definition
+thresholded(x, y) = Halide::cast<uint8_t>(Halide::select(blur(x, y) > 128, 255, 0));
+
+// Apply tiling to divide computation into 64×64 tiles
+thresholded.tile(x, y, x_outer, y_outer, x_inner, y_inner, 64, 64)
+ .parallel(y_outer);
+
+// Compute blur within each tile explicitly to enhance cache efficiency
+blur.compute_at(thresholded, x_outer);
+```
+
+In this scheduling:
+* tile(...) divides the image into smaller blocks (tiles), optimizing cache locality.
+* blur.compute_at(thresholded, x_outer) instructs Halide to explicitly store intermediate blur results per tile, effectively utilizing the CPU’s cache.
+
+This approach reduces memory bandwidth demands, as each tile’s intermediate results remain in cache, greatly accelerating the pipeline for large or complex operations.
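+
+For contrast, computing the intermediate stage over the entire frame before thresholding begins (a minimal sketch) stores the whole blurred image in memory, trading cache locality for scheduling simplicity:
+
+```cpp
+// Compute and store the entire blurred frame before thresholding runs.
+// Simpler to reason about, but the intermediate no longer fits in cache
+// for large frames, increasing memory traffic.
+blur.compute_root();
+```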
+
+Recompile your application as before, then run:
+```console
+./camera-capture
+```
+
+### Tiling for parallelization (without explicit intermediate storage)
+In contrast, tiling can also facilitate parallel execution without explicitly storing intermediate results. This approach mainly leverages tiling to simplify workload partitioning across CPU cores.
+
+Here’s a simple parallel tiling approach for our pipeline:
+
+```cpp
+// Define variables
+Halide::Var x("x"), y("y"), x_outer, y_outer, x_inner, y_inner;
+
+// Thresholded function definition
+Halide::Func thresholded("thresholded");
+thresholded(x, y) = Halide::cast<uint8_t>(Halide::select(blur(x, y) > 128, 255, 0));
+
+// Apply simple tiling schedule to divide workload and parallelize execution
+thresholded.tile(x, y, x_outer, y_outer, x_inner, y_inner, 64, 64)
+ .parallel(y_outer);
+```
+
+Here, the tiling directive primarily divides the workload into manageable segments for parallel execution. While this also improves cache locality indirectly, the absence of explicit intermediate storage means the primary gain is parallel execution rather than direct cache efficiency.
+
+### Tiling vs. parallelization
+* Parallelization directly speeds up computations by distributing workload across CPU cores.
+* Tiling for cache efficiency explicitly stores intermediate results within tiles to maximize cache utilization, greatly reducing memory bandwidth requirements.
+* Tiling for parallelization divides workload into smaller segments, primarily to simplify parallel execution rather than optimize cache usage directly.
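+
+These techniques compose naturally. As a sketch that adds SIMD vectorization (the lane count of 16 is an assumption to tune per CPU) on top of the tiled, parallel schedule:
+
+```cpp
+// Tiles for locality, parallel tiles across cores, SIMD within rows.
+thresholded.tile(x, y, x_outer, y_outer, x_inner, y_inner, 64, 64)
+           .vectorize(x_inner, 16)
+           .parallel(y_outer);
+blur.compute_at(thresholded, x_outer);
+```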
+
+## Summary
+In this section, we built a complete real-time image processing pipeline using Halide and OpenCV. Initially, we captured live video frames and applied Gaussian blur and thresholding to highlight image features clearly. By incorporating Halide’s tiling optimization, we also improved performance by enhancing cache efficiency and parallelizing computation. Through these steps, we demonstrated Halide’s capability to provide both concise, clear code and high performance, making it an ideal framework for demanding real-time image processing tasks.
+