gsoc: Add End-to-End ARM64 Support & Validation on Kubeflow for GSoC 2026 proposal (#4299)
* gsoc: Add a new proposal for End-to-End Arm64 support and validation on Kubeflow into GSOC 2026
Add a new proposal for End-to-End Arm64 support and validation on Kubeflow into GSOC 2026
Signed-off-by: Jeffery T. (mrdojojo) <113143099+jtu-ampere@users.noreply.github.com>
Co-Authored-By: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Re-trigger CI
Signed-off-by: Jeffery T. (mrdojojo) <113143099+jtu-ampere@users.noreply.github.com>
* Re-trigger CI
Signed-off-by: jtu-ampere <113143099+jtu-ampere@users.noreply.github.com>
Signed-off-by: Jeffery T. (mrdojojo) <113143099+jtu-ampere@users.noreply.github.com>
* Re-trigger CI
Signed-off-by: Jeffery T. (mrdojojo) <113143099+jtu-ampere@users.noreply.github.com>
---------
Signed-off-by: Jeffery T. (mrdojojo) <113143099+jtu-ampere@users.noreply.github.com>
Signed-off-by: jtu-ampere <113143099+jtu-ampere@users.noreply.github.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
As development teams increasingly move to Apple Silicon ([M-series chips](https://en.wikipedia.org/wiki/Apple_silicon#M-series_SoCs)) and production workloads shift to cost-efficient ARM-based cloud instances (such as [OCI Ampere](https://www.oracle.com/cloud/compute/arm/), [Google Axion](https://cloud.google.com/products/axion), and [AWS Graviton](https://aws.amazon.com/ec2/graviton/)), ARM64 support is a critical requirement for the future of Kubeflow.
Currently, support is fragmented. The ARM Contributions Team aims to close this gap by establishing First-Class Citizen support for ARM64 across the entire Kubeflow Reference Platform. This initiative is not just about compiling binaries; it is about validating the "Kubeflow Platform" experience to ensure it is robust, reproducible, and ready for diverse environments.
**Strategic Alignment:**
This work directly supports the Kubeflow Platform Definition. By validating the end-to-end platform on non-x86 architectures, the team serves as a critical quality gate, ensuring that "Kubeflow" remains a consistent standard regardless of the underlying hardware.
**Collaboration & History:**
This project builds upon the extensive groundwork laid by the ARM Support Team, who have previously validated and built many of these images. The goal is to upstream this foundational work—porting validated Dockerfiles, build flags, and image tags into the official Kubeflow repositories—effectively making the community's "best effort" success the official standard.
**Scope & Deliverables:**
1. Multi-Arch Build System (CI/CD)
**Audit & Standardization:** The team will identify every container image in the official kubeflow/manifests release that lacks an ARM64 variant.
**Pipeline Implementation:** Update build systems (GitHub Actions/Prow) to generate multi-arch manifests (AMD64/ARM64) automatically on release. The goal is a single tag (e.g., :v2.0.0) that pulls the correct image for the host architecture.
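The audit and the single-tag goal both come down to inspecting an image index (manifest list): each entry carries a `platform` field, and a client selects the entry that matches its host. The following is a minimal sketch, assuming the index JSON has already been fetched (for example with `docker buildx imagetools inspect --raw <image>`). The sample index, its placeholder digests, and the function names are hypothetical and are not part of any Kubeflow tooling.

```python
import json

# Hypothetical image index with placeholder digests, shaped like an
# OCI image index / Docker manifest list.
SAMPLE_INDEX = json.loads("""
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.oci.image.index.v1+json",
  "manifests": [
    {"digest": "sha256:aaaa", "platform": {"os": "linux", "architecture": "amd64"}},
    {"digest": "sha256:bbbb", "platform": {"os": "linux", "architecture": "arm64"}}
  ]
}
""")

# Platforms every released image is expected to provide (an assumption
# taken from the AMD64/ARM64 goal stated above).
REQUIRED = (("linux", "amd64"), ("linux", "arm64"))


def platforms(index):
    """Return the set of (os, architecture) pairs published in an index."""
    return {
        (m["platform"]["os"], m["platform"]["architecture"])
        for m in index.get("manifests", [])
        if "platform" in m
    }


def missing_platforms(index, required=REQUIRED):
    """Audit step: list the required platforms the index does not provide."""
    have = platforms(index)
    return [p for p in required if p not in have]


def resolve_digest(index, os_name, arch):
    """Selection step: return the digest for one platform, or None."""
    for m in index.get("manifests", []):
        p = m.get("platform", {})
        if p.get("os") == os_name and p.get("architecture") == arch:
            return m["digest"]
    return None
```

An image that reports a non-empty `missing_platforms` result would be a candidate for the audit list. A tag that resolves to a single-platform manifest rather than an index has no `manifests` array at all, so it would be reported as missing every required platform.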
2. Platform Manifest Validation
**Architecture Agnosticism:** Ensure that the official Kustomize manifests do not hardcode architecture-specific image digests (SHA hashes) or incompatible image tags, so that they apply cleanly regardless of the node architecture.
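One way to approach this check is to scan rendered manifests (for example the output of `kustomize build`) for image references pinned by digest. The sketch below is an illustration under that assumption. The regular expression and the function name are hypothetical, and a match is only a candidate for review: a digest that points to a multi-arch image index is fine, while one that points to a single-platform manifest is not.

```python
import re

# Matches lines such as "image: repo/name@sha256:<64 hex chars>",
# optionally prefixed by a YAML list dash and optionally quoted.
IMAGE_PIN = re.compile(
    r"""^\s*(?:-\s*)?image:\s*["']?([^\s"']+@sha256:[0-9a-f]{64})["']?\s*$"""
)


def find_digest_pins(rendered_yaml):
    """Return every digest-pinned image reference found in the YAML text."""
    pins = []
    for line in rendered_yaml.splitlines():
        match = IMAGE_PIN.match(line)
        if match:
            pins.append(match.group(1))
    return pins


# Hypothetical rendered manifest fragment, used only for illustration.
EXAMPLE = (
    "containers:\n"
    "- name: pinned\n"
    "  image: example.com/app@sha256:" + "a" * 64 + "\n"
    "- name: tagged\n"
    "  image: example.com/other:v1.0.0\n"
)
```

Each reference returned by `find_digest_pins(EXAMPLE)` would then be inspected against the registry to confirm whether its digest resolves to an index that includes ARM64.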
3. Infrastructure: Cloud & Edge
**Cloud Validation (OCI):** Leverage Oracle Cloud Infrastructure (OCI) Ampere A1 instances to maintain a persistent "Golden" test environment.
**Stretch Goal (On-Premise & Edge Demonstration):** A key stretch goal for this team is to demonstrate Kubeflow running on on-premise ARM hardware.
**The "Why":** This serves as the ultimate proof of Kubeflow's portability. By successfully deploying to an edge environment (outside of managed cloud services), we demonstrate that Kubeflow is truly infrastructure-agnostic and ready for Edge AI use cases.
4. End-to-End (E2E) Platform QA
**Full Suite Testing:** Run the full Kubeflow End-to-End test suite on ARM infrastructure to catch architecture-specific bugs (e.g., generic libc dependencies, JIT compiler issues in TensorFlow/PyTorch).
**Documentation & "Golden Data":** Generate a "Golden Data" set of known-good configurations for running Kubeflow on ARM. This includes documentation on "gotchas" for users running local development clusters on Apple Silicon (Kind/Minikube).
**Difficulty:** Medium/Hard (depends on CI/CD complexity)
**Size:** 350 hours
**Tracking & References:**
* KFP Issue: [Build and publish ARM images for KFP #10309](https://github.com/kubeflow/pipelines/issues/10309)
* Manifests Issue: [Support for the aarch64 architecture #2745](https://github.com/kubeflow/manifests/issues/2745)
**Team Capabilities & Stack:**
* Docker/Containerization: Deep understanding of multi-arch builds (docker buildx, manifests).