- JavaScript / TypeScript (visual editor, JupyterLab extensions)
- Familiarity with Jupyter notebooks and pipeline concepts
- Experience or interest in working within established UI frameworks

### Project 12: Kubeflow SDK/SparkClient - Batch Jobs, Observability & Production Readiness

**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk) (SparkClient), [kubeflow/spark-operator](https://www.github.com/kubeflow/spark-operator)

**Mentors:** [@shekharrajak](https://github.com/shekharrajak), [@tariq-hasan](https://github.com/tariq-hasan)

**Contributor:**

**Details:**

The Kubeflow SparkClient provides a unified Python API for running Apache Spark workloads on Kubernetes. The current MVP supports interactive Spark Connect sessions (KEP-107), but lacks batch job submission, integration with other Kubeflow SDK clients and Kubeflow components, and production observability features.

This project extends SparkClient to support the complete Spark workflow on Kubernetes:

**1. Batch Job Submission (Core Feature)**

Implement a `submit_job()` API for submitting batch Spark jobs via the SparkApplication CRD:
- Python function mode: Serialize and execute user-defined functions
- Script mode: Submit existing PySpark/Scala scripts
- Job lifecycle: `list_jobs()`, `get_job()`, `get_job_logs()`, `wait_for_job()`, `delete_job()`, and more
- Integration with existing SparkClient patterns (options, validation, error handling)

**2. Observability & Monitoring**

Build monitoring capabilities for production Spark workloads:
- Metrics collection from the Spark REST API (task stats, executor metrics, stage progress)
- Structured event streaming (task completion, failures, stage boundaries)
- Health checking and readiness probes
- Optional Prometheus metrics exporter

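As one illustration of the metrics-collection item, a job-level snapshot could be rolled up from the executor list that Spark's monitoring REST API returns (`GET /api/v1/applications/{app-id}/executors`). The payload below is fabricated sample data, and `summarize_executors` is a hypothetical helper, not part of the SDK; the field names follow Spark's ExecutorSummary schema.

```python
# Fabricated sample of the executor list returned by Spark's REST API;
# real responses carry many more fields per executor.
SAMPLE_EXECUTORS = [
    {"id": "driver", "activeTasks": 0, "completedTasks": 0,
     "failedTasks": 0, "memoryUsed": 52_428_800},
    {"id": "1", "activeTasks": 4, "completedTasks": 120,
     "failedTasks": 1, "memoryUsed": 104_857_600},
    {"id": "2", "activeTasks": 3, "completedTasks": 98,
     "failedTasks": 0, "memoryUsed": 94_371_840},
]

def summarize_executors(executors):
    """Roll per-executor stats into one job-level snapshot (hypothetical)."""
    # The driver also appears in the executor list; exclude it so the
    # snapshot describes worker capacity only.
    workers = [e for e in executors if e["id"] != "driver"]
    return {
        "executor_count": len(workers),
        "active_tasks": sum(e["activeTasks"] for e in workers),
        "completed_tasks": sum(e["completedTasks"] for e in workers),
        "failed_tasks": sum(e["failedTasks"] for e in workers),
        "memory_used_bytes": sum(e["memoryUsed"] for e in workers),
    }

snapshot = summarize_executors(SAMPLE_EXECUTORS)
```

A snapshot like this is also a natural unit to export as Prometheus gauges in the optional exporter.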
**3. Data Transfer & Transformation from Data Lakehouses**

Prototype reading from and connecting to data warehouses and data lakehouses:
- Real-world ETL use cases
- Transforming and enriching the data
- Using Kubeflow components together with the SDK SparkClient

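A transform-and-enrich step of the kind listed above might look like the following plain-Python sketch; in the project it would run as PySpark inside a submitted job against real lakehouse tables. All names, schema fields, and values here are invented for illustration.

```python
# Fabricated raw rows standing in for records read from a lakehouse table.
RAW_EVENTS = [
    {"user_id": "u1", "amount_cents": 1250, "country": "DE"},
    {"user_id": "u2", "amount_cents": 0, "country": "US"},
    {"user_id": "u3", "amount_cents": 4999, "country": "FR"},
]

EU_COUNTRIES = {"DE", "FR"}  # illustrative lookup used for enrichment

def enrich(events):
    """Filter empty orders, convert units, and tag each row with a region."""
    out = []
    for e in events:
        if e["amount_cents"] <= 0:  # drop no-op events
            continue
        out.append({
            **e,
            "amount_eur": e["amount_cents"] / 100,
            "region": "EU" if e["country"] in EU_COUNTRIES else "non-EU",
        })
    return out

enriched = enrich(RAW_EVENTS)
```

In the PySpark version the same filter/derive/tag logic would be expressed as DataFrame operations, with the output written back to a warehouse or lakehouse sink.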
**4. Documentation & Examples**

- API reference documentation (auto-generated)
- Deployment and debugging guides
- Troubleshooting guide
- Example notebooks (Jupyter/Colab)
- Examples of connecting to Spark clusters (EMR, Apache Spark on Kubernetes)

**Technical Architecture:**
- CRD builder for SparkApplication (similar to SparkConnect)
- Reuses existing validation, options, and error handling infrastructure
- Integrates with other SDK clients and Kubeflow components such as Notebooks

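A CRD builder along these lines might assemble a manifest like the one below. The field names follow the spark-operator `v1beta2` SparkApplication schema, but `build_spark_application`, its parameters, and its defaults are assumptions for illustration, not the planned SDK interface.

```python
def build_spark_application(name, namespace, main_file,
                            spark_version="3.5.0", executor_instances=2):
    """Sketch of a SparkApplication manifest builder (hypothetical helper)."""
    # Reuse-style validation, mirroring the existing SparkClient checks.
    if not name or not main_file:
        raise ValueError("name and main_file are required")
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            # Illustrative resource defaults; a real builder would take
            # these from the client's options object.
            "driver": {"cores": 1, "memory": "512m"},
            "executor": {"cores": 1, "memory": "512m",
                         "instances": executor_instances},
        },
    }

manifest = build_spark_application("etl-job", "default",
                                   "local:///opt/app/etl.py")
```

Keeping manifest assembly in one pure function keeps it unit-testable without a cluster, the same property the SparkConnect builder relies on.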
**Community Value:**
- Completes the SparkClient vision from KEP-107
- Enables end-to-end Spark workflows (interactive development, batch jobs, and other use cases)
- Aligns with Kubeflow's mission of simplifying ML infrastructure
- Provides a foundation for future Kubeflow Pipelines integration
- Showcases different ways of using SparkClient with Kubeflow components

**Related Issues/KEPs:**

- [KEP-107: Spark Client](https://github.com/kubeflow/sdk/blob/main/docs/proposals/107-spark-client/README.md)
- [Initial SparkClient version](https://github.com/kubeflow/sdk/pull/225)
- [Spark Operator SparkApplication CRD](https://github.com/kubeflow/spark-operator)

**Difficulty:** Hard

**Size:** 350 hours (Large)

**Skills Required/Preferred:**

- Python (core development)
- Kubernetes (CRDs, API, RBAC)
- Apache Spark (architecture, configuration)
- Testing (unit, integration, E2E)
- Technical Writing (documentation)