Commit f3ad458

gsoc: SDK/SparkClient project (#4296)
Signed-off-by: shekharrajak <shekharrajak@live.com>
1 parent 2cd3ff7 commit f3ad458

1 file changed

+75
-0
lines changed

content/en/events/upcoming-events/gsoc-2026.md

@@ -399,3 +399,78 @@ Notebook workflows are commonly split across multiple files. Without visual comp
- JavaScript / TypeScript (visual editor, JupyterLab extensions)
- Familiarity with Jupyter notebooks and pipeline concepts
- Experience or interest in working within established UI frameworks

### Project 12: Kubeflow SDK/SparkClient - Batch Jobs, Observability & Production Readiness

**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk) (SparkClient), [kubeflow/spark-operator](https://www.github.com/kubeflow/spark-operator)

**Mentors:** [@shekharrajak](https://github.com/shekharrajak), [@tariq-hasan](https://github.com/tariq-hasan)

**Contributor:**

**Details:**

The Kubeflow SparkClient provides a unified Python API for running Apache Spark workloads on Kubernetes. The current MVP supports interactive Spark Connect sessions (KEP-107) but lacks batch job submission, integration with the other Kubeflow SDK clients and Kubeflow components, and production-grade observability features.

This project extends SparkClient to support the complete Spark workflow on Kubernetes:

**1. Batch Job Submission (Core Feature)**

Implement a `submit_job()` API for submitting batch Spark jobs via the SparkApplication CRD:
- Python function mode: Serialize and execute user-defined functions
- Script mode: Submit existing PySpark/Scala scripts
- Job lifecycle: `list_jobs()`, `get_job()`, `get_job_logs()`, `wait_for_job()`, `delete_job()`, and more
- Integration with existing SparkClient patterns (options, validation, error handling)
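
As a rough sketch of what a `submit_job()` call might produce under the hood, the helper below builds a SparkApplication manifest in the shape the spark-operator expects. The top-level field names follow the `sparkoperator.k8s.io/v1beta2` CRD, but the `build_spark_application` function, its parameters, and its defaults are illustrative assumptions, not the SDK's actual API:

```python
from typing import Any, Dict


def build_spark_application(
    name: str,
    main_file: str,
    namespace: str = "default",
    spark_version: str = "3.5.0",
    executor_instances: int = 2,
) -> Dict[str, Any]:
    """Build a SparkApplication manifest for the spark-operator.

    Field names follow the sparkoperator.k8s.io/v1beta2 CRD;
    this helper itself is a hypothetical sketch, not SDK code.
    """
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            "driver": {"cores": 1, "memory": "512m"},
            "executor": {
                "cores": 1,
                "memory": "512m",
                "instances": executor_instances,
            },
        },
    }


# A submit_job() implementation would create this object via the
# Kubernetes API; here we only construct the manifest.
manifest = build_spark_application("etl-job", "local:///app/etl.py")
```

The same builder pattern (a pure function from user options to a CRD dict) is what makes the validation and error-handling layers easy to reuse and unit-test.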

**2. Observability & Monitoring**

Build monitoring capabilities for production Spark workloads:
- Metrics collection from the Spark REST API (task stats, executor metrics, stage progress)
- Structured event streaming (task completion, failures, stage boundaries)
- Health checking and readiness probes
- Optional Prometheus metrics exporter
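
One way the metrics-collection piece could work is to aggregate the executor payloads returned by Spark's monitoring REST API (`/api/v1/applications/{app-id}/executors`). The field names below (`memoryUsed`, `failedTasks`, `completedTasks`) are fields of that endpoint, but the `summarize_executors` helper and the sample payload are a hypothetical sketch:

```python
from typing import Dict, List


def summarize_executors(executors: List[Dict]) -> Dict[str, int]:
    """Aggregate per-executor stats into cluster-level totals,
    as a metrics collector might before exporting to Prometheus."""
    return {
        "executors": len(executors),
        "memory_used": sum(e.get("memoryUsed", 0) for e in executors),
        "failed_tasks": sum(e.get("failedTasks", 0) for e in executors),
        "completed_tasks": sum(e.get("completedTasks", 0) for e in executors),
    }


# Sample records in the shape returned by the executors endpoint.
sample = [
    {"id": "driver", "memoryUsed": 1024, "failedTasks": 0, "completedTasks": 0},
    {"id": "1", "memoryUsed": 2048, "failedTasks": 1, "completedTasks": 41},
]
summary = summarize_executors(sample)
```

A real collector would poll this endpoint on an interval and emit the totals as gauges/counters; the aggregation step stays the same.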

**3. Data Transfer & Transformation from the Data Lakehouse**

Explore reading from and connecting to data warehouses and data lakehouses:
- Real-world ETL job use cases
- Transforming and enriching the data
- Using Kubeflow components together with the SDK SparkClient
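
A minimal illustration of the transform-and-enrich step, written here over plain Python dict records so it stands alone (in a real job the same logic would be PySpark DataFrame operations; the record fields and `enrich_orders` helper are made up for the example):

```python
from typing import Dict, List


def enrich_orders(orders: List[Dict], fx_rates: Dict[str, float]) -> List[Dict]:
    """Enrich raw order records with a USD amount, dropping rows
    whose currency has no known rate -- a typical ETL cleansing step."""
    enriched = []
    for order in orders:
        rate = fx_rates.get(order["currency"])
        if rate is None:
            continue  # unknown currency: filter the record out
        enriched.append({**order, "amount_usd": round(order["amount"] * rate, 2)})
    return enriched


rows = [
    {"id": 1, "amount": 100.0, "currency": "EUR"},
    {"id": 2, "amount": 250.0, "currency": "XXX"},  # no rate -> dropped
]
clean = enrich_orders(rows, {"EUR": 1.1})
```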

**4. Documentation & Examples**

- API reference documentation (auto-generated)
- Deployment and debugging guides
- Troubleshooting guide
- Example notebooks (Jupyter/Colab)
- Examples of connecting to Spark clusters (EMR, Apache Spark on Kubernetes)

**Technical Architecture:**
- CRD builder for SparkApplication (similar to SparkConnect)
- Reuses existing validation, options, and error-handling infrastructure
- Works with other SDK clients and Kubeflow components, such as Notebooks

**Community Value:**
- Completes the SparkClient vision from KEP-107
- Enables end-to-end Spark workflows (interactive development, batch submission, and other use cases)
- Aligns with Kubeflow's mission of simplifying ML infrastructure
- Provides a foundation for future Kubeflow Pipelines integration
- Showcases different ways of using SparkClient with Kubeflow components

**Related Issues/KEPs:**

- [KEP-107: Spark Client](https://github.com/kubeflow/sdk/blob/main/docs/proposals/107-spark-client/README.md)
- [Initial SparkClient version](https://github.com/kubeflow/sdk/pull/225)
- [Spark Operator SparkApplication CRD](https://github.com/kubeflow/spark-operator)

**Difficulty:** Hard

**Size:** 350 hours (Large)

**Skills Required/Preferred:**

- Python (core development)
- Kubernetes (CRDs, API, RBAC)
- Apache Spark (architecture, configuration)
- Testing (unit, integration, E2E)
- Technical Writing (documentation)
