- JavaScript / TypeScript (visual editor, JupyterLab extensions)
- Familiarity with Jupyter notebooks and pipeline concepts
- Experience or interest in working within established UI frameworks

### Project 12: Kubeflow SDK/SparkClient - Batch Jobs, Observability & Production Readiness

**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk) (SparkClient), [kubeflow/spark-operator](https://www.github.com/kubeflow/spark-operator)

**Mentors:** [@shekharrajak](https://github.com/shekharrajak), [@tariq-hasan](https://github.com/tariq-hasan)

**Contributor:**

**Details:**

The Kubeflow SparkClient provides a unified Python API for running Apache Spark workloads on Kubernetes. The current MVP supports interactive Spark Connect sessions (KEP-107), but lacks batch job submission, integration with other Kubeflow SDK clients and Kubeflow components, and production observability features.

This project extends SparkClient to support the complete Spark workflow on Kubernetes:

**1. Batch Job Submission (Core Feature)**

Implement a `submit_job()` API for submitting batch Spark jobs via the SparkApplication CRD:
- Python function mode: Serialize and execute user-defined functions
- Script mode: Submit existing PySpark/Scala scripts
- Job lifecycle: `list_jobs()`, `get_job()`, `get_job_logs()`, `wait_for_job()`, `delete_job()`, and more
- Integration with existing SparkClient patterns (options, validation, error handling)

**2. Observability & Monitoring**

Build monitoring capabilities for production Spark workloads:
- Metrics collection from the Spark REST API (task stats, executor metrics, stage progress)
- Structured event streaming (task completion, failures, stage boundaries)
- Health checking and readiness probes
- Optional Prometheus metrics exporter

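As one illustration of the metrics-collection item, a job-level snapshot could be rolled up from the executor list that Spark's monitoring REST API returns (`GET /api/v1/applications/{app-id}/executors`). The payload below is fabricated sample data, and `summarize_executors` is a hypothetical helper, not part of the SDK; the field names follow Spark's ExecutorSummary schema.

```python
# Fabricated sample of the executor list returned by Spark's REST API;
# real responses carry many more fields per executor.
SAMPLE_EXECUTORS = [
    {"id": "driver", "activeTasks": 0, "completedTasks": 0,
     "failedTasks": 0, "memoryUsed": 52_428_800},
    {"id": "1", "activeTasks": 4, "completedTasks": 120,
     "failedTasks": 1, "memoryUsed": 104_857_600},
    {"id": "2", "activeTasks": 3, "completedTasks": 98,
     "failedTasks": 0, "memoryUsed": 94_371_840},
]

def summarize_executors(executors):
    """Roll per-executor stats into one job-level snapshot (hypothetical)."""
    # The driver also appears in the executor list; exclude it so the
    # snapshot describes worker capacity only.
    workers = [e for e in executors if e["id"] != "driver"]
    return {
        "executor_count": len(workers),
        "active_tasks": sum(e["activeTasks"] for e in workers),
        "completed_tasks": sum(e["completedTasks"] for e in workers),
        "failed_tasks": sum(e["failedTasks"] for e in workers),
        "memory_used_bytes": sum(e["memoryUsed"] for e in workers),
    }

snapshot = summarize_executors(SAMPLE_EXECUTORS)
```

A snapshot like this is also a natural unit to export as Prometheus gauges in the optional exporter.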
**3. Data Transfer & Transformation from Data Lakehouses**

Prototype reading from and connecting to data warehouses and data lakehouses:
- Real-world ETL use cases
- Transforming and enriching the data
- Using Kubeflow components together with the SDK SparkClient

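A transform-and-enrich step of the kind listed above might look like the following plain-Python sketch; in the project it would run as PySpark inside a submitted job against real lakehouse tables. All names, schema fields, and values here are invented for illustration.

```python
# Fabricated raw rows standing in for records read from a lakehouse table.
RAW_EVENTS = [
    {"user_id": "u1", "amount_cents": 1250, "country": "DE"},
    {"user_id": "u2", "amount_cents": 0, "country": "US"},
    {"user_id": "u3", "amount_cents": 4999, "country": "FR"},
]

EU_COUNTRIES = {"DE", "FR"}  # illustrative lookup used for enrichment

def enrich(events):
    """Filter empty orders, convert units, and tag each row with a region."""
    out = []
    for e in events:
        if e["amount_cents"] <= 0:  # drop no-op events
            continue
        out.append({
            **e,
            "amount_eur": e["amount_cents"] / 100,
            "region": "EU" if e["country"] in EU_COUNTRIES else "non-EU",
        })
    return out

enriched = enrich(RAW_EVENTS)
```

In the PySpark version the same filter/derive/tag logic would be expressed as DataFrame operations, with the output written back to a warehouse or lakehouse sink.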
**4. Documentation & Examples**

- API reference documentation (auto-generated)
- Deployment and debugging guides
- Troubleshooting guide
- Example notebooks (Jupyter/Colab)
- Examples of connecting to Spark clusters (EMR, Apache Spark on Kubernetes)

**Technical Architecture:**
- CRD builder for SparkApplication (similar to SparkConnect)
- Reuses existing validation, options, and error handling infrastructure
- Integrates with other SDK clients and Kubeflow components such as Notebooks

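A CRD builder along these lines might assemble a manifest like the one below. The field names follow the spark-operator `v1beta2` SparkApplication schema, but `build_spark_application`, its parameters, and its defaults are assumptions for illustration, not the planned SDK interface.

```python
def build_spark_application(name, namespace, main_file,
                            spark_version="3.5.0", executor_instances=2):
    """Sketch of a SparkApplication manifest builder (hypothetical helper)."""
    # Reuse-style validation, mirroring the existing SparkClient checks.
    if not name or not main_file:
        raise ValueError("name and main_file are required")
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            # Illustrative resource defaults; a real builder would take
            # these from the client's options object.
            "driver": {"cores": 1, "memory": "512m"},
            "executor": {"cores": 1, "memory": "512m",
                         "instances": executor_instances},
        },
    }

manifest = build_spark_application("etl-job", "default",
                                   "local:///opt/app/etl.py")
```

Keeping manifest assembly in one pure function keeps it unit-testable without a cluster, the same property the SparkConnect builder relies on.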
**Community Value:**
- Completes the SparkClient vision from KEP-107
- Enables end-to-end Spark workflows (interactive development, batch jobs, and other use cases)
- Aligns with Kubeflow's mission of simplifying ML infrastructure
- Provides a foundation for future Kubeflow Pipelines integration
- Showcases different ways of using SparkClient with Kubeflow components

**Related Issues/KEPs:**

- [KEP-107: Spark Client](https://github.com/kubeflow/sdk/blob/main/docs/proposals/107-spark-client/README.md)
- [Initial SparkClient version](https://github.com/kubeflow/sdk/pull/225)
- [Spark Operator SparkApplication CRD](https://github.com/kubeflow/spark-operator)

**Difficulty:** Hard

**Size:** 350 hours (Large)

**Skills Required/Preferred:**

- Python (core development)
- Kubernetes (CRDs, API, RBAC)
- Apache Spark (architecture, configuration)
- Testing (unit, integration, E2E)
- Technical Writing (documentation)