Commit 822043b

gsoc: Integrate Kubeflow SDK with OpenTelemetry for GSoC 2026 (#4294)
* Add GSoC 2026 project proposal for Kubeflow SDK OpenTelemetry integration
  Signed-off-by: Dhanisha Phadate <dhanisha.6@gmail.com>
* chore: retrigger CI
  Signed-off-by: Dhanisha Phadate <dhanisha.6@gmail.com>
* Clean the file
  Signed-off-by: Dhanisha Phadate <dhanisha.6@gmail.com>
* Refactoring project scope
  Signed-off-by: Dhanisha Phadate <dhanisha.6@gmail.com>

---------

Signed-off-by: Dhanisha Phadate <dhanisha.6@gmail.com>

1 parent f3ad458 commit 822043b

1 file changed: +66 -1 lines changed

content/en/events/upcoming-events/gsoc-2026.md

Lines changed: 66 additions & 1 deletion
@@ -298,6 +298,70 @@ Tracking issue: https://github.com/kubeflow/sdk/issues/238
 - Understanding of the Kubeflow Ecosystem and basic Kubernetes concepts.
 - Engage and contribute to Kubeflow community on Slack and GitHub.

+### Project 7 : Integrate Kubeflow SDK with OpenTelemetry
+
+**Components:** [kubeflow/sdk](https://www.github.com/kubeflow/sdk)
+
+**Mentors:** [@kramaranya](https://github.com/kramaranya), [@dhanishaphadate](https://github.com/dhanishaphadate), [@jaiakash](https://www.github.com/jaiakash)
+
+**Contributor:**
+
+**Details:**
+
+The Kubeflow SDK enables users with limited Kubernetes knowledge to interact with the Kubeflow ecosystem using standard Python APIs. As AI/ML workloads become more complex and distributed, observability into pipeline execution, model training, and inference workflows becomes critical.
+
+This project aims to integrate the Kubeflow SDK with OpenTelemetry (OTel) to provide standardized, vendor-neutral telemetry for Kubeflow-based workloads. The integration will enable end-to-end visibility into SDK operations by capturing distributed traces, metrics, and logs across pipeline compilation, submission, execution, and training lifecycles.
+
+The project will also explore leveraging existing OpenTelemetry and Generative AI instrumentation patterns—such as span conventions for model execution, prompt handling, and inference steps—where applicable.
+
+[Kubeflow SDK Documentation](https://sdk.kubeflow.org/en/latest/index.html)
+
+[Opentelemetry for genAI](https://opentelemetry.io/blog/2024/otel-generative-ai/)
+
+[Issue](https://github.com/kubeflow/sdk/issues/164)
+
+**Features Expected:**
+
+- Add OpenTelemetry instrumentation to key Kubeflow SDK components
+- Enable distributed tracing for pipeline execution and SDK operations
+- Collect and export metrics related to AI/ML workloads
+- Provide configurable OTel exporters and sampling options
+- Documentation and examples demonstrating observability setup and usage
+- cover below SDK clients
+
+| SDK Client | Component |
+|------------|-----------|
+| `TrainerClient` | Kubeflow Trainer |
+| `PipelinesClient` | Kubeflow Pipelines |
+
+```
+kubeflow/
+├── trainer/ # TrainerClient - distributed training & fine-tuning
+├── optimizer/ # OptimizerClient - Katib AutoML & hyperparameter tuning
+├── hub/ # ModelRegistryClient - model artifact management
+└── common/ # Shared utilities across clients
+```
+
+Bonus requirement to complete
+| SDK Client | Component |
+|------------|-----------|
+| `OptimizerClient` | Kubeflow Katib |
+| `ModelRegistryClient` | Model Registry |
+| `SparkClient` | Spark Operator |
+
+
+**Difficulty:** [intermediate|hard]
+
+**Size:** [350 hours]
+
+**Skills Required/Preferred:**
+
+- Python
+- Understanding of the Kubeflow Ecosystem (preferred)
+- OpenTelemetry (tracing, metrics, logging)
+- Distributed systems and observability concepts
+- Kubernetes and CRDs
+
 ### Project 10: Dynamic LLM Trainer Framework for Kubeflow

 **Components:**
@@ -399,6 +463,7 @@ Notebook workflows are commonly split across multiple files. Without visual comp
 - JavaScript / TypeScript (visual editor, JupyterLab extensions)
 - Familiarity with Jupyter notebooks and pipeline concepts
 - Experience or interest in working within established UI frameworks
+----

 ### Project 12: Kubeflow SDK/SparkClient - Batch Jobs, Observability & Production Readiness

@@ -473,4 +538,4 @@ Tryout reading and connecting to data warehouse and data lakehouse:
 - Kubernetes (CRDs, API, RBAC)
 - Apache Spark (Architecture, Configuration)
 - Testing (Unit, Integration, E2E)
-- Technical Writing (Documentation)
+- Technical Writing (Documentation)
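
The Project 7 proposal in this diff asks for OpenTelemetry instrumentation of SDK client operations. As a rough illustration only, the sketch below shows one way such instrumentation could be layered on with a decorator, treating the OpenTelemetry API as an optional dependency. The helper `traced`, the class `FakeTrainerClient`, the tracer name `kubeflow.sdk.example`, and the span and attribute names are all hypothetical and are not part of the Kubeflow SDK or of this commit; only `trace.get_tracer`, `start_as_current_span`, and `set_attribute` come from the real OpenTelemetry Python API.

```python
"""Hypothetical tracing helper for SDK client methods (illustrative only)."""
import functools

try:
    # Real OpenTelemetry API package, treated here as an optional dependency.
    from opentelemetry import trace

    # The instrumentation name is an assumption made for this example.
    _tracer = trace.get_tracer("kubeflow.sdk.example")
except ImportError:
    # Without the package installed, decorated functions run untraced.
    _tracer = None


def traced(span_name, **static_attributes):
    """Run the wrapped function inside a span when OpenTelemetry is available."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if _tracer is None:
                return func(*args, **kwargs)
            # start_as_current_span records exceptions on the span by default.
            with _tracer.start_as_current_span(span_name) as span:
                for key, value in static_attributes.items():
                    span.set_attribute(key, value)
                return func(*args, **kwargs)

        return wrapper

    return decorator


class FakeTrainerClient:
    """Stand-in for an SDK client; this is NOT the real TrainerClient API."""

    @traced("trainer.submit_job", component="trainer")
    def submit_job(self, name):
        return f"submitted:{name}"


if __name__ == "__main__":
    print(FakeTrainerClient().submit_job("demo"))
```

With only the API package installed and no SDK configured, the spans are non-recording, so the decorator adds little overhead until an application sets up a tracer provider and exporter. Whether the actual project adopts a decorator, explicit context managers, or another approach is left open by the proposal.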
