
Commit 168a82a

gsoc: add kubeflow sdk mcp as gsoc 2026 project idea (#4290)
Signed-off-by: Akash Jaiswal <akashjaiswal3846@gmail.com>
1 parent 94ce4f2 commit 168a82a

1 file changed: +40 -0 lines changed

content/en/events/upcoming-events/gsoc-2026.md

Lines changed: 40 additions & 0 deletions
@@ -257,3 +257,43 @@ This will therefore also include working with maintainers of other components su
- GitHub Actions
- Bash
- Community Coordination

### Project 6: MCP Server for Kubeflow SDK

**Components:** [kubeflow/sdk](https://github.com/kubeflow/sdk), [kubeflow/trainer](https://github.com/kubeflow/trainer)

**Mentors:** [@jaiakash](https://github.com/jaiakash), [@dhanishaphadate](https://github.com/dhanishaphadate), [@abhijeet-dhumal](https://github.com/abhijeet-dhumal)

**Contributor:** [TBD]

**Details:**
The Kubeflow SDK lets users with limited Kubernetes knowledge interact with the Kubeflow ecosystem through standard Python APIs. Documentation: https://sdk.kubeflow.org/en/latest/index.html

Many of us use LLMs to write and debug code for jobs, models, and so on, but an LLM currently has no way to see TrainJob status, debug a crash loop, or provide consolidated metrics about previous tasks. We want to improve the developer experience (DX) by building a Model Context Protocol (MCP) server for the Kubeflow ecosystem.

A community issue, [kubeflow/community#936](https://github.com/kubeflow/community/issues/936), and an MVP already exist for this project. The contributor will extend the MCP server to cover additional use cases, improve error handling, add comprehensive documentation, and potentially integrate with other Kubeflow components such as Model Registry.
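
To make the error-handling goal concrete, here is a minimal sketch of how a server might dispatch an MCP `tools/call` request and report tool failures as structured results rather than crashing. It uses only the Python standard library. The in-memory `JOBS` store, the `tool` decorator, and `handle_tools_call` are hypothetical stand-ins, not the MVP's code and not the Kubeflow SDK API; only the tool name `get_training_job` comes from this proposal.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical in-memory stand-in for cluster state; a real server would
# query the cluster through the Kubeflow SDK instead.
JOBS: Dict[str, Dict[str, Any]] = {
    "llama-ft-1": {"name": "llama-ft-1", "status": "Running"},
}

# Registry mapping tool names to plain Python callables.
TOOLS: Dict[str, Callable[..., Any]] = {}


def tool(func: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as a callable tool under its own name."""
    TOOLS[func.__name__] = func
    return func


@tool
def get_training_job(name: str) -> Dict[str, Any]:
    """Return the stored record for one job, or raise if it is unknown."""
    if name not in JOBS:
        raise LookupError(f"TrainJob '{name}' not found")
    return JOBS[name]


def handle_tools_call(request: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch a JSON-RPC 'tools/call' request and wrap the outcome."""
    params = request.get("params", {})
    name = params.get("name")
    arguments = params.get("arguments", {})
    base = {"jsonrpc": "2.0", "id": request.get("id")}

    # An unknown tool is a protocol-level problem: answer with a JSON-RPC
    # error using the standard "Invalid params" code.
    if name not in TOOLS:
        return {**base, "error": {"code": -32602, "message": f"Unknown tool: {name}"}}

    # A failure inside the tool is reported in the result with isError=True,
    # so the LLM can read the message and react to it.
    try:
        text, is_error = json.dumps(TOOLS[name](**arguments)), False
    except Exception as exc:  # broad on purpose: every failure becomes text
        text, is_error = f"{type(exc).__name__}: {exc}", True
    return {
        **base,
        "result": {"content": [{"type": "text", "text": text}], "isError": is_error},
    }
```

Separating protocol errors from tool errors is a design choice in this sketch: a missing TrainJob is something the LLM can recover from (for example by listing jobs first), so it is returned as readable text instead of a transport failure.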

**Core Deliverables:**

- MCP tools for TrainJob lifecycle (`fine_tune`, `get_training_job`, `list_training_jobs`, `delete_training_job`)
- Pre-flight validation (`get_cluster_resources`, `estimate_resources`, `check_training_prerequisites`)
- Job observability (`get_training_logs`, `get_job_events`)
- Storage setup (`setup_training_storage`)
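
The pre-flight idea can be illustrated with a small, self-contained sketch: estimate what a job needs, then compare that with what the cluster reports as free. The function names mirror the deliverables above, but the signatures, the `Resources` type, and the sizing heuristic (about 16 bytes per parameter for mixed-precision Adam, ignoring activations) are assumptions for illustration, not the MVP's implementation.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Resources:
    """GPU count and memory per GPU in GiB (hypothetical shape)."""
    gpus: int
    gpu_memory_gib: float


def estimate_resources(num_params_billion: float, gpu_memory_gib: float = 80.0,
                       bytes_per_param: int = 16) -> Resources:
    """Rough lower bound on GPUs needed to hold model and optimizer state.

    bytes_per_param=16 is an assumed rule of thumb for mixed-precision Adam;
    activations and framework overhead are ignored.
    """
    total_gib = num_params_billion * 1e9 * bytes_per_param / 2**30
    gpus = max(1, math.ceil(total_gib / gpu_memory_gib))
    return Resources(gpus=gpus, gpu_memory_gib=gpu_memory_gib)


def check_training_prerequisites(needed: Resources, available: Resources) -> List[str]:
    """Return human-readable problems; an empty list means the check passed."""
    problems: List[str] = []
    if needed.gpus > available.gpus:
        problems.append(
            f"need {needed.gpus} GPUs, but only {available.gpus} are free"
        )
    if needed.gpu_memory_gib > available.gpu_memory_gib:
        problems.append(
            f"need {needed.gpu_memory_gib:g} GiB per GPU, "
            f"but nodes offer {available.gpu_memory_gib:g} GiB"
        )
    return problems
```

Under this heuristic a 7-billion-parameter model on 80 GiB GPUs needs at least 2 GPUs, so a cluster with a single free GPU would fail the check before any job is submitted. Returning a list of messages rather than a boolean gives the LLM something specific to relay to the user.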

**Stretch Goals:**
- Policy-based access control (persona-based RBAC)
- Custom trainer support (`run_custom_training`, `run_container_job`)
- Integration with Model Registry MCP catalog
- Progress tracking (pending [KEP-937](https://github.com/kubeflow/community/pull/937))
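
One way to picture the persona-based access control stretch goal is a policy table that maps each persona to the tools it may call, checked before a tool is dispatched. The persona names and the policy contents here are invented for illustration; only the tool names come from this proposal.

```python
from typing import Dict, FrozenSet

READ_ONLY: FrozenSet[str] = frozenset({
    "get_training_job", "list_training_jobs", "get_training_logs", "get_job_events",
})

# Hypothetical personas; a real policy would likely be loaded from configuration.
POLICY: Dict[str, FrozenSet[str]] = {
    "viewer": READ_ONLY,
    "data_scientist": READ_ONLY | {"fine_tune", "estimate_resources"},
    "admin": READ_ONLY | {
        "fine_tune", "estimate_resources",
        "delete_training_job", "setup_training_storage",
    },
}


def is_allowed(persona: str, tool_name: str) -> bool:
    """Deny by default: unknown personas and unlisted tools are rejected."""
    return tool_name in POLICY.get(persona, frozenset())


def authorize(persona: str, tool_name: str) -> None:
    """Raise PermissionError when the persona may not call the tool."""
    if not is_allowed(persona, tool_name):
        raise PermissionError(f"persona '{persona}' may not call '{tool_name}'")
```

The deny-by-default lookup means a newly added tool is unavailable to every persona until someone lists it explicitly, which is usually the safer failure mode for destructive operations such as `delete_training_job`.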

Tracking issue: https://github.com/kubeflow/sdk/issues/238

**Difficulty:** Medium

**Size:** 175 hours (Medium)

**Skills Required/Preferred:**
- Experience with LLM / MCP development.
- Familiarity with the Kubeflow SDK and Trainer codebase.
- Understanding of the Kubeflow ecosystem and basic Kubernetes concepts.
- Willingness to engage with and contribute to the Kubeflow community on Slack and GitHub.
