Reference: GSoC 2026 Agentic RAG Specification - Phase 1: Ingestion & Foundation
The Problem
Today, the docs-agent can answer questions about Kubeflow's prose documentation. The existing KFP pipeline (pipelines/kubeflow-pipeline.py) crawls kubeflow/website for Markdown files and was recently extended in #8 to also ingest GitHub Issues and their comments, giving the agent access to real-world troubleshooting threads. (The issues-ingest component is not yet wired into the pipeline; the PR fixing that is awaiting review.) Both data sources are stored in Milvus.
But Kubeflow users frequently ask questions whose answers live in code, not documentation:
"What are the default resource limits for the Notebook Controller?"
"How is the Katib controller's RBAC configured?"
"What environment variables does the KFP API server expect?"
The answers to all of these exist in the actual YAML manifests, Python source files, and Kustomize overlays inside repositories like kubeflow/manifests, kubeflow/pipelines, kubeflow/notebooks, etc. These files have never been ingested into the vector database.
The natural evolution of the ingestion layer is: Docs → Issues → Code. With documentation and issues already covered, this proposal describes the next step: a new Kubeflow Pipeline that ingests code repositories - starting with kubeflow/manifests - and makes them searchable by the docs-agent.
Why the Existing Pipeline Cannot be Reused
The existing documentation pipeline uses RecursiveCharacterTextSplitter from LangChain. This chunker works by blindly slicing text at a fixed character count (e.g., every 1000 characters), with some overlap.
This works perfectly for English prose. A paragraph that gets split across two chunks is still mostly readable.
But it destroys code.
Example: What Happens When You Naively Chunk YAML
Consider this simplified multi-resource YAML file from kubeflow/manifests:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notebook-controller
  namespace: kubeflow
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: manager
        image: kubeflownotebookswg/notebook-controller
        resources:
          limits:
            cpu: "1"
            memory: "1Gi"
        env:
        - name: CLUSTER_DOMAIN
          value: cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: notebook-controller-service
  namespace: kubeflow
spec:
  ports:
  - port: 443
```

If a naive 1000-character chunker splits this file, it will produce something like:
| Chunk | Content |
|---|---|
| Chunk 1 | Lines 1–15: The Deployment header, metadata, and the start of resources: |
| Chunk 2 | Lines 16–30: limits: cpu: "1", the env vars, the --- separator, and the start of the Service |
Chunk 2 is now broken. It contains `limits: cpu: "1"` but has completely lost the context that this belongs to a Deployment named `notebook-controller`. Worse, the chunk also mixes the tail of the Deployment with the head of a completely unrelated Service resource.
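This failure mode is easy to reproduce. The sketch below applies a naive fixed-width splitter - a stand-in for character-based splitting like RecursiveCharacterTextSplitter, with the chunk size shrunk to 200 so this short sample actually splits - to the example file. The chunk that contains the CPU limit no longer contains `kind: Deployment`:

```python
# Naive fixed-width chunking of the example manifest (no YAML awareness).
# Chunk size reduced to 200 chars so the short sample splits at all.
YAML_TEXT = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notebook-controller
  namespace: kubeflow
spec:
  replicas: 1
  template:
    spec:
      containers:
      - name: manager
        image: kubeflownotebookswg/notebook-controller
        resources:
          limits:
            cpu: "1"
            memory: "1Gi"
        env:
        - name: CLUSTER_DOMAIN
          value: cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: notebook-controller-service
  namespace: kubeflow
spec:
  ports:
  - port: 443
"""

def naive_chunks(text: str, size: int = 200) -> list[str]:
    """Slice text at a fixed character count, ignoring all structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

chunks = naive_chunks(YAML_TEXT)
# The chunk holding the resource limits has lost its owning Deployment:
limit_chunk = next(c for c in chunks if 'cpu: "1"' in c)
assert "kind: Deployment" not in limit_chunk  # context is gone
```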
AST-Aware Chunking
AST stands for **Abstract Syntax Tree**: a tree-shaped representation of a file's structure, not just its raw text.
Instead of slicing at character boundaries, an AST-aware chunker understands the grammar of the file format:
- For YAML, the structure is the indentation hierarchy and the `---` document separator.
- For Python, the structure is classes, functions, and their docstrings.
- For Kustomize, the structure is the `kustomization.yaml` that references bases, overlays, and patches.
How AST-Aware Chunking Works for YAML
- Parse the file using a structure-aware YAML library like `ruamel.yaml`.
- Split at resource boundaries. Every `---` separator in a multi-document YAML file marks a distinct Kubernetes resource. Each resource becomes its own chunk.
- Extract metadata. From each parsed resource, pull out `kind`, `metadata.name`, and `metadata.namespace` to store alongside the chunk. This metadata is critical for the retriever to filter and rank results.
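The steps above can be sketched in a few lines. This is a minimal, hypothetical stand-in: it splits at `---` boundaries with a regex and extracts the metadata with a simple line scan, whereas the real component would parse each document with `ruamel.yaml` as described.

```python
import re

def split_yaml_documents(text: str) -> list[str]:
    """Split a multi-document YAML file at `---` document separators."""
    return [d.strip() for d in re.split(r"(?m)^---\s*$", text) if d.strip()]

def extract_metadata(doc: str) -> dict:
    """Pull kind / metadata.name / metadata.namespace via a line scan.
    (The real component would parse with ruamel.yaml instead of scanning.)"""
    meta = {"kind": "", "name": "", "namespace": ""}
    in_metadata = False
    for line in doc.splitlines():
        if line.startswith("kind:"):
            meta["kind"] = line.split(":", 1)[1].strip()
        elif line.startswith("metadata:"):
            in_metadata = True
        elif in_metadata and line.startswith("  name:"):
            meta["name"] = line.split(":", 1)[1].strip()
        elif in_metadata and line.startswith("  namespace:"):
            meta["namespace"] = line.split(":", 1)[1].strip()
        elif in_metadata and line and not line[0].isspace():
            in_metadata = False  # left the metadata block
    return meta
```

Each returned document becomes one chunk, stored with its extracted metadata attached.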
What AST-Aware Chunks Look Like (The Goal)
| Chunk | Metadata | Content |
|---|---|---|
| Chunk 1 | kind: Deployment, name: notebook-controller, namespace: kubeflow |
The complete, intact Deployment YAML |
| Chunk 2 | kind: Service, name: notebook-controller-service, namespace: kubeflow |
The complete, intact Service YAML |
Now when a user asks about resource limits for the notebook controller, the retriever matches on the Deployment chunk, which contains the full, unbroken context. The LLM can give a precise, correct answer.
Handling Large Resources
Some Kubernetes resources (e.g., a CRD definition with 500+ lines) may be too large for a single embedding vector to represent well. In these cases, a secondary split is applied:
- First, split at the resource boundary (AST-level).
- If a single resource exceeds a threshold (e.g., 2000 characters), a carefully scoped text splitter is applied within that resource, ensuring the resource's `kind` and `metadata` header is prepended to every sub-chunk so context is never lost.
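A minimal sketch of that secondary split, assuming the 2000-character threshold above; the function name and the simple fixed-width windowing of the body are illustrative, not the final implementation:

```python
def split_large_resource(doc: str, max_chars: int = 2000) -> list[str]:
    """Secondary split for oversized resources: window the body, but
    prepend the apiVersion/kind/metadata header to every sub-chunk."""
    if len(doc) <= max_chars:
        return [doc]
    lines = doc.splitlines(keepends=True)
    header_end = 0
    for i, line in enumerate(lines):
        top_level = line and not line[0].isspace()
        # The header ends at the first top-level key after the metadata block.
        if top_level and not line.startswith(("apiVersion", "kind", "metadata")):
            header_end = i
            break
        header_end = i + 1
    header = "".join(lines[:header_end])
    body = "".join(lines[header_end:])
    step = max(1, max_chars - len(header))
    return [header + body[i:i + step] for i in range(0, len(body), step)]
```

Every sub-chunk now begins with the identifying header, so a retriever hit deep inside a 500-line CRD still knows which resource it belongs to.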
Architecture: The KFP Pipeline
The pipeline consists of three KFP components, optimized for cloud-native execution:
```
┌──────────────────┐     ┌───────────────────────┐     ┌──────────────────┐
│   Component 1    │────▶│      Component 2      │────▶│   Component 3    │
│  Download Code   │     │   Semantic Splitter   │     │  Embed & Store   │
└──────────────────┘     └───────────────────────┘     └──────────────────┘
```
Component 1: download_github_code
What it does: Recursively fetches source files matching specific extensions from the GitHub Contents API.
Key Features:
- Branch Pinning: Uses the `ref` parameter to target specific releases (e.g., `v1.9-branch`).
- Smart Rate-Limiting: Monitors `X-RateLimit-Remaining` and pauses execution automatically to avoid API bans.
- Selective Sync: Fetches only the files designated for indexing (e.g., `.yaml`, `.py`), keeping the KFP workspace small and fast.
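The selective-sync and rate-limit logic can be sketched as two pure functions. The HTTP call itself (GET `https://api.github.com/repos/{owner}/{repo}/contents/{path}?ref=...`) is omitted here; the extension list and function names are assumptions, not the component's final API:

```python
WANTED_EXTENSIONS = (".yaml", ".yml", ".py")  # assumed pipeline parameter

def should_fetch(path: str) -> bool:
    """Selective sync: download only files designated for indexing."""
    return path.endswith(WANTED_EXTENSIONS)

def seconds_to_wait(headers: dict, now: float) -> float:
    """Smart rate-limiting: when X-RateLimit-Remaining hits 0, sleep
    until X-RateLimit-Reset (a Unix timestamp) before the next request."""
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0.0
    reset = float(headers.get("X-RateLimit-Reset", now))
    return max(0.0, reset - now)
```

The component would call `should_fetch` while walking the directory listing, and `seconds_to_wait` (passing `time.time()`) after each response.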
Component 2: chunk_and_embed_code (Semantic Splitting)
What it does: Instead of rigid AST parsing, this component uses language-aware regex separators (e.g., \ndef , \nclass , ---) to split files at logical boundaries.
| File Type | Splitting Strategy |
|---|---|
| `.yaml` / `.yml` | Split on `---` document boundaries. Extracts `kind`, `name`, `namespace` from the header. |
| `.py` | Split on `class` and `def` boundaries. Preserves function signatures and docstrings within the chunk context. |
| `.go` / `.sh` | Semantic splitting at block boundaries to prevent mid-logic truncation. |
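A minimal sketch of this language-aware splitting, assuming a per-extension separator table; the patterns below mirror the `\ndef `, `\nclass `, and `---` boundaries described above but are illustrative, not the component's final separator set:

```python
import re

# Assumed separator patterns per extension. Zero-width lookaheads keep
# the `def`/`class` keyword attached to the chunk that follows it.
SEPARATORS = {
    ".py": r"(?m)^(?=def |class )",
    ".yaml": r"(?m)^---\s*$",
    ".yml": r"(?m)^---\s*$",
}

def semantic_split(text: str, ext: str) -> list[str]:
    """Split a source file at logical boundaries for its file type."""
    pattern = SEPARATORS.get(ext)
    if pattern is None:
        return [text]  # unknown type: fall back to a single chunk
    return [p for p in (part.strip("\n") for part in re.split(pattern, text)) if p]
```

A Python file with a module preamble, one function, and one class would yield three chunks, each starting at a logical boundary.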
Architectural Decision: API + Semantic Splitting vs. Clone + AST
While cloning the entire repo and building an Abstract Syntax Tree (AST) sounds precise, the design utilizes an API + Semantic Splitting model for several critical engineering reasons:
1. Language Agnosticism (Avoiding the "AST Trap")
- AST Problem: To parse an AST, you need a language-specific interpreter (e.g., a Python runtime for
.py, a Go parser for.go). In a repo likekubeflow/manifests, you’d need 5+ parsing environments in a single container. - The Solution: Language-aware semantic splitting uses structural regex patterns. It respects class and function boundaries without needing to "compile" the file, making it robust against multiple languages and even slightly invalid/templated syntax.
2. Resource & Storage Efficiency
- Clone Problem: Even a shallow clone downloads the
.gitdirectory, which can be massive. In Kubeflow, every byte must be stored in the pipeline's artifact storage (e.g., MinIO). - The Solution: The GitHub API enables the retrieval of only the 5MB of code required, skipping unnecessary history. This makes the pipeline significantly more efficient and faster.
3. RAG Optimization (Overlap & Context)
- AST Problem: AST nodes are discrete. If a function is 5,000 lines long, a node-based parser can't easily break it into searchable chunks without losing the "node" structure.
- The Solution: Text-based splitting allows for chunk overlap. This ensures that if a query matches the end of one block and the start of another, the retriever captures both, providing superior context for the LLM.
4. Resilience to Broken/Templated Code
- AST Problem: Strict parsers crash if a YAML file contains Helm template tags (e.g.,
{{ .Values.name }}) because it's technically "invalid YAML." - The Solution: The regex splitter treats the file as structure-aware text, gracefully handling templates and partial files that an AST parser would reject.
Outputs of `chunk_and_embed_code`:
- A dataset of chunks, where each chunk has:
  - `content_text` (str): The raw text of the chunk.
  - `file_path` (str): The path within the repository.
  - `resource_kind` (str): e.g., `"Deployment"`, `"Service"`, `"function"` (empty for non-structured files).
  - `resource_name` (str): e.g., `"notebook-controller"`.
  - `repo_name` (str): The source repository.
  - `branch` (str): The branch/version this was ingested from.
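For concreteness, the chunk record could be modeled as a dataclass; the class name and the choice of empty-string defaults for the structure fields are illustrative:

```python
from dataclasses import dataclass

@dataclass
class CodeChunk:
    """One row of the chunk dataset emitted by Component 2 (illustrative)."""
    content_text: str
    file_path: str
    repo_name: str
    branch: str
    resource_kind: str = ""  # empty for non-structured files
    resource_name: str = ""
```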
Why a Milvus Partition (Not a Separate Collection)
Code embeddings are stored as a partition within the existing `docs_rag` collection, rather than creating a separate `code_rag` collection. Reasoning:
What is a partition?
A Milvus partition is a logical subdivision within a single collection. All partitions share the same schema (same fields, same vector dimensions, same index type). Think of it like a table with a category column - partitions let you efficiently filter by that category during search.
Why partitions are the right choice here:
- Unified search. When the agent doesn't know in advance whether the answer lives in docs or code, it can search across all partitions in a single query. With separate collections, you'd need two separate API calls and manual result merging.
- Same schema. Both docs chunks and code chunks share the same fundamental structure: `content_text`, `embedding`, `citation_url`, `file_path`. Code chunks simply add a few extra metadata fields (`resource_kind`, `resource_name`), which are nullable for doc chunks.
- Efficient filtering. You can still search only code or only docs by specifying a partition filter in the search query - so you don't lose any precision.
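A sketch of how that partition filter would shape a search call. The kwargs mirror pymilvus's `Collection.search` (which accepts a `partition_names` argument); the builder function, metric, and `nprobe` value are assumptions for illustration:

```python
from typing import Optional

def build_search_request(query_vec: list, partition: Optional[str] = None) -> dict:
    """Assemble kwargs for a Milvus search. With no partition the query
    spans docs and code; naming one (e.g. "code_manifests") narrows it."""
    req = {
        "data": [query_vec],
        "anns_field": "embedding",
        "param": {"metric_type": "IP", "params": {"nprobe": 16}},
        "limit": 5,
    }
    if partition is not None:
        req["partition_names"] = [partition]
    return req

# Usage against a live collection (pymilvus):
#   hits = collection.search(**build_search_request(vec, "code_manifests"))
```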
Partition scheme:
| Partition Name | Contents |
|---|---|
| `documentation` | Existing prose docs from `kubeflow/website` |
| `code_manifests` | YAML/Kustomize from `kubeflow/manifests` |
| `code_pipelines` | Source from `kubeflow/pipelines` (future) |
| `code_notebooks` | Source from `kubeflow/notebooks` (future) |
Target Repositories
The implementation starts with kubeflow/manifests because it is the most impactful—it is the literal source of truth for how every Kubeflow component is deployed. The pipeline is designed to be generalizable to any repository by changing the pipeline parameters.
| Repository | Priority | What It Contains |
|---|---|---|
| `kubeflow/manifests` | P0 (Start here) | Kustomize overlays for deploying all Kubeflow components |
| `kubeflow/pipelines` | P1 | KFP SDK, backend API, and compiler |
| `kubeflow/notebooks` | P1 | Notebook Controller, PodDefaults, RBAC |
Implementation Issues (The Sub-Tasks)
Once this design is approved, the following granular issues will be created:
- Issue 1: Implement KFP Component - `download_github_code` (REST API with rate-limit handling)
- Issue 2: Implement KFP Component - `chunk_and_embed_code` (Language-aware semantic splitting)
- Issue 3: Wire all components into main pipeline and verify E2E with local Milvus
Open Questions for Community Discussion
- Kustomize resolution: Should the pipeline ingest the raw individual YAML files, or should it run `kustomize build` first to produce the fully-resolved manifests? Running `kustomize build` gives the "final truth" of what gets applied to the cluster, but loses the overlay/base decomposition that helps users understand the structure.
- Branch strategy: Should the pipeline ingest only the default branch, or should it ingest tagged release branches (e.g., `v1.9-branch`) to support version-specific questions?
- Python source files: For Phase 1, should the scope be limited strictly to YAML/Kustomize, or should it also include Python source parsing (e.g., the KFP SDK)?
cc @chasecadet @tarekabouzeid - Requesting feedback on this design before breaking it into actionable implementation issues.