
[DESIGN PROPOSAL] : AST Code Ingestion Pipeline for Kubeflow Repositories #120

@haroon0x

Description


Reference: GSoC 2026 Agentic RAG Specification - Phase 1: Ingestion & Foundation


The Problem

Today, the docs-agent can answer questions about Kubeflow's prose documentation. The existing KFP pipeline (pipelines/kubeflow-pipeline.py) crawls kubeflow/website for Markdown files, and was recently extended in #8 to also ingest GitHub Issues and their comments, giving the agent access to real-world troubleshooting threads. (The issues component is not yet wired into the pipeline; that gap is addressed by a PR currently awaiting review.) Both data sources are stored in Milvus.

But Kubeflow users frequently ask questions whose answers live in code, not documentation:

"What are the default resource limits for the Notebook Controller?"
"How is the Katib controller's RBAC configured?"
"What environment variables does the KFP API server expect?"

The answers to all of these exist in the actual YAML manifests, Python source files, and Kustomize overlays inside repositories like kubeflow/manifests, kubeflow/pipelines, kubeflow/notebooks, etc. These files have never been ingested into the vector database.

The natural evolution of the ingestion layer is: Docs → Issues → Code. With documentation and issues already covered, this proposal describes the next step: a new Kubeflow Pipeline that ingests code repositories, starting with kubeflow/manifests, and makes them searchable by the docs-agent.


Why the Existing Pipeline Cannot Be Reused

The existing documentation pipeline uses RecursiveCharacterTextSplitter from LangChain. This chunker works by blindly slicing text at a fixed character count (e.g., every 1000 characters), with some overlap.

This works perfectly for English prose. A paragraph that gets split across two chunks is still mostly readable.

But it destroys code.

Example: What Happens When You Naively Chunk YAML

Consider this simplified multi-resource YAML file from kubeflow/manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: notebook-controller
  namespace: kubeflow
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: manager
          image: kubeflownotebookswg/notebook-controller
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
          env:
            - name: CLUSTER_DOMAIN
              value: cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: notebook-controller-service
  namespace: kubeflow
spec:
  ports:
    - port: 443

If a naive 1000-character chunker splits this file, it will produce something like:

| Chunk | Content |
|---|---|
| Chunk 1 | Lines 1–15: the Deployment header, metadata, and the start of `resources:` |
| Chunk 2 | Lines 16–30: `limits: cpu: "1"`, the env vars, the `---` separator, and the start of the Service |

Chunk 2 is now broken. It contains limits: cpu: "1" but it has completely lost the context that this belongs to a Deployment named notebook-controller. Worse, the chunk also mixes the tail of the Deployment with the head of a completely unrelated Service resource.
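A stdlib-only sketch of that failure mode, run on the example file above (the chunk size is shrunk to 200 characters so this short file actually splits; the real splitter uses ~1000):

```python
# Demonstrates how fixed-width chunking orphans the CPU limit from its
# Deployment. MANIFEST is the two-resource example file above.
MANIFEST = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notebook-controller
  namespace: kubeflow
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: manager
          image: kubeflownotebookswg/notebook-controller
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
          env:
            - name: CLUSTER_DOMAIN
              value: cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: notebook-controller-service
  namespace: kubeflow
spec:
  ports:
    - port: 443
"""

def naive_chunks(text, size=200, overlap=20):
    """Fixed-width character slicing with overlap, ignoring all structure."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

chunks = naive_chunks(MANIFEST)
cpu_chunk = next(c for c in chunks if 'cpu: "1"' in c)
svc_chunk = next(c for c in chunks if "kind: Service" in c)
print("kind: Deployment" in cpu_chunk)  # → False: the limit lost its resource kind
print("cluster.local" in svc_chunk)     # → True: Deployment env leaked into the Service chunk
```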


AST-Aware Chunking

AST stands for **Abstract Syntax Tree**: a tree-shaped representation of a file's structure, not just its raw text.

Instead of slicing at character boundaries, an AST-aware chunker understands the grammar of the file format:

  • For YAML, the structure is the indentation hierarchy and the --- document separator.
  • For Python, the structure is classes, functions, and their docstrings.
  • For Kustomize, the structure is the kustomization.yaml that references bases, overlays, and patches.

How AST-Aware Chunking Works for YAML

  1. Parse the file using a structure-aware YAML library like ruamel.yaml.
  2. Split at resource boundaries. Every --- separator in a multi-document YAML file marks a distinct Kubernetes resource. Each resource becomes its own chunk.
  3. Extract metadata. From each parsed resource, pull out kind, metadata.name, and metadata.namespace to store alongside the chunk. This metadata is critical for the retriever to filter and rank results.
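A minimal sketch of these three steps. The proposal names ruamel.yaml for step 1; to keep this snippet dependency-free, naive regexes stand in for a full parser, and the helper names are hypothetical:

```python
import re

def _first(pattern, text):
    """Return the first regex capture group, or '' if no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else ""

def split_k8s_yaml(text):
    """Split a multi-document YAML file at '---' boundaries and attach
    kind / metadata.name / metadata.namespace to each resulting chunk."""
    chunks = []
    for doc in re.split(r"(?m)^---\s*$", text):
        doc = doc.strip()
        if not doc:
            continue
        chunks.append({
            "content_text": doc,
            "kind": _first(r"(?m)^kind:\s*(\S+)", doc),
            # Crude: takes the first indented 'name:'/'namespace:', which is
            # the metadata block in these manifests.
            "name": _first(r"(?m)^\s+name:\s*(\S+)", doc),
            "namespace": _first(r"(?m)^\s+namespace:\s*(\S+)", doc),
        })
    return chunks
```

Run against the two-resource example above, this yields exactly two chunks, one per Kubernetes resource, each carrying its own kind/name/namespace metadata.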

What AST-Aware Chunks Look Like (The Goal)

| Chunk | Metadata | Content |
|---|---|---|
| Chunk 1 | `kind: Deployment`, `name: notebook-controller`, `namespace: kubeflow` | The complete, intact Deployment YAML |
| Chunk 2 | `kind: Service`, `name: notebook-controller-service`, `namespace: kubeflow` | The complete, intact Service YAML |

Now when a user asks about resource limits for the notebook controller, the retriever matches on the Deployment chunk, which contains the full, unbroken context. The LLM can give a precise, correct answer.

Handling Large Resources

Some Kubernetes resources (e.g., a CRD definition with 500+ lines) may be too large for a single embedding vector to represent well. In these cases, a secondary split is applied:

  1. First, split at the resource boundary (AST-level).
  2. If a single resource exceeds a threshold (e.g., 2000 characters), a carefully scoped text splitter is applied within that resource, ensuring the resource's kind and metadata header is prepended to every sub-chunk so context is never lost.
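A sketch of that two-stage rule (the threshold values and the `spec:`-based header heuristic are illustrative, not final):

```python
def split_large_resource(resource_text, max_chars=2000, overlap=200):
    """Stage 2: if a single resource is still too big, window it, but prepend
    the kind/metadata header to every sub-chunk so context is never lost."""
    if len(resource_text) <= max_chars:
        return [resource_text]
    lines = resource_text.splitlines(keepends=True)
    # Heuristic: everything before the top-level 'spec:' is the identifying header.
    header_end = next((i for i, line in enumerate(lines)
                       if line.startswith("spec:")), min(5, len(lines)))
    header = "".join(lines[:header_end])
    body = "".join(lines[header_end:])
    subchunks, start = [], 0
    while start < len(body):
        subchunks.append(header + body[start:start + max_chars])
        start += max_chars - overlap
    return subchunks
```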

Architecture: The KFP Pipeline

The pipeline consists of three KFP components, optimized for cloud-native execution:

┌──────────────────┐     ┌───────────────────────┐     ┌──────────────────┐
│   Component 1    │────▶│     Component 2       │────▶│   Component 3    │
│  Download Code   │     │  Semantic Splitter    │     │  Embed & Store   │
└──────────────────┘     └───────────────────────┘     └──────────────────┘

Component 1: download_github_code

What it does: Recursively fetches source files matching specific extensions from the GitHub Contents API.

Key Features:

  • Branch Pinning: Uses the ref parameter to target specific releases (e.g., v1.9-branch).
  • Smart Rate-Limiting: Monitors X-RateLimit-Remaining and pauses execution automatically to avoid API bans.
  • Selective Sync: Fetches only the files designated for indexing (e.g., .yaml, .py), keeping the KFP workspace small and fast.
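A sketch of the downloader's rate-limit logic. The function names here are hypothetical; `X-RateLimit-Remaining` and `X-RateLimit-Reset` are the real GitHub REST API response headers:

```python
import json
import time
import urllib.request

def seconds_to_wait(headers, min_remaining=5):
    """Return how long to sleep before the next request; 0 while budget remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > min_remaining:
        return 0.0
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    return max(0.0, reset_at - time.time()) + 1.0  # +1s safety margin

def list_dir(repo, path="", ref="master", token=None):
    """One call to the Contents API: /repos/{repo}/contents/{path}?ref={ref}."""
    url = f"https://api.github.com/repos/{repo}/contents/{path}?ref={ref}"
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        wait = seconds_to_wait(resp.headers)
        if wait:
            time.sleep(wait)
        return json.load(resp)
```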

Component 2: chunk_and_embed_code (Semantic Splitting)

What it does: Instead of rigid AST parsing, this component uses language-aware regex separators (e.g., \ndef , \nclass , ---) to split files at logical boundaries.

| File Type | Splitting Strategy |
|---|---|
| `.yaml` / `.yml` | Split on `---` document boundaries. Extracts `kind`, `name`, `namespace` from the header. |
| `.py` | Split on `class` and `def` boundaries. Preserves function signatures and docstrings within the chunk context. |
| `.go` / `.sh` | Semantic splitting at block boundaries to prevent mid-logic truncation. |
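The separator table can be sketched as follows (the patterns are illustrative; the real component could equally build on LangChain's language-aware splitters):

```python
import os
import re

# Illustrative boundary patterns per extension; the (?=...) lookaheads keep
# the keyword at the start of the chunk that follows the split point.
SEPARATORS = {
    ".yaml": r"(?m)^---\s*$",
    ".yml":  r"(?m)^---\s*$",
    ".py":   r"(?m)^(?=def |class )",
    ".go":   r"(?m)^(?=func )",
}

def semantic_split(path, text):
    """Split at language-specific logical boundaries; whole file as fallback."""
    pattern = SEPARATORS.get(os.path.splitext(path)[1])
    if pattern is None:
        return [text]
    parts = [p for p in re.split(pattern, text) if p.strip()]
    return parts or [text]
```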

Architectural Decision: API + Semantic Splitting vs. Clone + AST

While cloning the entire repo and building an Abstract Syntax Tree (AST) sounds precise, the design utilizes an API + Semantic Splitting model for several critical engineering reasons:

1. Language Agnosticism (Avoiding the "AST Trap")

  • AST Problem: To parse an AST, you need a language-specific interpreter (e.g., a Python runtime for .py, a Go parser for .go). In a repo like kubeflow/manifests, you’d need 5+ parsing environments in a single container.
  • The Solution: Language-aware semantic splitting uses structural regex patterns. It respects class and function boundaries without needing to "compile" the file, making it robust against multiple languages and even slightly invalid/templated syntax.

2. Resource & Storage Efficiency

  • Clone Problem: Even a shallow clone downloads the .git directory, which can be massive. In Kubeflow, every byte must be stored in the pipeline's artifact storage (e.g., MinIO).
  • The Solution: The GitHub API enables the retrieval of only the 5MB of code required, skipping unnecessary history. This makes the pipeline significantly more efficient and faster.

3. RAG Optimization (Overlap & Context)

  • AST Problem: AST nodes are discrete. If a function is 5,000 lines long, a node-based parser can't easily break it into searchable chunks without losing the "node" structure.
  • The Solution: Text-based splitting allows for chunk overlap. This ensures that if a query matches the end of one block and the start of another, the retriever captures both, providing superior context for the LLM.

4. Resilience to Broken/Templated Code

  • AST Problem: Strict parsers crash if a YAML file contains Helm template tags (e.g., {{ .Values.name }}) because it's technically "invalid YAML."
  • The Solution: The regex splitter treats the file as structure-aware text, gracefully handling templates and partial files that an AST parser would reject.


Component 2 outputs:

  • A dataset of chunks, where each chunk has:
    • content_text (str): The raw text of the chunk.
    • file_path (str): The path within the repository.
    • resource_kind (str): e.g., "Deployment", "Service", "function" (empty for non-structured files).
    • resource_name (str): e.g., "notebook-controller".
    • repo_name (str): The source repository.
    • branch (str): The branch/version this was ingested from.
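The output record can be sketched as a dataclass (the field names are the proposal's; the dataclass itself is illustrative):

```python
from dataclasses import dataclass

@dataclass
class CodeChunk:
    content_text: str        # raw text of the chunk
    file_path: str           # path within the repository
    repo_name: str           # source repository
    branch: str              # branch/version this was ingested from
    resource_kind: str = ""  # e.g. "Deployment"; empty for non-structured files
    resource_name: str = ""  # e.g. "notebook-controller"
```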

Why a Milvus Partition (Not a Separate Collection)

Code embeddings are stored as a partition within the existing docs_rag collection, rather than creating a separate code_rag collection. Reasoning:

What is a partition?
A Milvus partition is a logical subdivision within a single collection. All partitions share the same schema (same fields, same vector dimensions, same index type). Think of it like a table with a category column - partitions let you efficiently filter by that category during search.

Why partitions are the right choice here:

  1. Unified search. When the agent doesn't know in advance whether the answer lives in docs or code, it can search across all partitions in a single query. With separate collections, you'd need two separate API calls and manual result merging.
  2. Same schema. Both docs chunks and code chunks share the same fundamental structure: content_text, embedding, citation_url, file_path. Code chunks simply add a few extra metadata fields (resource_kind, resource_name), which are nullable for doc chunks.
  3. Efficient filtering. You can still search only code or only docs by specifying a partition filter in the search query - so you don't lose any precision.

Partition scheme:

| Partition Name | Contents |
|---|---|
| `documentation` | Existing prose docs from kubeflow/website |
| `code_manifests` | YAML/Kustomize from kubeflow/manifests |
| `code_pipelines` | Source from kubeflow/pipelines (future) |
| `code_notebooks` | Source from kubeflow/notebooks (future) |
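Partition routing over this scheme can be sketched as follows (the `partitions_for` helper is hypothetical; the commented call shows where pymilvus's `partition_names` search parameter would come in):

```python
# Partition names from the scheme above.
PARTITIONS = {
    "docs": ["documentation"],
    "code": ["code_manifests", "code_pipelines", "code_notebooks"],
}

def partitions_for(scope):
    """Map a query scope ('docs', 'code', or 'all') to Milvus partition names."""
    if scope == "all":
        return PARTITIONS["docs"] + PARTITIONS["code"]
    return list(PARTITIONS[scope])

# With pymilvus, the search would then restrict itself via partition_names, e.g.:
# collection.search(data=[query_vec], anns_field="embedding",
#                   param={"metric_type": "COSINE"}, limit=5,
#                   partition_names=partitions_for("code"))
```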

Target Repositories

The implementation starts with kubeflow/manifests because it is the most impactful—it is the literal source of truth for how every Kubeflow component is deployed. The pipeline is designed to be generalizable to any repository by changing the pipeline parameters.

Repository Priority What It Contains
kubeflow/manifests P0 (Start here) Kustomize overlays for deploying all Kubeflow components
kubeflow/pipelines P1 KFP SDK, backend API, and compiler
kubeflow/notebooks P1 Notebook Controller, PodDefaults, RBAC

Implementation Issues (The Sub-Tasks)

Once this design is approved, the following granular issues will be created:

  • Issue 1: Implement KFP Component - download_github_code (REST API with rate-limit handling)
  • Issue 2: Implement KFP Component - chunk_and_embed_code (Language-aware semantic splitting)
  • Issue 3: Wire all components into main pipeline and verify E2E with local Milvus

Open Questions for Community Discussion

  1. Kustomize resolution: Should the pipeline ingest the raw individual YAML files, or should it run kustomize build first to produce the fully-resolved manifests? Running kustomize build gives the "final truth" of what gets applied to the cluster, but loses the overlay/base decomposition that helps users understand the structure.
  2. Branch strategy: Should the pipeline ingest only the default branch, or should it ingest tagged release branches (e.g., v1.9-branch) to support version-specific questions?
  3. Python source files: For Phase 1, should the scope be limited strictly to YAML/Kustomize, or should it also include Python source parsing (e.g., the KFP SDK)?

cc @chasecadet @tarekabouzeid - Requesting feedback on this design before breaking it into actionable implementation issues.
