
[DESIGN PROPOSAL] : AST Code Ingestion Pipeline for Kubeflow Repositories #120

@haroon0x

Description


Reference: GSoC 2026 Agentic RAG Specification - Phase 1: Ingestion & Foundation


The Problem

Today, the docs-agent can answer questions about Kubeflow's prose documentation. The existing KFP pipeline (pipelines/kubeflow-pipeline.py) crawls kubeflow/website for Markdown files, and was recently extended in #8 to also ingest GitHub Issues and their comments, giving the agent access to real-world troubleshooting threads. (The issues component is not yet wired into the pipeline; that gap is addressed by a PR currently awaiting review.) Both data sources are stored in Milvus.

But Kubeflow users frequently ask questions whose answers live in code, not documentation:

"What are the default resource limits for the Notebook Controller?"
"How is the Katib controller's RBAC configured?"
"What environment variables does the KFP API server expect?"

The answers to all of these exist in the actual YAML manifests, Python source files, and Kustomize overlays inside repositories like kubeflow/manifests, kubeflow/pipelines, kubeflow/notebooks, etc. These files have never been ingested into the vector database.

The natural evolution of the ingestion layer is: Docs → Issues → Code. With documentation and issues already covered, this proposal describes the next step: a new Kubeflow Pipeline that ingests code repositories, starting with kubeflow/manifests, and makes them searchable by the docs-agent.


Why the Existing Pipeline Cannot Be Reused

The existing documentation pipeline uses RecursiveCharacterTextSplitter from LangChain. This chunker works by blindly slicing text at a fixed character count (e.g., every 1000 characters), with some overlap.

This works perfectly for English prose. A paragraph that gets split across two chunks is still mostly readable.

But it destroys code.

Example: What Happens When You Naively Chunk YAML

Consider this simplified multi-resource YAML file from kubeflow/manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: notebook-controller
  namespace: kubeflow
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: manager
          image: kubeflownotebookswg/notebook-controller
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
          env:
            - name: CLUSTER_DOMAIN
              value: cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: notebook-controller-service
  namespace: kubeflow
spec:
  ports:
    - port: 443

If a naive 1000-character chunker splits this file, it will produce something like:

| Chunk | Content |
|---|---|
| Chunk 1 | Lines 1–15: the Deployment header, metadata, and the start of `resources:` |
| Chunk 2 | Lines 16–30: `limits: cpu: "1"`, the env vars, the `---` separator, and the start of the Service |

Chunk 2 is now broken. It contains limits: cpu: "1" but it has completely lost the context that this belongs to a Deployment named notebook-controller. Worse, the chunk also mixes the tail of the Deployment with the head of a completely unrelated Service resource.
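A stdlib-only sketch of that failure mode, run on the example file above (the chunk size is shrunk to 200 characters so this short file actually splits; the real splitter uses ~1000):

```python
# Demonstrates how fixed-width chunking orphans the CPU limit from its
# Deployment. MANIFEST is the two-resource example file above.
MANIFEST = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: notebook-controller
  namespace: kubeflow
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: manager
          image: kubeflownotebookswg/notebook-controller
          resources:
            limits:
              cpu: "1"
              memory: "1Gi"
          env:
            - name: CLUSTER_DOMAIN
              value: cluster.local
---
apiVersion: v1
kind: Service
metadata:
  name: notebook-controller-service
  namespace: kubeflow
spec:
  ports:
    - port: 443
"""

def naive_chunks(text, size=200, overlap=20):
    """Fixed-width character slicing with overlap, ignoring all structure."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

chunks = naive_chunks(MANIFEST)
cpu_chunk = next(c for c in chunks if 'cpu: "1"' in c)
svc_chunk = next(c for c in chunks if "kind: Service" in c)
print("kind: Deployment" in cpu_chunk)  # → False: the limit lost its resource kind
print("cluster.local" in svc_chunk)     # → True: Deployment env leaked into the Service chunk
```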


AST-Aware Chunking

AST stands for **Abstract Syntax Tree**: a tree-shaped representation of a file's structure, not just its raw text.

Instead of slicing at character boundaries, an AST-aware chunker understands the grammar of the file format:

  • For YAML, the structure is the indentation hierarchy and the --- document separator.
  • For Python, the structure is classes, functions, and their docstrings.
  • For Kustomize, the structure is the kustomization.yaml that references bases, overlays, and patches.

How AST-Aware Chunking Works for YAML

  1. Parse the file using a structure-aware YAML library like ruamel.yaml.
  2. Split at resource boundaries. Every --- separator in a multi-document YAML file marks a distinct Kubernetes resource. Each resource becomes its own chunk.
  3. Extract metadata. From each parsed resource, pull out kind, metadata.name, and metadata.namespace to store alongside the chunk. This metadata is critical for the retriever to filter and rank results.
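A minimal sketch of these three steps. The proposal names ruamel.yaml for step 1; to keep this snippet dependency-free, naive regexes stand in for a full parser, and the helper names are hypothetical:

```python
import re

def _first(pattern, text):
    """Return the first regex capture group, or '' if no match."""
    m = re.search(pattern, text)
    return m.group(1) if m else ""

def split_k8s_yaml(text):
    """Split a multi-document YAML file at '---' boundaries and attach
    kind / metadata.name / metadata.namespace to each resulting chunk."""
    chunks = []
    for doc in re.split(r"(?m)^---\s*$", text):
        doc = doc.strip()
        if not doc:
            continue
        chunks.append({
            "content_text": doc,
            "kind": _first(r"(?m)^kind:\s*(\S+)", doc),
            # Crude: takes the first indented 'name:'/'namespace:', which is
            # the metadata block in these manifests.
            "name": _first(r"(?m)^\s+name:\s*(\S+)", doc),
            "namespace": _first(r"(?m)^\s+namespace:\s*(\S+)", doc),
        })
    return chunks
```

Run against the two-resource example above, this yields exactly two chunks, one per Kubernetes resource, each carrying its own kind/name/namespace metadata.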

What AST-Aware Chunks Look Like (The Goal)

| Chunk | Metadata | Content |
|---|---|---|
| Chunk 1 | `kind: Deployment`, `name: notebook-controller`, `namespace: kubeflow` | The complete, intact Deployment YAML |
| Chunk 2 | `kind: Service`, `name: notebook-controller-service`, `namespace: kubeflow` | The complete, intact Service YAML |

Now when a user asks about resource limits for the notebook controller, the retriever matches on the Deployment chunk, which contains the full, unbroken context. The LLM can give a precise, correct answer.

Handling Large Resources

Some Kubernetes resources (e.g., a CRD definition with 500+ lines) may be too large for a single embedding vector to represent well. In these cases, a secondary split is applied:

  1. First, split at the resource boundary (AST-level).
  2. If a single resource exceeds a threshold (e.g., 2000 characters), a carefully scoped text splitter is applied within that resource, ensuring the resource's kind and metadata header is prepended to every sub-chunk so context is never lost.
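A sketch of that two-stage rule (the threshold values and the `spec:`-based header heuristic are illustrative, not final):

```python
def split_large_resource(resource_text, max_chars=2000, overlap=200):
    """Stage 2: if a single resource is still too big, window it, but prepend
    the kind/metadata header to every sub-chunk so context is never lost."""
    if len(resource_text) <= max_chars:
        return [resource_text]
    lines = resource_text.splitlines(keepends=True)
    # Heuristic: everything before the top-level 'spec:' is the identifying header.
    header_end = next((i for i, line in enumerate(lines)
                       if line.startswith("spec:")), min(5, len(lines)))
    header = "".join(lines[:header_end])
    body = "".join(lines[header_end:])
    subchunks, start = [], 0
    while start < len(body):
        subchunks.append(header + body[start:start + max_chars])
        start += max_chars - overlap
    return subchunks
```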

Architecture: The KFP Pipeline

The pipeline consists of three KFP components, optimized for cloud-native execution:

┌──────────────────┐     ┌───────────────────────┐     ┌──────────────────┐
│   Component 1    │────▶│     Component 2       │────▶│   Component 3    │
│  Download Code   │     │  Semantic Splitter    │     │  Embed & Store   │
└──────────────────┘     └───────────────────────┘     └──────────────────┘

Component 1: download_github_code

What it does: Recursively fetches source files matching specific extensions from the GitHub Contents API.

Key Features:

  • Branch Pinning: Uses the ref parameter to target specific releases (e.g., v1.9-branch).
  • Smart Rate-Limiting: Monitors X-RateLimit-Remaining and pauses execution automatically to avoid API bans.
  • Selective Sync: Fetches only the files designated for indexing (e.g., .yaml, .py), keeping the KFP workspace small and fast.
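A sketch of the downloader's rate-limit logic. The function names here are hypothetical; `X-RateLimit-Remaining` and `X-RateLimit-Reset` are the real GitHub REST API response headers:

```python
import json
import time
import urllib.request

def seconds_to_wait(headers, min_remaining=5):
    """Return how long to sleep before the next request; 0 while budget remains."""
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > min_remaining:
        return 0.0
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    return max(0.0, reset_at - time.time()) + 1.0  # +1s safety margin

def list_dir(repo, path="", ref="master", token=None):
    """One call to the Contents API: /repos/{repo}/contents/{path}?ref={ref}."""
    url = f"https://api.github.com/repos/{repo}/contents/{path}?ref={ref}"
    req = urllib.request.Request(url)
    if token:
        req.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(req) as resp:
        wait = seconds_to_wait(resp.headers)
        if wait:
            time.sleep(wait)
        return json.load(resp)
```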

Component 2: chunk_and_embed_code (Semantic Splitting)

What it does: Instead of rigid AST parsing, this component uses language-aware regex separators (e.g., \ndef , \nclass , ---) to split files at logical boundaries.

| File Type | Splitting Strategy |
|---|---|
| `.yaml` / `.yml` | Split on `---` document boundaries. Extracts `kind`, `name`, `namespace` from the header. |
| `.py` | Split on `class` and `def` boundaries. Preserves function signatures and docstrings within the chunk context. |
| `.go` / `.sh` | Semantic splitting at block boundaries to prevent mid-logic truncation. |
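The separator table can be sketched as follows (the patterns are illustrative; the real component could equally build on LangChain's language-aware splitters):

```python
import os
import re

# Illustrative boundary patterns per extension; the (?=...) lookaheads keep
# the keyword at the start of the chunk that follows the split point.
SEPARATORS = {
    ".yaml": r"(?m)^---\s*$",
    ".yml":  r"(?m)^---\s*$",
    ".py":   r"(?m)^(?=def |class )",
    ".go":   r"(?m)^(?=func )",
}

def semantic_split(path, text):
    """Split at language-specific logical boundaries; whole file as fallback."""
    pattern = SEPARATORS.get(os.path.splitext(path)[1])
    if pattern is None:
        return [text]
    parts = [p for p in re.split(pattern, text) if p.strip()]
    return parts or [text]
```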

Architectural Decision: API + Semantic Splitting vs. Clone + AST

While cloning the entire repo and building an Abstract Syntax Tree (AST) sounds precise, the design utilizes an API + Semantic Splitting model for several critical engineering reasons:

1. Language Agnosticism (Avoiding the "AST Trap")

  • AST Problem: To parse an AST, you need a language-specific interpreter (e.g., a Python runtime for .py, a Go parser for .go). In a repo like kubeflow/manifests, you’d need 5+ parsing environments in a single container.
  • The Solution: Language-aware semantic splitting uses structural regex patterns. It respects class and function boundaries without needing to "compile" the file, making it robust against multiple languages and even slightly invalid/templated syntax.

2. Resource & Storage Efficiency

  • Clone Problem: Even a shallow clone downloads the .git directory, which can be massive. In Kubeflow, every byte must be stored in the pipeline's artifact storage (e.g., MinIO).
  • The Solution: The GitHub API enables the retrieval of only the 5MB of code required, skipping unnecessary history. This makes the pipeline significantly more efficient and faster.

3. RAG Optimization (Overlap & Context)

  • AST Problem: AST nodes are discrete. If a function is 5,000 lines long, a node-based parser can't easily break it into searchable chunks without losing the "node" structure.
  • The Solution: Text-based splitting allows for chunk overlap. This ensures that if a query matches the end of one block and the start of another, the retriever captures both, providing superior context for the LLM.

4. Resilience to Broken/Templated Code

  • AST Problem: Strict parsers crash if a YAML file contains Helm template tags (e.g., {{ .Values.name }}) because it's technically "invalid YAML."
  • The Solution: The regex splitter treats the file as structure-aware text, gracefully handling templates and partial files that an AST parser would reject.


Component 2 outputs:

  • A dataset of chunks, where each chunk has:
    • content_text (str): The raw text of the chunk.
    • file_path (str): The path within the repository.
    • resource_kind (str): e.g., "Deployment", "Service", "function" (empty for non-structured files).
    • resource_name (str): e.g., "notebook-controller".
    • repo_name (str): The source repository.
    • branch (str): The branch/version this was ingested from.
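The output record can be sketched as a dataclass (the field names are the proposal's; the dataclass itself is illustrative):

```python
from dataclasses import dataclass

@dataclass
class CodeChunk:
    content_text: str        # raw text of the chunk
    file_path: str           # path within the repository
    repo_name: str           # source repository
    branch: str              # branch/version this was ingested from
    resource_kind: str = ""  # e.g. "Deployment"; empty for non-structured files
    resource_name: str = ""  # e.g. "notebook-controller"
```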

Why a Milvus Partition (Not a Separate Collection)

Code embeddings are stored as a partition within the existing docs_rag collection, rather than creating a separate code_rag collection. Reasoning:

What is a partition?
A Milvus partition is a logical subdivision within a single collection. All partitions share the same schema (same fields, same vector dimensions, same index type). Think of it like a table with a category column - partitions let you efficiently filter by that category during search.

Why partitions are the right choice here:

  1. Unified search. When the agent doesn't know in advance whether the answer lives in docs or code, it can search across all partitions in a single query. With separate collections, you'd need two separate API calls and manual result merging.
  2. Same schema. Both docs chunks and code chunks share the same fundamental structure: content_text, embedding, citation_url, file_path. Code chunks simply add a few extra metadata fields (resource_kind, resource_name), which are nullable for doc chunks.
  3. Efficient filtering. You can still search only code or only docs by specifying a partition filter in the search query - so you don't lose any precision.

Partition scheme:

| Partition Name | Contents |
|---|---|
| `documentation` | Existing prose docs from kubeflow/website |
| `code_manifests` | YAML/Kustomize from kubeflow/manifests |
| `code_pipelines` | Source from kubeflow/pipelines (future) |
| `code_notebooks` | Source from kubeflow/notebooks (future) |
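Partition routing over this scheme can be sketched as follows (the `partitions_for` helper is hypothetical; the commented call shows where pymilvus's `partition_names` search parameter would come in):

```python
# Partition names from the scheme above.
PARTITIONS = {
    "docs": ["documentation"],
    "code": ["code_manifests", "code_pipelines", "code_notebooks"],
}

def partitions_for(scope):
    """Map a query scope ('docs', 'code', or 'all') to Milvus partition names."""
    if scope == "all":
        return PARTITIONS["docs"] + PARTITIONS["code"]
    return list(PARTITIONS[scope])

# With pymilvus, the search would then restrict itself via partition_names, e.g.:
# collection.search(data=[query_vec], anns_field="embedding",
#                   param={"metric_type": "COSINE"}, limit=5,
#                   partition_names=partitions_for("code"))
```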

Target Repositories

The implementation starts with kubeflow/manifests because it is the most impactful—it is the literal source of truth for how every Kubeflow component is deployed. The pipeline is designed to be generalizable to any repository by changing the pipeline parameters.

Repository Priority What It Contains
kubeflow/manifests P0 (Start here) Kustomize overlays for deploying all Kubeflow components
kubeflow/pipelines P1 KFP SDK, backend API, and compiler
kubeflow/notebooks P1 Notebook Controller, PodDefaults, RBAC

Implementation Issues (The Sub-Tasks)

Once this design is approved, the following granular issues will be created:

  • Issue 1: Implement KFP Component - download_github_code (REST API with rate-limit handling)
  • Issue 2: Implement KFP Component - chunk_and_embed_code (Language-aware semantic splitting)
  • Issue 3: Wire all components into main pipeline and verify E2E with local Milvus

Open Questions for Community Discussion

  1. Kustomize resolution: Should the pipeline ingest the raw individual YAML files, or should it run kustomize build first to produce the fully-resolved manifests? Running kustomize build gives the "final truth" of what gets applied to the cluster, but loses the overlay/base decomposition that helps users understand the structure.
  2. Branch strategy: Should the pipeline ingest only the default branch, or should it ingest tagged release branches (e.g., v1.9-branch) to support version-specific questions?
  3. Python source files: For Phase 1, should the scope be limited strictly to YAML/Kustomize, or should it also include Python source parsing (e.g., the KFP SDK)?

cc @chasecadet @tarekabouzeid - Requesting feedback on this design before breaking it into actionable implementation issues.
