
Proposal: Native KFP Evaluation Pipeline with RAGAS #127

@sanjayy0612

Description


Following the GSoC spec update regarding feedback loops and golden
datasets, I'd like to propose building a native Kubeflow Evaluation
Pipeline for the docs-agent.

What

Build an automated evaluation pipeline using Kubeflow Pipelines (KFP)
and RAGAS to continuously benchmark the agent's retrieval accuracy
and generation quality against a golden dataset.

Why

There is currently no automated, CI/CD-ready way to measure whether
a change to the pipeline, prompts, or retrieval logic makes the agent
better or worse.

A native KFP eval pipeline solves this by:

  • Running automatically on every significant change
  • Producing quantitative, comparable scores for retrieval and
    generation quality
  • Fitting naturally into the existing Kubeflow infrastructure
    alongside the ingestion pipeline

Pipeline Architecture

The pipeline consists of three sequential KFP components:

Component 1 — load_golden_dataset
Loads curated Q&A pairs from storage and prepares them as pipeline input.
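As a rough sketch of the loading step (plain Python, not yet wrapped in a KFP component), assuming golden_dataset.json is a JSON array of records; the field names `question` and `ground_truth` are assumptions here, chosen to match what RAGAS consumes downstream:

```python
import json
from pathlib import Path


def load_golden_dataset(path: str) -> list[dict]:
    """Load curated Q&A pairs from a JSON file.

    Each record is expected to carry the fields the later stages need
    (field names are assumptions for this sketch).
    """
    records = json.loads(Path(path).read_text())
    # Fail fast on malformed entries rather than poisoning later stages.
    for rec in records:
        if "question" not in rec or "ground_truth" not in rec:
            raise ValueError(f"malformed golden record: {rec}")
    return records
```

In the real component this function body would sit inside a `@dsl.component`-decorated function that writes the records to an output artifact.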

Component 2 — run_inference
Sends each question to the live docs-agent API, collecting the
generated answers and retrieved context chunks.
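A minimal sketch of the inference step. The HTTP call to the docs-agent is injected as a function (e.g. a `requests.post` to the agent's query endpoint; the endpoint shape and the `answer`/`contexts` response fields are assumptions) so the component logic stays testable without a live agent:

```python
from typing import Callable


def run_inference(records: list[dict],
                  query_agent: Callable[[str], dict]) -> list[dict]:
    """Query the docs-agent for each golden question.

    `query_agent` wraps the real HTTP call and is expected to return a
    dict with the generated answer and the retrieved context chunks
    (field names are assumptions for this sketch).
    """
    results = []
    for rec in records:
        resp = query_agent(rec["question"])
        results.append({
            "question": rec["question"],
            "ground_truth": rec["ground_truth"],
            "answer": resp["answer"],       # generated answer
            "contexts": resp["contexts"],   # retrieved Milvus chunks
        })
    return results
```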

Component 3 — evaluate_with_ragas
Scores each response using RAGAS metrics:

  • Faithfulness — did the answer hallucinate?
  • Answer Relevancy — did it actually answer the question?
  • Context Recall — did the retriever surface the Milvus chunks
    needed to support the ground-truth answer?

Outputs a results report to the Kubeflow artifact store.
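The scoring step could look roughly like this. The column names follow the classic RAGAS dataset schema (`question`/`answer`/`contexts`/`ground_truth`); newer RAGAS releases rename these columns, so this needs checking against the installed version. The RAGAS imports are deferred so the reshaping logic stays testable without ragas and an LLM backend installed:

```python
def build_ragas_records(results: list[dict]) -> dict:
    """Reshape per-question inference output into the column-oriented
    dict RAGAS evaluates (classic schema; an assumption here)."""
    return {
        "question":     [r["question"] for r in results],
        "answer":       [r["answer"] for r in results],
        "contexts":     [r["contexts"] for r in results],
        "ground_truth": [r["ground_truth"] for r in results],
    }


def score_with_ragas(columns: dict):
    """Run the three proposed metrics.

    Requires ragas, datasets, and a configured LLM backend; imports are
    deferred so the rest of this module loads without them.
    """
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import answer_relevancy, context_recall, faithfulness

    return evaluate(
        Dataset.from_dict(columns),
        metrics=[faithfulness, answer_relevancy, context_recall],
    )
```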

Proposed Project Structure

eval-pipeline/
├── components/
│   ├── load_golden_dataset.py    # KFP component — loads Q&A pairs
│   ├── run_inference.py          # KFP component — queries live agent
│   └── evaluate_with_ragas.py    # KFP component — RAGAS scoring
├── pipeline/
│   ├── eval_pipeline.py          # Assembles components via @dsl.pipeline
│   └── run.py                    # Entry point to trigger pipeline run
├── dataset/
│   └── golden_dataset.json       # Curated Kubeflow Q&A pairs
├── tests/
│   ├── test_load_dataset.py
│   ├── test_evaluate_ragas.py
│   └── fixtures/
└── docs/
    └── eval_pipeline.md          # Setup and usage guide
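The wiring in `eval_pipeline.py` could be sketched as follows (KFP v2 API; the component bodies are placeholders standing in for the real logic in `components/`, and the pipeline name is an assumption). The kfp import is deferred so the module loads even where kfp is not installed:

```python
def build_eval_pipeline():
    """Wire the three components sequentially into a KFP v2 pipeline."""
    from kfp import dsl

    @dsl.component
    def load_golden_dataset(dataset: dsl.Output[dsl.Dataset]):
        ...  # loads dataset/golden_dataset.json into the artifact

    @dsl.component
    def run_inference(dataset: dsl.Input[dsl.Dataset],
                      results: dsl.Output[dsl.Dataset]):
        ...  # queries the live docs-agent per question

    @dsl.component
    def evaluate_with_ragas(results: dsl.Input[dsl.Dataset],
                            report: dsl.Output[dsl.Metrics]):
        ...  # RAGAS scoring; report lands in the artifact store

    @dsl.pipeline(name="docs-agent-eval")
    def eval_pipeline():
        load_task = load_golden_dataset()
        infer_task = run_inference(dataset=load_task.outputs["dataset"])
        evaluate_with_ragas(results=infer_task.outputs["results"])

    return eval_pipeline
```

`run.py` would then compile and submit this, e.g. via `kfp.compiler.Compiler().compile(build_eval_pipeline(), "eval_pipeline.yaml")` followed by a client run submission.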

Question

Is this architectural direction aligned with the GSoC 2026 roadmap?
I am exploring this as part of my GSoC 2026 application and am
actively working towards building it out.

@jaiakash @chasecadet
