Proposal: Native KFP Evaluation Pipeline with RAGAS #127

Following the GSoC spec update regarding feedback loops and golden 
datasets, I'd like to propose building a native Kubeflow Evaluation 
Pipeline for the docs-agent.

## What

Build an automated evaluation pipeline using Kubeflow Pipelines (KFP) 
and RAGAS to continuously benchmark the agent's retrieval accuracy 
and generation quality against a golden dataset.

## Why

Currently there is no automated, CI/CD-ready way to measure whether 
a change to the pipeline, prompt, or retrieval logic made the agent 
better or worse.

A native KFP eval pipeline solves this by:
- Running automatically on every significant change
- Producing quantitative, comparable scores for retrieval accuracy and generation quality
- Fitting naturally into the existing Kubeflow infrastructure 
  alongside the ingestion pipeline
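
To make "better or worse" concrete, here is a minimal sketch of how two evaluation runs could be compared. The function name `compare_runs`, the `tolerance` parameter, and the shape of the score dictionaries are assumptions for illustration and not part of the existing docs-agent code.

```python
def compare_runs(baseline, candidate, tolerance=0.02):
    """Label each metric as improved, regressed, unchanged, or missing.

    `baseline` and `candidate` map metric names to mean scores in [0, 1].
    `tolerance` is a hypothetical noise band: score changes smaller than
    this are treated as unchanged, since LLM-judged metrics vary run to run.
    """
    verdicts = {}
    for metric, base_score in baseline.items():
        if metric not in candidate:
            verdicts[metric] = "missing"
            continue
        delta = candidate[metric] - base_score
        if delta > tolerance:
            verdicts[metric] = "improved"
        elif delta < -tolerance:
            verdicts[metric] = "regressed"
        else:
            verdicts[metric] = "unchanged"
    return verdicts


def has_regression(verdicts):
    """Return True if any metric regressed or was not reported."""
    return any(v in ("regressed", "missing") for v in verdicts.values())
```

A CI job could fail the build when `has_regression` returns `True`. The right tolerance would need to be tuned against real run-to-run variance.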

## Pipeline Architecture

The pipeline consists of three sequential KFP components:

**Component 1 — `load_golden_dataset`**
Loads curated Q&A pairs from storage and prepares them as pipeline input.
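
As a rough sketch of the loading logic, assuming the golden dataset is a JSON array of objects with `question` and `ground_truth` fields (the schema and the `GoldenExample` type are hypothetical), the body of this component could look like the following. In the real pipeline it would be wrapped as a KFP component. It is shown as plain Python here so it can be unit tested without a cluster.

```python
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class GoldenExample:
    """One curated Q&A pair (hypothetical schema)."""
    question: str
    ground_truth: str


def load_golden_dataset(path):
    """Read and validate Q&A pairs from a JSON file.

    Raises ValueError on malformed input so a bad dataset fails the
    pipeline early instead of producing misleading scores later.
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of Q&A records")

    examples = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {index} is not a JSON object")
        question = str(record.get("question", "")).strip()
        ground_truth = str(record.get("ground_truth", "")).strip()
        if not question or not ground_truth:
            raise ValueError(
                f"record {index} is missing 'question' or 'ground_truth'"
            )
        examples.append(GoldenExample(question, ground_truth))
    return examples
```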

**Component 2 — `run_inference`**
Sends each question to the live docs-agent API, collecting the 
generated answers and retrieved context chunks.
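
A minimal sketch of the inference loop is below. The request and response field names (`question`, `answer`, `contexts`) are assumptions, since the docs-agent API contract is not specified in this proposal. The HTTP call is passed in as a function so the loop can be tested with a stub instead of a live agent.

```python
import json
import urllib.request


def query_agent(question, endpoint):
    """POST one question to the agent and return its answer and contexts.

    The JSON field names here are assumed, not taken from the real API.
    """
    payload = json.dumps({"question": question}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read().decode("utf-8"))
    return {"answer": body["answer"], "contexts": body["contexts"]}


def run_inference(questions, ask):
    """Call `ask(question)` for each question and collect the results.

    A failed call is recorded with its error message rather than aborting
    the run, so one bad request does not discard the whole evaluation.
    """
    records = []
    for question in questions:
        try:
            result = ask(question)
            records.append({
                "question": question,
                "answer": result["answer"],
                "contexts": list(result["contexts"]),
                "error": None,
            })
        except Exception as exc:  # keep going; the error is kept in the record
            records.append({
                "question": question,
                "answer": "",
                "contexts": [],
                "error": str(exc),
            })
    return records
```

Against a live agent this would be called as `run_inference(questions, lambda q: query_agent(q, endpoint))`, where `endpoint` is whatever URL the deployed agent exposes.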

**Component 3 — `evaluate_with_ragas`**
Scores each response using RAGAS metrics:
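- Faithfulness: is every claim in the answer supported by the retrieved context, or did the agent hallucinate?
- Answer Relevancy: does the answer actually address the question?
- Context Recall: do the retrieved Milvus chunks cover the information in the reference answer?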

Outputs a results report to the Kubeflow artifact store.
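
One possible shape for the results report is sketched below. It assumes the scoring step yields one dictionary of metric scores per question. The key names, the `thresholds` argument, and the report layout are illustrative, and the RAGAS call itself is left out because it needs a configured judge LLM.

```python
import json
from statistics import mean

# Metric keys assumed for illustration; adjust to whatever the scorer emits.
METRICS = ("faithfulness", "answer_relevancy", "context_recall")


def summarize_scores(rows, thresholds):
    """Aggregate per-question scores into a pass/fail report.

    `rows` is a list of dicts mapping metric name to a score in [0, 1],
    or None when a sample could not be scored. `thresholds` maps metric
    name to the minimum acceptable mean (hypothetical quality gate).
    """
    summary = {}
    for metric in METRICS:
        values = [row[metric] for row in rows if row.get(metric) is not None]
        average = mean(values) if values else None
        threshold = thresholds.get(metric, 0.0)
        summary[metric] = {
            "mean": average,
            "scored": len(values),
            "threshold": threshold,
            "passed": average is not None and average >= threshold,
        }
    overall = all(entry["passed"] for entry in summary.values())
    return {"metrics": summary, "passed": overall}


def write_report(report, path):
    """Write the report as JSON, e.g. to a KFP output artifact path."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2)
```

Recording the number of scored samples next to each mean makes it visible when a metric was computed on only part of the dataset.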

## Proposed Project Structure
```
eval-pipeline/
├── components/
│   ├── load_golden_dataset.py    # KFP component: loads Q&A pairs
│   ├── run_inference.py          # KFP component: queries live agent
│   └── evaluate_with_ragas.py    # KFP component: RAGAS scoring
├── pipeline/
│   ├── eval_pipeline.py          # Assembles components via @dsl.pipeline
│   └── run.py                    # Entry point to trigger pipeline run
├── dataset/
│   └── golden_dataset.json       # Curated Kubeflow Q&A pairs
├── tests/
│   ├── test_load_dataset.py
│   ├── test_evaluate_ragas.py
│   └── fixtures/
└── docs/
    └── eval_pipeline.md          # Setup and usage guide
```

## Question

Is this architectural direction aligned with the GSoC 2026 roadmap? 
I am currently exploring this as part of my GSoC 2026 application 
and actively working towards building this out.

@jaiakash @chasecadet
