Description
Following the GSoC spec update regarding feedback loops and golden
datasets, I'd like to propose building a native Kubeflow Evaluation
Pipeline for the docs-agent.
What
Build an automated evaluation pipeline using Kubeflow Pipelines (KFP)
and RAGAS to continuously benchmark the agent's retrieval accuracy
and generation quality against a golden dataset.
Why
Currently there is no automated, CI/CD-ready way to measure whether
a change to the pipeline, prompt, or retrieval logic made the agent
better or worse.
A native KFP eval pipeline solves this by:
- Running automatically on every significant change
- Producing quantitative scores for retrieval accuracy and generation quality
- Fitting naturally into the existing Kubeflow infrastructure
alongside the ingestion pipeline
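To make the "better or worse" check concrete, here is a minimal sketch of how a CI step could compare a new run against a stored baseline. The function name `find_regressions`, the `tolerance` default, and the example scores are all hypothetical and only illustrate the idea; nothing like this exists in the repo yet.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.02,  # assumed allowance for run-to-run noise
) -> List[str]:
    """Return the metrics whose score dropped by more than `tolerance`."""
    regressions = []
    for metric, base_score in baseline.items():
        new_score = candidate.get(metric)
        # A metric that is missing from the new run is treated as a regression.
        if new_score is None or base_score - new_score > tolerance:
            regressions.append(metric)
    return regressions


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    baseline = {"faithfulness": 0.90, "answer_relevancy": 0.85, "context_recall": 0.80}
    candidate = {"faithfulness": 0.91, "answer_relevancy": 0.78, "context_recall": 0.80}
    failed = find_regressions(baseline, candidate)
    print("regressed metrics:", failed)
    # A CI job could exit non-zero here when `failed` is not empty.
```

A non-empty result would mark the change as a regression, which is what would let the pipeline act as a gate rather than just a report.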
Pipeline Architecture
Three sequential KFP components:
Component 1 — load_golden_dataset
Loads curated Q&A pairs from storage and prepares them as pipeline input.
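A rough sketch of the loading logic, written as a plain function that could later be wrapped as a KFP component. The record schema (`question` and `ground_truth` fields in a JSON array) is my assumption; the actual format of the golden dataset is still open.

```python
import json
from pathlib import Path
from typing import Dict, List

# Assumed schema: each record carries a question and a reference answer.
REQUIRED_FIELDS = ("question", "ground_truth")


def load_golden_dataset(path: str) -> List[Dict[str, str]]:
    """Read the golden dataset and fail fast on malformed records."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError("golden dataset must be a JSON array of records")

    cleaned = []
    for index, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {index} is not a JSON object")
        # Treat absent or blank fields the same way: both are unusable.
        missing = [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]
        if missing:
            raise ValueError(f"record {index} is missing fields: {missing}")
        # Keep only the known fields, with surrounding whitespace removed.
        cleaned.append({f: str(record[f]).strip() for f in REQUIRED_FIELDS})
    return cleaned
```

Validating up front means a broken dataset entry stops the run before any inference calls are made.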
Component 2 — run_inference
Sends each question to the live docs-agent API, collecting the
generated answers and retrieved context chunks.
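A minimal sketch of the inference step. The request body (`{"question": ...}`) and the response shape (`{"answer": ..., "contexts": [...]}`) are assumptions on my side, since the docs-agent API contract is not specified here; `ask_over_http` and the `ask` parameter are hypothetical names. The transport is passed in as a callable so the loop can be unit-tested without a live agent.

```python
import json
import urllib.request
from typing import Callable, Dict, List

AgentReply = Dict[str, object]


def ask_over_http(question: str, url: str, timeout: float = 60.0) -> AgentReply:
    """POST one question to the agent; the JSON shapes here are assumed."""
    payload = json.dumps({"question": question}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_inference(
    records: List[Dict[str, str]],
    ask: Callable[[str], AgentReply],
) -> List[Dict[str, object]]:
    """Query the agent once per golden record and collect what it returned."""
    results = []
    for record in records:
        reply = ask(record["question"])
        results.append(
            {
                "question": record["question"],
                "ground_truth": record["ground_truth"],
                "answer": reply.get("answer", ""),
                # Retrieved chunks are kept so retrieval can be scored later.
                "contexts": list(reply.get("contexts", [])),
            }
        )
    return results
```

Against a real deployment this could be called as `run_inference(records, lambda q: ask_over_http(q, agent_url))`, where `agent_url` is whatever endpoint the agent ends up exposing.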
Component 3 — evaluate_with_ragas
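Scores each response using RAGAS metrics:
- Faithfulness: is every claim in the answer supported by the retrieved context (i.e. no hallucination)?
- Answer Relevancy: does the answer actually address the question?
- Context Recall: do the retrieved Milvus chunks cover the information in the golden (reference) answer?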
Outputs a results report to the Kubeflow artifact store.
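A sketch of the reporting half of this component, assuming the per-sample scores have already been produced by RAGAS and are available as a list of dicts keyed by metric name. The RAGAS call itself is left out because its input format varies between versions. `summarize_scores`, `write_report`, and the report layout are hypothetical.

```python
import json
from pathlib import Path
from statistics import mean
from typing import Dict, List, Optional

METRICS = ("faithfulness", "answer_relevancy", "context_recall")


def summarize_scores(per_sample: List[Dict[str, Optional[float]]]) -> Dict[str, float]:
    """Average each metric over the samples that have a score for it."""
    summary = {}
    for metric in METRICS:
        # Skip samples where the metric could not be computed.
        values = [row[metric] for row in per_sample if row.get(metric) is not None]
        if values:
            summary[metric] = round(mean(values), 4)
    return summary


def write_report(per_sample: List[Dict[str, Optional[float]]], output_path: str) -> Dict[str, object]:
    """Write the aggregate and per-sample scores to a JSON report file."""
    report = {
        "num_samples": len(per_sample),
        "summary": summarize_scores(per_sample),
        "samples": per_sample,
    }
    Path(output_path).write_text(json.dumps(report, indent=2), encoding="utf-8")
    return report
```

Inside a KFP component, `output_path` would presumably be the path of an output artifact, so the report lands in the artifact store alongside the run.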
Proposed Project Structure
eval-pipeline/
├── components/
│   ├── load_golden_dataset.py   # KFP component: loads Q&A pairs
│   ├── run_inference.py         # KFP component: queries live agent
│   └── evaluate_with_ragas.py   # KFP component: RAGAS scoring
├── pipeline/
│   ├── eval_pipeline.py         # Assembles components via @dsl.pipeline
│   └── run.py                   # Entry point to trigger pipeline run
├── dataset/
│   └── golden_dataset.json      # Curated Kubeflow Q&A pairs
├── tests/
│   ├── test_load_dataset.py
│   ├── test_evaluate_ragas.py
│   └── fixtures/
└── docs/
    └── eval_pipeline.md         # Setup and usage guide
Question
Is this architectural direction aligned with the GSoC 2026 roadmap?
I am currently exploring this as part of my GSoC 2026 application
and actively working towards building this out.