The official implementation of the Kubeflow Documentation Assistant, an LLM-based assistant powered by Retrieval-Augmented Generation (RAG). This repository provides a comprehensive solution for Kubeflow users to search the documentation and get accurate, contextual answers to their queries.
Kubeflow users often struggle to find relevant information in the extensive documentation spread across different services, components, and repositories. Traditional keyword search lacks context and often returns irrelevant results. This documentation assistant addresses these challenges through:
- Semantic Search: Understanding the intent behind queries rather than just keyword matching
- Contextual Responses: Providing answers based on the most relevant documentation chunks
- Real-time Processing: Enabling instant responses through streaming APIs
- Scalable Architecture: Leveraging Kubernetes for automatic scaling and resource management
- 🔍 Intelligent Search: Semantic search across Kubeflow documentation
- 🤖 AI-Powered Responses: Contextual answers using Llama 3.1-8B model
- ⚡ Real-time Streaming: WebSocket and HTTP streaming support
- 🔧 Tool Calling: Automatic documentation lookup when needed
- 📊 Vector Database: Milvus for efficient similarity search
- 🚀 Kubernetes Native: Built for cloud-native environments
- 🔄 Automated ETL: Kubeflow Pipelines for data processing
Prerequisites:

- Kubernetes cluster (1.20+)
- Helm 3.x
- Kubeflow Pipelines
- GPU nodes (for LLM inference)
- SSL certificate (for HTTPS API)
Milvus is an open-source vector database designed for AI applications. It provides:
- High Performance: Optimized for vector similarity search
- Scalability: Horizontal scaling capabilities
- Multiple Index Types: Support for various vector indexing algorithms
- Cloud Native: Built for Kubernetes environments
- Multiple APIs: REST, gRPC, and Python SDK support
- Add the Helm repository:

  ```bash
  helm repo add zilliztech https://zilliztech.github.io/milvus-helm/
  helm repo update
  ```

- Install Milvus in standalone mode:

  ```bash
  helm upgrade --install my-release zilliztech/milvus -n santhosh \
    --set cluster.enabled=false \
    --set standalone.enabled=true \
    --set etcd.replicaCount=1 \
    --set etcd.persistence.enabled=false \
    --set minio.mode=standalone \
    --set minio.replicas=1 \
    --set pulsar.enabled=false \
    --set pulsarv3.enabled=false \
    --set standalone.podAnnotations."sidecar\.istio\.io/inject"="false"
  ```
Key Configuration:

- Standalone Mode: Single-node deployment for development/testing
- Single etcd Replica: Reduced resource usage with `etcd.persistence.enabled=false`
- Standalone MinIO: Single MinIO instance for object storage
- Disabled Pulsar: Not needed for standalone deployment
- Istio Sidecar Injection: Disabled to avoid networking issues
- Test the connection:

  ```python
  from pymilvus import connections

  connections.connect("default", host="my-release-milvus.santhosh.svc.cluster.local", port="19530")
  print("Connected to Milvus successfully!")
  ```
- Expose Milvus externally (if needed from other clusters; see the connection sketch below):

  ```bash
  kubectl expose service my-release-milvus \
    --name milvus-external \
    --type=NodePort \
    --port=19530
  ```
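With the NodePort service in place, clients outside the cluster can reach Milvus through any node's address and the assigned port. A minimal sketch; the node IP and port below are placeholders that depend on your cluster:

```python
from pymilvus import connections

# Placeholders: use a node IP reachable from the client and the NodePort
# assigned to the milvus-external service (kubectl get svc milvus-external -n santhosh).
connections.connect("default", host="<node-ip>", port="<node-port>")
print("Connected to Milvus from outside the cluster!")
```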
LLM inference is handled by KServe with the vLLM backend for high-performance serving.
```yaml
# manifests/serving-runtime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: llm-runtime
  namespace: santhosh
spec:
  supportedModelFormats:
    - name: huggingface
      version: "1"
      autoSelect: true
  containers:
    - name: kserve-container
      image: kserve/huggingfaceserver:latest-gpu
      command: ["python", "-m", "huggingfaceserver"]
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "6"
          memory: "24Gi"
          nvidia.com/gpu: "1"
```

```yaml
# manifests/inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama
  namespace: santhosh
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface
        version: "1"
      runtime: llm-runtime
      args:
        - --model_name=llama3.1-8B
        - --model_id=RedHatAI/Llama-3.1-8B-Instruct
        - --backend=vllm
        - --max-model-len=32768
        - --gpu-memory-utilization=0.90
        - --enable-auto-tool-choice
        - --tool-call-parser=llama3_json
        - --enable-tool-call-parser
      env:
        - name: HF_TOKEN
          valueFrom:
            secretKeyRef:
              name: huggingface-secret
              key: token
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
      resources:
        requests:
          cpu: "4"
          memory: "16Gi"
          nvidia.com/gpu: "1"
        limits:
          cpu: "6"
          memory: "24Gi"
          nvidia.com/gpu: "1"
```

Key Configuration:

- Tool Calling: Enabled with `--enable-auto-tool-choice` and `--enable-tool-call-parser`
- Custom Template: vLLM supports custom templates for different model formats
- Resource Allocation: GPU memory utilization set to 90% for optimal performance
- HuggingFace Token: Required for accessing the model
Connection Details:

```python
import os

KSERVE_URL = os.getenv("KSERVE_URL", "http://llama.santhosh.svc.cluster.local/openai/v1/chat/completions")
MODEL = os.getenv("MODEL", "llama3.1-8B")
```

For more details, refer to the KServe and vLLM documentation.
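The runtime exposes an OpenAI-compatible chat completions route, so any OpenAI-style client can talk to it. Below is a minimal non-streaming sketch using requests, assuming the standard OpenAI chat completions request/response schema:

```python
import os

import requests

KSERVE_URL = os.getenv("KSERVE_URL", "http://llama.santhosh.svc.cluster.local/openai/v1/chat/completions")
MODEL = os.getenv("MODEL", "llama3.1-8B")

# Minimal non-streaming chat completion request (OpenAI-compatible schema).
payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "You are a Kubeflow documentation assistant."},
        {"role": "user", "content": "What is KServe?"},
    ],
    "stream": False,
}

resp = requests.post(KSERVE_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```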
The ETL (Extract, Transform, Load) process is implemented as a Kubeflow Pipeline for automated, scalable data processing.
- Infrastructure Management: Kubernetes handles all infrastructure automatically
- Scalability: Auto-scaling based on workload demands
- Reproducibility: Version-controlled pipeline definitions
- Integration: Seamless integration with other Kubeflow components
- CI/CD Ready: Can be triggered via GitHub Actions or other automation tools
The pipeline consists of three main phases, implemented as the components below; a sketch of how they are wired together follows the component definitions.
```python
from kfp import dsl


@dsl.component(
    base_image="python:3.9",
    packages_to_install=["requests", "beautifulsoup4"]
)
def download_github_directory(
    repo_owner: str,
    repo_name: str,
    directory_path: str,
    github_token: str,
    github_data: dsl.Output[dsl.Dataset]
):
    # Fetches documentation files from GitHub repositories
    # Supports .md and .html files
    # Handles authentication and recursive directory traversal
    ...
```

```python
@dsl.component(
    base_image="pytorch/pytorch:2.3.0-cuda12.1-cudnn8-runtime",
    packages_to_install=["sentence-transformers", "langchain"]
)
def chunk_and_embed(
    github_data: dsl.Input[dsl.Dataset],
    repo_name: str,
    base_url: str,
    chunk_size: int,
    chunk_overlap: int,
    embedded_data: dsl.Output[dsl.Dataset]
):
    # Processes text with aggressive cleaning
    # Creates embeddings using sentence-transformers
    # Handles chunking with configurable overlap
    ...
```

```python
@dsl.component(
    base_image="python:3.9",
    packages_to_install=["pymilvus", "numpy"]
)
def store_milvus(
    embedded_data: dsl.Input[dsl.Dataset],
    milvus_host: str,
    milvus_port: str,
    collection_name: str
):
    # Creates Milvus collection with proper schema
    # Inserts vectors in batches for efficiency
    # Creates indexes for optimal search performance
    ...
```
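The components above are wired together in pipelines/kubeflow-pipeline.py. The exact definition is not reproduced here; the following is a minimal sketch of how the three phases might be connected, with the pipeline name chosen for illustration and defaults taken from the configuration tables later in this README:

```python
from kfp import compiler, dsl


@dsl.pipeline(name="kubeflow-docs-rag-etl")  # illustrative name
def docs_rag_pipeline(
    repo_owner: str = "kubeflow",
    repo_name: str = "website",
    directory_path: str = "content/en",
    github_token: str = "",
    chunk_size: int = 1000,
    chunk_overlap: int = 100,
    base_url: str = "https://www.kubeflow.org/docs",
    milvus_host: str = "my-release-milvus.santhosh.svc.cluster.local",
    milvus_port: str = "19530",
    collection_name: str = "docs_rag",
):
    # Phase 1: pull documentation files from GitHub
    download_task = download_github_directory(
        repo_owner=repo_owner,
        repo_name=repo_name,
        directory_path=directory_path,
        github_token=github_token,
    )
    # Phase 2: clean, chunk, and embed the documents
    embed_task = chunk_and_embed(
        github_data=download_task.outputs["github_data"],
        repo_name=repo_name,
        base_url=base_url,
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    # Phase 3: store the vectors in Milvus
    store_milvus(
        embedded_data=embed_task.outputs["embedded_data"],
        milvus_host=milvus_host,
        milvus_port=milvus_port,
        collection_name=collection_name,
    )


if __name__ == "__main__":
    compiler.Compiler().compile(docs_rag_pipeline, "kubeflow-pipeline.yaml")
```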
For Kubeflow Pipelines to access Milvus, proper RBAC permissions are required:

```bash
# Create role for Milvus access
kubectl create role milvus-access \
  --namespace santhosh \
  --verb=get,list,watch \
  --resource=services,endpoints

# Bind role to KFP service account
kubectl create rolebinding kfp-to-milvus-editor \
  --namespace santhosh \
  --role=milvus-access \
  --serviceaccount=kubeflow:default-editor
```

Note: Without these permissions, you'll encounter RBAC errors during the embedding phase.
A further improvement would be to expose the embedding model as a service that pipeline components call, rather than installing the heavy sentence-transformers package on every run (see the sketch after this list). This would:
- Reduce pipeline execution time
- Lower resource requirements
- Enable better caching and optimization
- Improve scalability
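No such service ships with this repository yet; purely as an illustration, a pipeline component could call a shared HTTP embedding endpoint along these lines (the service URL and response shape are hypothetical):

```python
import requests

# Hypothetical shared embedding service; not part of this repository.
EMBED_URL = "http://embedding-service.santhosh.svc.cluster.local:8080/embed"


def embed_texts(texts: list[str]) -> list[list[float]]:
    """Fetch embeddings from a shared service instead of loading
    sentence-transformers inside every pipeline run."""
    resp = requests.post(EMBED_URL, json={"texts": texts}, timeout=60)
    resp.raise_for_status()
    # Assumed response shape: {"embeddings": [[...], ...]}
    return resp.json()["embeddings"]
```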
Two API implementations are provided for different use cases:
WebSocket API (server/app.py)

Use Case: Real-time chat applications, interactive interfaces
Features:
- Bidirectional communication
- Real-time streaming responses
- Tool call execution with live updates
- Connection management and error handling
Key Components:
```python
async def handle_websocket(websocket, path):
    """Handle WebSocket connections with tool calling support"""
    # Manages connection lifecycle
    # Handles message routing and tool execution
    # Provides real-time streaming responses


async def stream_llm_response(payload, websocket, citations_collector):
    """Stream LLM responses with tool call handling"""
    # Processes streaming responses from KServe
    # Manages tool call accumulation and execution
    # Handles follow-up requests after tool execution
```

HTTPS API (server-https/app.py)

Use Case: RESTful integrations, server-to-server communication, web applications
Key Features:
- Dual Response Modes: Both streaming (Server-Sent Events) and non-streaming JSON responses
- RAG Integration: Automatic tool calling for Kubeflow documentation search
- CORS Support: Full cross-origin resource sharing for web applications
- FastAPI Framework: Automatic OpenAPI documentation and type validation
- Production Ready: Health checks, error handling, and Kubernetes integration
- Citation Management: Automatic collection and deduplication of source citations
API Endpoints:
Main Chat Endpoint:
@app.post("/chat")
async def chat(request: ChatRequest):
"""Main chat endpoint with RAG capabilities"""
# Supports both streaming and non-streaming responses
# Handles tool calling and citation collection
# Returns structured JSON responsesHealth Check Endpoint:
@app.get("/health")
async def health_check():
"""Health check for Kubernetes probes"""
# Essential for production deployments
# Used by readiness and liveness probesRequest/Response Models:
```python
from typing import Optional

from pydantic import BaseModel


class ChatRequest(BaseModel):
    message: str
    stream: Optional[bool] = True  # Default to streaming
```

```
# Streaming Response (SSE)
data: {"type": "content", "content": "response text"}
data: {"type": "tool_result", "tool_name": "search_kubeflow_docs", "content": "search results"}
data: {"type": "citations", "citations": ["url1", "url2"]}
data: {"type": "done"}

# Non-streaming Response
{
  "response": "Complete response text",
  "citations": ["url1", "url2"]  # or null if no citations
}
```

Advanced Features:
- Intelligent Tool Calling: Automatically determines when to search documentation based on query context (a retrieval sketch follows this list)
- Streaming Tool Execution: Real-time tool call execution with live updates
- Citation Tracking: Automatic collection and deduplication of source URLs
- Error Handling: Comprehensive error handling with detailed error messages
- CORS Configuration: Full CORS support for web application integration
- Resource Management: Proper connection pooling and cleanup for Milvus and KServe
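Under the hood, the search_kubeflow_docs tool amounts to a Milvus similarity search over the embedded documentation chunks. The repository's actual implementation is not shown here; the sketch below assumes the collection stores the chunk text and source URL next to the vector (field names, metric type, and index parameters are assumptions) and reuses the MILVUS_* and EMBEDDING_MODEL settings from the configuration table further down:

```python
import os

from pymilvus import Collection, connections
from sentence_transformers import SentenceTransformer

MILVUS_HOST = os.getenv("MILVUS_HOST", "my-release-milvus.santhosh.svc.cluster.local")
MILVUS_PORT = os.getenv("MILVUS_PORT", "19530")
COLLECTION = os.getenv("MILVUS_COLLECTION", "docs_rag")
EMBEDDER = SentenceTransformer(os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-mpnet-base-v2"))


def search_kubeflow_docs(query: str, top_k: int = 5) -> list[dict]:
    """Embed the query and return the most similar documentation chunks."""
    connections.connect("default", host=MILVUS_HOST, port=MILVUS_PORT)
    collection = Collection(COLLECTION)
    collection.load()

    query_vector = EMBEDDER.encode([query]).tolist()
    results = collection.search(
        data=query_vector,
        anns_field="embedding",                 # assumed vector field name
        param={"metric_type": "COSINE", "params": {"nprobe": 10}},  # assumed metric/params
        limit=top_k,
        output_fields=["text", "source_url"],   # assumed scalar fields
    )
    return [
        {"text": hit.entity.get("text"), "url": hit.entity.get("source_url")}
        for hit in results[0]
    ]
```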
Critical: Both APIs require SSL certificates from a trusted Certificate Authority. Without proper SSL certificates, browsers will block WebSocket connections and HTTPS requests.
- Deploy Milvus and KServe (as described above)
- Run the pipeline:

  ```bash
  python pipelines/kubeflow-pipeline.py
  ```

- Start the API server:

  ```bash
  # WebSocket API
  python server/app.py

  # HTTPS API
  python server-https/app.py
  ```
```javascript
const ws = new WebSocket('wss://your-domain.com:8000');

ws.onopen = function() {
  ws.send(JSON.stringify({
    message: "How do I create a Kubeflow pipeline?"
  }));
};

ws.onmessage = function(event) {
  const data = JSON.parse(event.data);
  switch (data.type) {
    case 'content':
      // Handle streaming content
      break;
    case 'citations':
      // Handle citations
      break;
    case 'done':
      // Handle completion
      break;
  }
};
```
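For scripted or server-side clients, the same WebSocket API can be exercised from Python. This is a minimal sketch using the third-party websockets package (not necessarily a dependency of this repository), assuming the same endpoint and message schema as the JavaScript example above:

```python
import asyncio
import json

import websockets  # third-party package; an assumption, not a repo dependency


async def ask(question: str) -> None:
    # Same wss endpoint and message schema as the JavaScript example above.
    async with websockets.connect("wss://your-domain.com:8000") as ws:
        await ws.send(json.dumps({"message": question}))
        async for raw in ws:
            data = json.loads(raw)
            if data["type"] == "content":
                print(data["content"], end="", flush=True)
            elif data["type"] == "citations":
                print("\nSources:", data["citations"])
            elif data["type"] == "done":
                break


asyncio.run(ask("How do I create a Kubeflow pipeline?"))
```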
Streaming Request (Server-Sent Events):

```bash
curl -X POST "https://your-domain.com/chat" \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{"message": "What is KServe?", "stream": true}'
```

Non-streaming Request (JSON Response):
```bash
curl -X POST "https://your-domain.com/chat" \
  -H "Content-Type: application/json" \
  -d '{"message": "What is KServe?", "stream": false}'
```

JavaScript Integration Example:
```javascript
// Streaming request
const response = await fetch('https://your-domain.com/chat', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Accept': 'text/event-stream'
  },
  body: JSON.stringify({
    message: 'How do I create a Kubeflow pipeline?',
    stream: true
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  const chunk = decoder.decode(value);
  const lines = chunk.split('\n');

  for (const line of lines) {
    if (line.startsWith('data: ')) {
      const data = JSON.parse(line.slice(6));
      switch (data.type) {
        case 'content':
          console.log('Content:', data.content);
          break;
        case 'tool_result':
          console.log('Tool:', data.tool_name, data.content);
          break;
        case 'citations':
          console.log('Citations:', data.citations);
          break;
        case 'done':
          console.log('Response complete');
          break;
      }
    }
  }
}
```

Python Integration Example:
```python
import requests

# Non-streaming request
response = requests.post(
    'https://your-domain.com/chat',
    json={
        'message': 'What is KServe?',
        'stream': False
    }
)

data = response.json()
print(f"Response: {data['response']}")
if data.get('citations'):
    print(f"Sources: {data['citations']}")
```
Environment Variables:

| Variable | Default | Description |
|---|---|---|
| `KSERVE_URL` | `http://llama.santhosh.svc.cluster.local/openai/v1/chat/completions` | KServe endpoint URL |
| `MODEL` | `llama3.1-8B` | Model name |
| `PORT` | `8000` | API server port |
| `MILVUS_HOST` | `my-release-milvus.santhosh.svc.cluster.local` | Milvus host |
| `MILVUS_PORT` | `19530` | Milvus port |
| `MILVUS_COLLECTION` | `docs_rag` | Milvus collection name |
| `EMBEDDING_MODEL` | `sentence-transformers/all-mpnet-base-v2` | Embedding model |
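These variables are read at startup with the defaults shown above. As a sketch of how the servers might load them (the grouping into module-level constants is an assumption):

```python
import os

# Defaults match the table above; override via environment variables.
KSERVE_URL = os.getenv("KSERVE_URL", "http://llama.santhosh.svc.cluster.local/openai/v1/chat/completions")
MODEL = os.getenv("MODEL", "llama3.1-8B")
PORT = int(os.getenv("PORT", "8000"))
MILVUS_HOST = os.getenv("MILVUS_HOST", "my-release-milvus.santhosh.svc.cluster.local")
MILVUS_PORT = os.getenv("MILVUS_PORT", "19530")
MILVUS_COLLECTION = os.getenv("MILVUS_COLLECTION", "docs_rag")
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-mpnet-base-v2")
```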
Pipeline Parameters:

| Parameter | Default | Description |
|---|---|---|
| `repo_owner` | `kubeflow` | GitHub repository owner |
| `repo_name` | `website` | GitHub repository name |
| `directory_path` | `content/en` | Documentation directory path |
| `chunk_size` | `1000` | Text chunk size for embedding |
| `chunk_overlap` | `100` | Overlap between chunks |
| `base_url` | `https://www.kubeflow.org/docs` | Base URL for citations |
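Once compiled, the pipeline can also be submitted with non-default parameters through the KFP SDK. A minimal sketch; the KFP endpoint and the compiled package path are assumptions that depend on your deployment:

```python
import kfp

# Endpoint and package path are assumptions; adjust for your deployment.
client = kfp.Client(host="http://ml-pipeline.kubeflow.svc.cluster.local:8888")

client.create_run_from_pipeline_package(
    "kubeflow-pipeline.yaml",
    arguments={
        "repo_owner": "kubeflow",
        "repo_name": "website",
        "directory_path": "content/en",
        "chunk_size": 1000,
        "chunk_overlap": 100,
        "base_url": "https://www.kubeflow.org/docs",
    },
)
```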
Currently, the system uses browser local storage for chat history management to:
- Reduce Server Overhead: No server-side storage requirements
- Improve Performance: Client-side handling of chat state
- Ensure Privacy: Data stays on the user's device
Future Enhancements:

- Chat History Summarization: Implement conversation summarization to prevent token overflow
- Persistent Storage: Optional server-side chat history storage
- Multi-session Support: Support for multiple concurrent chat sessions
Common Issues:

- RBAC Errors: Ensure proper service account permissions are set
- SSL Certificate Issues: Verify certificate validity and browser trust
- GPU Resource Constraints: Check GPU availability and memory allocation
- Milvus Connection: Verify network connectivity and service discovery
```bash
# Check Milvus status
kubectl get pods -n santhosh | grep milvus

# Check KServe status
kubectl get inferenceservice -n santhosh

# Check API server logs
kubectl logs -f deployment/docs-assistant-api

# Test Milvus connection
python -c "from pymilvus import connections; connections.connect('default', host='your-milvus-host', port='19530'); print('Connected!')"
```

We welcome contributions! Please see our contributing guidelines for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Francisco Javier Arceo - Project mentor and guidance
- Chase Christensen - Project mentor and technical support
- Google Summer of Code (GSoC) for providing this incredible opportunity
- Red Hat AI for providing the Llama 3.1-8B model
- Hugging Face for the model hosting and sentence transformers library
- Oracle Cloud Infrastructure (OCI) for providing cloud resources and infrastructure
- Kubeflow Community for the KEP-867 proposal
- Milvus for the vector database
- KServe for model serving
- vLLM for high-performance LLM inference