A simplistic sandbox that includes an even more simplistic evaluation framework (internally it uses DeepEval) for testing Large Language Model (LLM) agents running on a Llama Stack Distribution. This repository provides tools for connecting to multiple LLMs and MCP (Model Context Protocol) servers, running structured evaluations using CSV test datasets, and generating detailed evaluation visualizations.
This sandbox environment enables:
- Local Llama Stack Distribution: Run Llama Stack containers locally with multiple LLM providers
- MCP Server Integration: Connect to Model Context Protocol servers for enhanced tool capabilities
- Structured Testing: Define and execute test cases using CSV files with expected outcomes
- Multi-Metric Evaluation: Assess agent performance across QA accuracy, tool selection, parameter handling, and response quality
- Interactive Visualization: Generate comprehensive dashboards and charts for evaluation results analysis
- Modular Architecture: Clean separation between container management, evaluation, and visualization components
┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
│    LLM Models     │     │    Llama Stack    │     │    MCP Servers    │
│                   │     │   Distribution    │     │                   │
│ • Llama 3.1 8B    │◄───►│                   │◄───►│ • Compatibility   │
│ • Granite 3.3     │     │ • Agents API      │     │ • Eligibility     │
│ • Scout 17B       │     │ • Tool Runtime    │     │ • Custom Tools    │
└───────────────────┘     │ • Vector Store    │     └───────────────────┘
                          │ • Safety          │
                          └───────────────────┘
                                    │
                                    ▼
                        ┌───────────────────────┐
                        │   Evaluation Engine   │
                        │                       │
                        │ • DeepEval Framework  │
                        │ • Custom Metrics      │
                        │ • CSV Test Runner     │
                        │ • Result Analyzer     │
                        └───────────────────────┘
                                    │
                                    ▼
                        ┌───────────────────────┐
                        │     Visualization     │
                        │                       │
                        │ • Interactive Plots   │
                        │ • HTML Dashboards     │
                        │ • Performance Charts  │
                        └───────────────────────┘
- Python 3.10-3.11 (required for dependency compatibility)
- UV (recommended) or pip for dependency management
- Podman (for running Llama Stack Distribution)
- curl and jq (for API testing)
git clone <repository-url>
cd llama-stack-sandbox
# Install using uv (recommended)
uv sync
# OR install using pip
pip install -r requirements.txt

Create a .env file with your model and MCP server configurations:
CAUTION: This example .env assumes the MCP servers run locally, hence the use of $(hostname); change it if necessary.
NOTE: This sandbox has been tested with models served by vLLM, which exposes the OpenAI-compatible completions API.
# Model configurations (add as many as needed)
MODEL_1_URL=https://your-llama-model-endpoint.com/v1
MODEL_1_API_TOKEN=your_api_token_here
MODEL_1_MODEL=llama-3-1-8b-w4a16
MODEL_1_MAX_TOKENS=4096
MODEL_1_TLS_VERIFY=false
MODEL_2_URL=https://your-granite-model-endpoint.com/v1
MODEL_2_API_TOKEN=another_api_token
MODEL_2_MODEL=granite-3-3-8b
MODEL_2_MAX_TOKENS=4096
MODEL_2_TLS_VERIFY=false
# MCP Server configurations (more on this later)
MCP_SERVER_1_ID=mcp::eligibility
MCP_SERVER_1_URI=http://$(hostname):8001/sse
MCP_SERVER_2_ID=mcp::compatibility
MCP_SERVER_2_URI=http://$(hostname):8002/sse
# Required for Llama Stack
export LLAMA_STACK_HOST=$(hostname)
export LLAMA_STACK_PORT=8080
export MILVUS_DB_PATH=/opt/app-root/.milvus/milvus.db
export FMS_ORCHESTRATOR_URL=http://localhost
export INFERENCE_MODEL=${MODEL_1_MODEL}

We provide a couple of MCP Servers:
- Eligibility Engine: An example Model Context Protocol (MCP) server developed in Rust that demonstrates how to evaluate complex business rules using the ZEN Engine decision engine. The purpose of this MCP server is to evaluate whether you are eligible for unpaid leave aid under certain circumstances.
- Compatibility Engine: An example Model Context Protocol (MCP) server developed in Rust that provides five strongly-typed calculation and compatibility functions: calc_penalty, calc_tax, check_voting, distribute_waterfall, check_housing_grant
Open a terminal and run the Eligibility MCP Server:
podman run -it --rm --name eligibility \
-e RUST_LOG=debug -e BIND_ADDRESS=0.0.0.0:8001 \
  quay.io/atarazana/eligibility-engine-mcp-rs:latest

Open a second terminal and run the Compatibility MCP Server:
podman run -it --rm --name compatibility \
-e RUST_LOG=debug -e BIND_ADDRESS=0.0.0.0:8002 \
  quay.io/atarazana/compatibility-engine-mcp-rs:latest

# Generate configuration and start the stack
./run.sh

This script will:
- Auto-discover models and MCP servers from .env
- Generate a dynamic run.yaml configuration
- Start the Llama Stack Distribution container
- Test connectivity to all configured models
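For reference, the auto-discovery relies on nothing more than the numbered MODEL_N_* and MCP_SERVER_N_* naming convention shown above. A minimal, hypothetical sketch of that idea (not the repository's run/config.py):

```python
import os

def discover_numbered(prefix: str, fields: tuple) -> list:
    """Collect PREFIX_<n>_<FIELD> variables until a consecutive number is missing."""
    entries, n = [], 1
    while f"{prefix}_{n}_{fields[0]}" in os.environ:
        entries.append({field.lower(): os.environ.get(f"{prefix}_{n}_{field}") for field in fields})
        n += 1
    return entries

models = discover_numbered("MODEL", ("URL", "API_TOKEN", "MODEL", "MAX_TOKENS", "TLS_VERIFY"))
mcp_servers = discover_numbered("MCP_SERVER", ("ID", "URI"))
print(f"Discovered {len(models)} model(s) and {len(mcp_servers)} MCP server(s)")
```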
Create test cases in CSV format with the following structure:
question,expected_answer,tool_name,tool_parameters,evaluation_criteria,category
"Calculate penalty for 15 days late delivery","1050 total penalty. Base: 1500, capped at 1000. Interest: 50.",calc_penalty,"{""days_late"": 15}","Correct tool, accurate calculation, mentions cap",Penalty Calculations
"Check tax for 40000 income","7140 total tax. Bracket 1: 1000. Bracket 2: 6000. Surcharge: 140.",calc_tax,"{""income"": 40000}","Progressive calculation, surcharge applied",Tax Calculations
The framework includes pre-built test cases for:
- Penalty Calculations: Late payment and delivery penalties with caps and interest
- Tax Calculations: Progressive tax brackets with surcharges
- Voting Validations: Meeting quorum and threshold validations
- Waterfall Distributions: Financial distribution calculations
- Housing Grant Eligibility: Multi-criteria eligibility assessments
- scratch/compatibility-full.csv - Complete test suite (21 test cases)
- scratch/compatibility.csv - Subset for quick testing
NOTE: The default output directory is evaluation_results.
# Run evaluation with default settings
./evaluate.sh
# Run with specific parameters 1
./evaluate.sh run -c scratch/compatibility-full.csv -m llama-3-1-8b-w4a16 -v
# Run with specific parameters 2
./evaluate.sh --csv scratch/compatibility-full.csv --url https://lsd-route-eligibility-mcp-llamastack.apps.acme.com --model llama-3-1-8b-w4a16

# Using uv (recommended)
uv run -m evaluate scratch/compatibility-full.csv \
--model "llama-3-1-8b-w4a16" \
--stack-url "http://localhost:8080" \
--tools "mcp::compatibility" \
--output "results/my_evaluation.json" \
  --verbose

- -c, --csv FILE: Test case CSV file path
- -m, --model MODEL: LLM model identifier
- -t, --tools TOOLS: Space-separated tool groups
- -u, --url URL: Llama Stack server URL
- -o, --output FILE: Results output file
- -v, --verbose: Enable detailed logging
The framework uses custom DeepEval metrics to assess agent performance:
- Tool Selection Accuracy: Did the agent select the correct tool?
- Parameter Accuracy: Were the tool parameters extracted correctly?
- Response Accuracy: Does the response match expected outcomes?
- Comprehensive Evaluation: Combined semantic and structural analysis
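Each of these is implemented as a custom DeepEval metric in evaluate/metrics.py. Purely as a hypothetical sketch of the general pattern (assuming DeepEval's BaseMetric interface; the constructor parameters and the tool-lookup callable below are illustrative, not the repository's API):

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class ToolSelectionSketch(BaseMetric):
    """Illustrative only: scores 1.0 when the agent called the expected tool."""

    def __init__(self, expected_tool: str, get_called_tool, threshold: float = 1.0):
        self.expected_tool = expected_tool
        self.get_called_tool = get_called_tool  # callable: test case -> tool name the agent invoked
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        called = self.get_called_tool(test_case)
        self.score = 1.0 if called == self.expected_tool else 0.0
        self.success = self.score >= self.threshold
        self.reason = f"expected '{self.expected_tool}', agent called '{called}'"
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "ToolSelectionSketch"
```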
These core metrics have predefined thresholds that you can change; they determine whether a metric result is considered successful:
metrics = [
ToolSelectionMetric(agent_wrapper=self.agent_wrapper, threshold=1.0), # 100%
ParameterAccuracyMetric(agent_wrapper=self.agent_wrapper, threshold=0.9), # 90%
ResponseAccuracyMetric(agent_wrapper=self.agent_wrapper, threshold=0.7), # 70%
EvaluationCriteriaMetric(agent_wrapper=self.agent_wrapper, threshold=0.7), # 70%
ComprehensiveEvaluationMetric(agent_wrapper=self.agent_wrapper, threshold=0.7) # 70%
]

Evaluation results include:
- Overall Statistics: Pass rates, average scores, timing metrics
- Per-Test Details: Individual scores, tool calls, response analysis
- Category Analysis: Performance breakdown by test type
- Error Analysis: Common failure patterns and issues
{
"summary": {
"total_test_cases": 21,
"successful_evaluations": 21,
"failed_evaluations": 0,
"success_rate": 1.0
},
"metric_averages": {
"ToolSelectionMetric": {
"average_score": 1.0,
"success_rate": 1.0
},
"ParameterAccuracyMetric": {
"average_score": 0.9761904761904762,
"success_rate": 1.0
},
"ResponseAccuracyMetric": {
"average_score": 0.955391351943076,
"success_rate": 0.9047619047619048
},
"EvaluationCriteriaMetric": {
"average_score": 0.8026984126984128,
"success_rate": 0.9047619047619048
},
"ComprehensiveEvaluationMetric": {
"average_score": 0.9750136836343734,
"success_rate": 1.0
}
},
"category_results": {
"Eligibility Calculations": [
{
"test_case_index": 0,
"input": "My mother had an accident and she's at the hospital. I have to take care of her, can I get access to the unpaid leave aid?",
"expected_output": "The unpaid leave aid is potentially eligible for you, as you are taking care of your mother who had an accident and is hospitalized. The monthly benefit would be 725 euros..",
"actual_output": "You are eligible for the unpaid leave aid. You will receive a monthly benefit of 725β¬. The person must have been hospitalized and the care of the person must be continued.",
"original_test_case": {
"question": "My mother had an accident and she's at the hospital. I have to take care of her, can I get access to the unpaid leave aid?",
"expected_answer": "The unpaid leave aid is potentially eligible for you, as you are taking care of your mother who had an accident and is hospitalized. The monthly benefit would be 725 euros..",
"tool_name": "evaluate_unpaid_leave_eligibility",
"tool_parameters": {
"is_single_parent": "false",
"relationship": "son|daughter",
"situation": "accident|injury|illness",
"total_children_after": "0"
},
"evaluation_criteria": "correct tool selection, accurate calculation, mentions hospitalized|hospital, case A",
"category": "Eligibility Calculations"
},
"metric_results": {
"ToolSelectionMetric": {
"score": 1.0,
"success": true,
"reason": "Correctly selected tool: evaluate_unpaid_leave_eligibility",
"strict_mode": false
},
"ParameterAccuracyMetric": {
"score": 1.0,
"success": true,
"reason": "4/4 parameters correct",
"strict_mode": false
},
"ResponseAccuracyMetric": {
"score": 1.0,
"success": true,
"reason": "Status matches; Main amount accuracy: 1.00; Warning presence matches",
"strict_mode": false
},
"EvaluationCriteriaMetric": {
"score": 0.75,
"success": true,
"reason": "β correct tool selection: Tool selection handled by other metric | β accurate calculation: Calculation accuracy: 1/1 numbers match | β mentions hospitalized|hospital: Found 'hospitalized' from ORed values: hospitalized|hospital | β case A: Missing case A reference",
"strict_mode": false
},
"ComprehensiveEvaluationMetric": {
"score": 1.0,
"success": true,
"reason": "Tool Selection (30.0%): 1.00 - Correctly selected tool: evaluate_unpaid_leave_eligibility | Parameter Accuracy (30.0%): 1.00 - 4/4 parameters correct | Response Accuracy (40.0%): 1.00 - Status matches; Main amount accuracy: 1.00; Warning presence matches | Weighted Score: 1.000",
"strict_mode": false
}
}
},
...
],
"configuration": {
"model_id": "llama-3-1-8b-w4a16",
"tool_groups": [
"mcp::compatibility-engine",
"mcp::eligibility-engine"
],
"stack_url": "https://eligibility-lsd-route-eligibility-mcp-llamastack.apps.acme.com"
}
}

# Create comprehensive dashboard (using uv)
uv run -m visualize evaluation_results/evaluation_results_TIMESTAMP.json
# Create specific chart types
uv run -m visualize visualize results.json --type summary
# Upload to DeepEval cloud dashboard
uv run -m visualize dashboard results.json --login
# Open the latest dashboards automatically (it will find the latest results in `evaluation_results` folder)
./visualize.sh

- Performance Overview: Success rates, score distributions
- Category Analysis: Performance by test category
- Timeline Charts: Response times and processing metrics
- Comparison Plots: Multi-model performance comparison
- Interactive Dashboards: HTML reports with drill-down capabilities
- Interactive Plots: Plotly-based charts with zoom, filter, and export
- Performance Tables: Sortable results with detailed breakdowns
- Error Analysis: Failure pattern identification and categorization
- Export Options: PNG, SVG, PDF, and raw data downloads
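The charts are built from the results JSON described above. As a small standalone example (using plotly.express directly rather than the visualize module), the per-metric averages can be plotted like this:

```python
import json
import plotly.express as px

# Replace TIMESTAMP with the timestamp of your actual run
with open("evaluation_results/evaluation_results_TIMESTAMP.json") as f:
    results = json.load(f)

metrics = results["metric_averages"]
fig = px.bar(
    x=list(metrics.keys()),
    y=[m["average_score"] for m in metrics.values()],
    labels={"x": "Metric", "y": "Average score"},
    title="Average score per metric",
)
fig.write_html("metric_averages.html")  # open in a browser
```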
llama-stack-sandbox/
├── 📦 Core Framework
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── pyproject.toml               # Modern Python config
├── .env                         # Environment variables (create this)
│
├── 🐳 Infrastructure (Container & Stack Management)
├── run.sh                       # Llama Stack Distribution launcher
├── playground.sh                # Interactive testing environment
├── run/
│   ├── __init__.py
│   ├── __main__.py              # Entry point: uv run -m run
│   ├── config.py                # Configuration parsing
│   └── yaml_generator.py        # Dynamic run.yaml generation
├── templates/
│   └── run.yaml.template        # Jinja2 template
├── run.yaml                     # Generated config (auto-created)
│
├── 🧪 Evaluation Framework
├── evaluate.sh                  # Evaluation runner script
├── evaluate/
│   ├── __init__.py
│   ├── __main__.py              # Entry point: uv run -m evaluate
│   ├── evaluator.py             # Main evaluation orchestrator
│   ├── metrics.py               # Custom DeepEval metrics (1000+ lines)
│   ├── wrapper.py               # Agent wrapper and session management
│   ├── config.py                # Evaluation configuration
│   ├── loader.py                # CSV test case loading utilities
│   └── examples.py              # Example usage and demonstrations
│
├── 📊 Visualization & Reporting
├── visualize.sh                 # Quick dashboard opener
├── visualize/
│   ├── __init__.py
│   ├── __main__.py              # Entry point: uv run -m visualize
│   ├── results.py               # Chart and dashboard generator
│   └── dashboard.py             # DeepEval cloud dashboard integration
│
├── 🗄️ Data & Configuration
├── scratch/
│   ├── compatibility-full.csv   # Complete test suite (21 cases)
│   ├── compatibility.csv        # Quick test subset
│   └── compatibility-errors.csv # Error scenarios
├── configs/
│   └── sample_evaluation_config.yaml
├── docs/                        # Knowledge base documents
│   ├── LyFin-Compliance-Annex.md
│   └── ...
│
├── 📁 Results & Outputs (auto-created)
├── evaluation_results/          # JSON results with timestamps
│   └── evaluation_results_YYYYMMDD_HHMMSS.json
├── logs/                        # Detailed execution logs
├── reports/                     # Generated analysis reports
└── visualizations/              # HTML dashboards and charts
    ├── comprehensive_dashboard.html
    ├── detailed_analysis.html
    └── summary_dashboard.html
The run.yaml file defines:
- Model Providers: Remote VLLM endpoints, embedding models
- Tool Runtime: MCP server connections, search providers
- Storage: Vector databases, agent persistence, telemetry
- APIs: Enabled Llama Stack APIs (agents, inference, eval, etc.)
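Under the hood, run.yaml is rendered from templates/run.yaml.template, a Jinja2 template. A rough, hypothetical sketch of that rendering step (the actual logic lives in run/yaml_generator.py, and the template variable names below are illustrative):

```python
from pathlib import Path
from jinja2 import Template

# In the real flow these lists come from the .env auto-discovery step.
models = [{"model": "llama-3-1-8b-w4a16", "url": "https://your-llama-model-endpoint.com/v1"}]
mcp_servers = [{"id": "mcp::eligibility", "uri": "http://localhost:8001/sse"}]

template = Template(Path("templates/run.yaml.template").read_text())
Path("run.yaml").write_text(template.render(models=models, mcp_servers=mcp_servers))
```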
Generated dynamically from .env using:
uv run -m run

Configure evaluation behavior in configs/sample_evaluation_config.yaml:
model_config:
default_model: "llama-3-1-8b-w4a16"
tool_groups: ["mcp::compatibility"]
evaluation_settings:
metrics: ["tool_selection", "parameter_accuracy", "response_accuracy"]
thresholds:
tool_selection: 1.0
parameter_accuracy: 0.95
response_accuracy: 0.8
output_settings:
save_intermediate: true
generate_visualizations: true
  create_html_dashboard: true
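Since this is plain YAML, the settings can also be read programmatically, e.g. to reuse the thresholds elsewhere (a small sketch assuming PyYAML is installed):

```python
import yaml

with open("configs/sample_evaluation_config.yaml") as f:
    config = yaml.safe_load(f)

print(config["model_config"]["default_model"])      # llama-3-1-8b-w4a16
print(config["evaluation_settings"]["thresholds"])  # per-metric pass/fail thresholds
```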
- Setup Environment:

  # Install dependencies with uv (recommended)
  uv sync

  # Configure your models and MCP servers in .env
  cp .env.example .env
  # Edit .env with your configurations

- Start Llama Stack:

  ./run.sh

- Run Quick Test:

  ./evaluate.sh test

- Execute Full Evaluation:

  ./evaluate.sh run -c scratch/compatibility-full.csv -v

- View Results:

  # Generate and open dashboard
  ./visualize.sh
# Run evaluations with different models
./evaluate.sh run -m llama-3-1-8b-w4a16 -o results_llama.json
./evaluate.sh run -m granite-3-3-8b -o results_granite.json
# Compare results using the visualization module
uv run -m visualize visualize results_llama.json --type detailed
uv run -m visualize visualize results_granite.json --type detailed
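Beyond the dashboards, the two result files can also be compared directly in a few lines of Python, since both share the metric_averages structure shown earlier (a hedged sketch; file names follow the commands above):

```python
import json

def metric_averages(path: str) -> dict:
    """Extract the average score per metric from an evaluation results file."""
    with open(path) as f:
        return {name: m["average_score"] for name, m in json.load(f)["metric_averages"].items()}

llama = metric_averages("results_llama.json")
granite = metric_averages("results_granite.json")
for name in llama:
    print(f"{name:32s} llama={llama[name]:.3f}  granite={granite.get(name, float('nan')):.3f}")
```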
- Create CSV File:

  question,expected_answer,tool_name,tool_parameters,evaluation_criteria,category
  "Your test question","Expected response","tool_name","{\"param\":\"value\"}","Criteria for success","Custom Category"

- Run Evaluation:

  ./evaluate.sh run -c your_tests.csv

- Analyze Results:

  uv run -m visualize your_results.json
import asyncio

# Import the evaluation framework
from evaluate.evaluator import LlamaStackEvaluator
from evaluate.metrics import ToolSelectionMetric
from evaluate.loader import CSVTestCaseLoader

async def main():
    # Initialize evaluator
    evaluator = LlamaStackEvaluator(
        stack_url="http://localhost:8080",
        model_id="llama-3-1-8b-w4a16",
        tool_groups=["mcp::compatibility"]
    )

    # Run evaluation (run_evaluation is a coroutine, hence the await)
    results = await evaluator.run_evaluation(
        csv_file_path="scratch/compatibility-full.csv",
        output_file="my_results.json",
        verbose=True
    )

asyncio.run(main())

# Automated evaluation pipeline
./evaluate.sh validate -c tests/regression_tests.csv
./evaluate.sh run -c tests/regression_tests.csv --output results/ci_results.json
uv run -m visualize visualize results/ci_results.json --type summary

- Llama Stack Connection Failed:

  # Check if container is running
  podman ps | grep llama-stack

  # Check logs
  podman logs llama-stack

  # Test connectivity
  ./evaluate.sh test

- Model Authentication Errors:
  - Verify API tokens in the .env file
  - Check model endpoint URLs are accessible
  - Ensure TLS settings match your endpoints

- MCP Server Connection Issues:
  - Confirm MCP servers are running on specified ports
  - Check firewall settings for localhost connections
  - Verify MCP server URIs in .env

- Python Module Import Errors:

  # Ensure proper installation
  uv sync

  # Check if modules are accessible
  uv run python -c "from evaluate import evaluator; print('OK')"

- Evaluation Failures:

  # Run with verbose logging
  ./evaluate.sh run -v

  # Check specific module
  uv run -m evaluate --help

  # Validate test CSV format
  ./evaluate.sh validate -c your_test_file.csv
- Separated Concerns: Infrastructure (run/), Evaluation (evaluate/), Visualization (visualize/)
- Python Packages: Each component is now a proper Python package with __main__.py entry points
- UV Integration: Full support for modern Python dependency management with UV
- Simplified Entry Points: uv run -m evaluate, uv run -m visualize, uv run -m run
- Enhanced Shell Scripts: Updated evaluate.sh, visualize.sh, and run.sh with better error handling
- Backward Compatibility: Old usage patterns still supported during transition
- Clear Directory Structure: Related files grouped logically
- Legacy Preservation: Old files moved to the old/ directory for reference
- Auto-Generated Outputs: Results and visualizations organized systematically
- Llama Stack Documentation: Official Llama Stack Docs
- DeepEval Framework: DeepEval Documentation
- Model Context Protocol: MCP Specification
- UV Package Manager: UV Documentation
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Update documentation
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Happy Evaluating!
For questions or issues, please check the troubleshooting section or create an issue in the repository.