A simplistic sandbox that includes an even more simplistic evaluation framework (internally it uses DeepEval) for testing Large Language Model (LLM) agents running on a Llama Stack Distribution. This repository provides tools for connecting to multiple LLMs and MCP (Model Context Protocol) servers, running structured evaluations using CSV test datasets, and generating detailed evaluation visualizations.
This sandbox environment enables:
- Local Llama Stack Distribution: Run Llama Stack containers locally with multiple LLM providers
- MCP Server Integration: Connect to Model Context Protocol servers for enhanced tool capabilities
- Structured Testing: Define and execute test cases using CSV files with expected outcomes
- Multi-Metric Evaluation: Assess agent performance across QA accuracy, tool selection, parameter handling, and response quality
- Interactive Visualization: Generate comprehensive dashboards and charts for evaluation results analysis
- Modular Architecture: Clean separation between container management, evaluation, and visualization components
┌───────────────────┐     ┌───────────────────┐     ┌───────────────────┐
│    LLM Models     │     │    Llama Stack    │     │    MCP Servers    │
│                   │     │   Distribution    │     │                   │
│ • Llama 3.1 8B    │◄───►│                   │◄───►│ • Compatibility   │
│ • Granite 3.3     │     │ • Agents API      │     │ • Eligibility     │
│ • Scout 17B       │     │ • Tool Runtime    │     │ • Custom Tools    │
└───────────────────┘     │ • Vector Store    │     └───────────────────┘
                          │ • Safety          │
                          └───────────────────┘
                                    │
                                    ▼
                        ┌───────────────────────┐
                        │   Evaluation Engine   │
                        │                       │
                        │ • DeepEval Framework  │
                        │ • Custom Metrics      │
                        │ • CSV Test Runner     │
                        │ • Result Analyzer     │
                        └───────────────────────┘
                                    │
                                    ▼
                        ┌───────────────────────┐
                        │     Visualization     │
                        │                       │
                        │ • Interactive Plots   │
                        │ • HTML Dashboards     │
                        │ • Performance Charts  │
                        └───────────────────────┘
- Python 3.10-3.11 (required for dependency compatibility)
- UV (recommended) or pip for dependency management
- Podman (for running Llama Stack Distribution)
- curl and jq (for API testing)
git clone <repository-url>
cd llama-stack-sandbox
# Install using uv (recommended)
uv sync
# OR install using pip
pip install -r requirements.txt

Create a .env file with your model and MCP server configurations:
CAUTION: This example .env assumes the MCP servers run locally, hence the use of $(hostname); change it if necessary.
NOTE: This sandbox has been tested with models served by vLLM, which exposes the OpenAI-compatible completions API.
# Model configurations (add as many as needed)
MODEL_1_URL=https://your-llama-model-endpoint.com/v1
MODEL_1_API_TOKEN=your_api_token_here
MODEL_1_MODEL=llama-3-1-8b-w4a16
MODEL_1_MAX_TOKENS=4096
MODEL_1_TLS_VERIFY=false
MODEL_2_URL=https://your-granite-model-endpoint.com/v1
MODEL_2_API_TOKEN=another_api_token
MODEL_2_MODEL=granite-3-3-8b
MODEL_2_MAX_TOKENS=4096
MODEL_2_TLS_VERIFY=false
# MCP Server configurations (more on this later)
MCP_SERVER_1_ID=mcp::eligibility
MCP_SERVER_1_URI=http://$(hostname):8001/sse
MCP_SERVER_2_ID=mcp::compatibility
MCP_SERVER_2_URI=http://$(hostname):8002/sse
# Required for Llama Stack
export LLAMA_STACK_HOST=$(hostname)
export LLAMA_STACK_PORT=8080
export MILVUS_DB_PATH=/opt/app-root/.milvus/milvus.db
export FMS_ORCHESTRATOR_URL=http://localhost
export INFERENCE_MODEL=${MODEL_1_MODEL}

We provide a couple of MCP Servers:
- Eligibility Engine: An example Model Context Protocol (MCP) server developed in Rust that demonstrates how to evaluate complex business rules using the ZEN Engine decision engine. The purpose of this MCP server is to evaluate whether you are eligible for unpaid leave aid under certain circumstances.
- Compatibility Engine: An example Model Context Protocol (MCP) server developed in Rust that provides five strongly-typed calculation and compatibility functions: calc_penalty, calc_tax, check_voting, distribute_waterfall, check_housing_grant
Open a terminal and run the Eligibility MCP Server:
podman run -it --rm --name eligibility \
-e RUST_LOG=debug -e BIND_ADDRESS=0.0.0.0:8001 \
  quay.io/atarazana/eligibility-engine-mcp-rs:latest

Open a second terminal and run the Compatibility MCP Server:
podman run -it --rm --name compatibility \
-e RUST_LOG=debug -e BIND_ADDRESS=0.0.0.0:8002 \
  quay.io/atarazana/compatibility-engine-mcp-rs:latest

# Generate configuration and start the stack
./run.sh

This script will:
- Auto-discover models and MCP servers from .env
- Generate a dynamic run.yaml configuration
- Start the Llama Stack Distribution container
- Test connectivity to all configured models
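For reference, the auto-discovery relies on nothing more than the numbered MODEL_N_* and MCP_SERVER_N_* naming convention shown above. A minimal, hypothetical sketch of that idea (not the repository's run/config.py):

```python
import os

def discover_numbered(prefix: str, fields: tuple) -> list:
    """Collect PREFIX_<n>_<FIELD> variables until a consecutive number is missing."""
    entries, n = [], 1
    while f"{prefix}_{n}_{fields[0]}" in os.environ:
        entries.append({field.lower(): os.environ.get(f"{prefix}_{n}_{field}") for field in fields})
        n += 1
    return entries

models = discover_numbered("MODEL", ("URL", "API_TOKEN", "MODEL", "MAX_TOKENS", "TLS_VERIFY"))
mcp_servers = discover_numbered("MCP_SERVER", ("ID", "URI"))
print(f"Discovered {len(models)} model(s) and {len(mcp_servers)} MCP server(s)")
```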
Create test cases in CSV format with the following structure:
question,expected_answer,tool_name,tool_parameters,evaluation_criteria,category
"Calculate penalty for 15 days late delivery","1050 total penalty. Base: 1500, capped at 1000. Interest: 50.",calc_penalty,"{""days_late"": 15}","Correct tool, accurate calculation, mentions cap",Penalty Calculations
"Check tax for 40000 income","7140 total tax. Bracket 1: 1000. Bracket 2: 6000. Surcharge: 140.",calc_tax,"{""income"": 40000}","Progressive calculation, surcharge applied",Tax Calculations
The framework includes pre-built test cases for:
- Penalty Calculations: Late payment and delivery penalties with caps and interest
- Tax Calculations: Progressive tax brackets with surcharges
- Voting Validations: Meeting quorum and threshold validations
- Waterfall Distributions: Financial distribution calculations
- Housing Grant Eligibility: Multi-criteria eligibility assessments
- scratch/compatibility-full.csv - Complete test suite (21 test cases)
- scratch/compatibility.csv - Subset for quick testing
NOTE: The default output directory is evaluation_results.
# Run evaluation with default settings
./evaluate.sh
# Run with specific parameters 1
./evaluate.sh run -c scratch/compatibility-full.csv -m llama-3-1-8b-w4a16 -v
# Run with specific parameters 2
./evaluate.sh --csv scratch/compatibility-full.csv --url https://lsd-route-eligibility-mcp-llamastack.apps.acme.com --model llama-3-1-8b-w4a16

# Using uv (recommended)
uv run -m evaluate scratch/compatibility-full.csv \
--model "llama-3-1-8b-w4a16" \
--stack-url "http://localhost:8080" \
--tools "mcp::compatibility" \
--output "results/my_evaluation.json" \
  --verbose

- -c, --csv FILE: Test case CSV file path
- -m, --model MODEL: LLM model identifier
- -t, --tools TOOLS: Space-separated tool groups
- -u, --url URL: Llama Stack server URL
- -o, --output FILE: Results output file
- -v, --verbose: Enable detailed logging
The framework uses custom DeepEval metrics to assess agent performance:
- Tool Selection Accuracy: Did the agent select the correct tool?
- Parameter Accuracy: Were the tool parameters extracted correctly?
- Response Accuracy: Does the response match expected outcomes?
- Comprehensive Evaluation: Combined semantic and structural analysis
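Each of these is implemented as a custom DeepEval metric in evaluate/metrics.py. Purely as a hypothetical sketch of the general pattern (assuming DeepEval's BaseMetric interface; the constructor parameters and the tool-lookup callable below are illustrative, not the repository's API):

```python
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class ToolSelectionSketch(BaseMetric):
    """Illustrative only: scores 1.0 when the agent called the expected tool."""

    def __init__(self, expected_tool: str, get_called_tool, threshold: float = 1.0):
        self.expected_tool = expected_tool
        self.get_called_tool = get_called_tool  # callable: test case -> tool name the agent invoked
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        called = self.get_called_tool(test_case)
        self.score = 1.0 if called == self.expected_tool else 0.0
        self.success = self.score >= self.threshold
        self.reason = f"expected '{self.expected_tool}', agent called '{called}'"
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "ToolSelectionSketch"
```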
These core metrics have predefined thresholds that you can change; they determine whether a metric result is considered successful:
metrics = [
ToolSelectionMetric(agent_wrapper=self.agent_wrapper, threshold=1.0), # 100%
ParameterAccuracyMetric(agent_wrapper=self.agent_wrapper, threshold=0.9), # 90%
ResponseAccuracyMetric(agent_wrapper=self.agent_wrapper, threshold=0.7), # 70%
EvaluationCriteriaMetric(agent_wrapper=self.agent_wrapper, threshold=0.7), # 70%
ComprehensiveEvaluationMetric(agent_wrapper=self.agent_wrapper, threshold=0.7) # 70%
]

Evaluation results include:
- Overall Statistics: Pass rates, average scores, timing metrics
- Per-Test Details: Individual scores, tool calls, response analysis
- Category Analysis: Performance breakdown by test type
- Error Analysis: Common failure patterns and issues
{
"summary": {
"total_test_cases": 21,
"successful_evaluations": 21,
"failed_evaluations": 0,
"success_rate": 1.0
},
"metric_averages": {
"ToolSelectionMetric": {
"average_score": 1.0,
"success_rate": 1.0
},
"ParameterAccuracyMetric": {
"average_score": 0.9761904761904762,
"success_rate": 1.0
},
"ResponseAccuracyMetric": {
"average_score": 0.955391351943076,
"success_rate": 0.9047619047619048
},
"EvaluationCriteriaMetric": {
"average_score": 0.8026984126984128,
"success_rate": 0.9047619047619048
},
"ComprehensiveEvaluationMetric": {
"average_score": 0.9750136836343734,
"success_rate": 1.0
}
},
"category_results": {
"Eligibility Calculations": [
{
"test_case_index": 0,
"input": "My mother had an accident and she's at the hospital. I have to take care of her, can I get access to the unpaid leave aid?",
"expected_output": "The unpaid leave aid is potentially eligible for you, as you are taking care of your mother who had an accident and is hospitalized. The monthly benefit would be 725 euros..",
"actual_output": "You are eligible for the unpaid leave aid. You will receive a monthly benefit of 725β¬. The person must have been hospitalized and the care of the person must be continued.",
"original_test_case": {
"question": "My mother had an accident and she's at the hospital. I have to take care of her, can I get access to the unpaid leave aid?",
"expected_answer": "The unpaid leave aid is potentially eligible for you, as you are taking care of your mother who had an accident and is hospitalized. The monthly benefit would be 725 euros..",
"tool_name": "evaluate_unpaid_leave_eligibility",
"tool_parameters": {
"is_single_parent": "false",
"relationship": "son|daughter",
"situation": "accident|injury|illness",
"total_children_after": "0"
},
"evaluation_criteria": "correct tool selection, accurate calculation, mentions hospitalized|hospital, case A",
"category": "Eligibility Calculations"
},
"metric_results": {
"ToolSelectionMetric": {
"score": 1.0,
"success": true,
"reason": "Correctly selected tool: evaluate_unpaid_leave_eligibility",
"strict_mode": false
},
"ParameterAccuracyMetric": {
"score": 1.0,
"success": true,
"reason": "4/4 parameters correct",
"strict_mode": false
},
"ResponseAccuracyMetric": {
"score": 1.0,
"success": true,
"reason": "Status matches; Main amount accuracy: 1.00; Warning presence matches",
"strict_mode": false
},
"EvaluationCriteriaMetric": {
"score": 0.75,
"success": true,
"reason": "β correct tool selection: Tool selection handled by other metric | β accurate calculation: Calculation accuracy: 1/1 numbers match | β mentions hospitalized|hospital: Found 'hospitalized' from ORed values: hospitalized|hospital | β case A: Missing case A reference",
"strict_mode": false
},
"ComprehensiveEvaluationMetric": {
"score": 1.0,
"success": true,
"reason": "Tool Selection (30.0%): 1.00 - Correctly selected tool: evaluate_unpaid_leave_eligibility | Parameter Accuracy (30.0%): 1.00 - 4/4 parameters correct | Response Accuracy (40.0%): 1.00 - Status matches; Main amount accuracy: 1.00; Warning presence matches | Weighted Score: 1.000",
"strict_mode": false
}
}
},
...
],
"configuration": {
"model_id": "llama-3-1-8b-w4a16",
"tool_groups": [
"mcp::compatibility-engine",
"mcp::eligibility-engine"
],
"stack_url": "https://eligibility-lsd-route-eligibility-mcp-llamastack.apps.acme.com"
}
}

# Create comprehensive dashboard (using uv)
uv run -m visualize evaluation_results/evaluation_results_TIMESTAMP.json
# Create specific chart types
uv run -m visualize visualize results.json --type summary
# Upload to DeepEval cloud dashboard
uv run -m visualize dashboard results.json --login
# Open the latest dashboards automatically (it will find the latest results in `evaluation_results` folder)
./visualize.sh

- Performance Overview: Success rates, score distributions
- Category Analysis: Performance by test category
- Timeline Charts: Response times and processing metrics
- Comparison Plots: Multi-model performance comparison
- Interactive Dashboards: HTML reports with drill-down capabilities
- Interactive Plots: Plotly-based charts with zoom, filter, and export
- Performance Tables: Sortable results with detailed breakdowns
- Error Analysis: Failure pattern identification and categorization
- Export Options: PNG, SVG, PDF, and raw data downloads
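The charts are built from the results JSON described above. As a small standalone example (using plotly.express directly rather than the visualize module), the per-metric averages can be plotted like this:

```python
import json
import plotly.express as px

# Replace TIMESTAMP with the timestamp of your actual run
with open("evaluation_results/evaluation_results_TIMESTAMP.json") as f:
    results = json.load(f)

metrics = results["metric_averages"]
fig = px.bar(
    x=list(metrics.keys()),
    y=[m["average_score"] for m in metrics.values()],
    labels={"x": "Metric", "y": "Average score"},
    title="Average score per metric",
)
fig.write_html("metric_averages.html")  # open in a browser
```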
llama-stack-sandbox/
├── 📦 Core Framework
├── README.md                    # This file
├── requirements.txt             # Python dependencies
├── pyproject.toml               # Modern Python config
├── .env                         # Environment variables (create this)
│
├── 🐳 Infrastructure (Container & Stack Management)
├── run.sh                       # Llama Stack Distribution launcher
├── playground.sh                # Interactive testing environment
├── run/
│   ├── __init__.py
│   ├── __main__.py              # Entry point: uv run -m run
│   ├── config.py                # Configuration parsing
│   └── yaml_generator.py        # Dynamic run.yaml generation
├── templates/
│   └── run.yaml.template        # Jinja2 template
├── run.yaml                     # Generated config (auto-created)
│
├── 🧪 Evaluation Framework
├── evaluate.sh                  # Evaluation runner script
├── evaluate/
│   ├── __init__.py
│   ├── __main__.py              # Entry point: uv run -m evaluate
│   ├── evaluator.py             # Main evaluation orchestrator
│   ├── metrics.py               # Custom DeepEval metrics (1000+ lines)
│   ├── wrapper.py               # Agent wrapper and session management
│   ├── config.py                # Evaluation configuration
│   ├── loader.py                # CSV test case loading utilities
│   └── examples.py              # Example usage and demonstrations
│
├── 📊 Visualization & Reporting
├── visualize.sh                 # Quick dashboard opener
├── visualize/
│   ├── __init__.py
│   ├── __main__.py              # Entry point: uv run -m visualize
│   ├── results.py               # Chart and dashboard generator
│   └── dashboard.py             # DeepEval cloud dashboard integration
│
├── 🗄️ Data & Configuration
├── scratch/
│   ├── compatibility-full.csv   # Complete test suite (21 cases)
│   ├── compatibility.csv        # Quick test subset
│   └── compatibility-errors.csv # Error scenarios
├── configs/
│   └── sample_evaluation_config.yaml
├── docs/                        # Knowledge base documents
│   ├── LyFin-Compliance-Annex.md
│   └── ...
│
├── 📁 Results & Outputs (auto-created)
├── evaluation_results/          # JSON results with timestamps
│   └── evaluation_results_YYYYMMDD_HHMMSS.json
├── logs/                        # Detailed execution logs
├── reports/                     # Generated analysis reports
└── visualizations/              # HTML dashboards and charts
    ├── comprehensive_dashboard.html
    ├── detailed_analysis.html
    └── summary_dashboard.html
The run.yaml file defines:
- Model Providers: Remote VLLM endpoints, embedding models
- Tool Runtime: MCP server connections, search providers
- Storage: Vector databases, agent persistence, telemetry
- APIs: Enabled Llama Stack APIs (agents, inference, eval, etc.)
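Under the hood, run.yaml is rendered from templates/run.yaml.template, a Jinja2 template. A rough, hypothetical sketch of that rendering step (the actual logic lives in run/yaml_generator.py, and the template variable names below are illustrative):

```python
from pathlib import Path
from jinja2 import Template

# In the real flow these lists come from the .env auto-discovery step.
models = [{"model": "llama-3-1-8b-w4a16", "url": "https://your-llama-model-endpoint.com/v1"}]
mcp_servers = [{"id": "mcp::eligibility", "uri": "http://localhost:8001/sse"}]

template = Template(Path("templates/run.yaml.template").read_text())
Path("run.yaml").write_text(template.render(models=models, mcp_servers=mcp_servers))
```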
Generated dynamically from .env using:
uv run -m run

Configure evaluation behavior in configs/sample_evaluation_config.yaml:
model_config:
default_model: "llama-3-1-8b-w4a16"
tool_groups: ["mcp::compatibility"]
evaluation_settings:
metrics: ["tool_selection", "parameter_accuracy", "response_accuracy"]
thresholds:
tool_selection: 1.0
parameter_accuracy: 0.95
response_accuracy: 0.8
output_settings:
save_intermediate: true
generate_visualizations: true
  create_html_dashboard: true
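Since this is plain YAML, the settings can also be read programmatically, e.g. to reuse the thresholds elsewhere (a small sketch assuming PyYAML is installed):

```python
import yaml

with open("configs/sample_evaluation_config.yaml") as f:
    config = yaml.safe_load(f)

print(config["model_config"]["default_model"])      # llama-3-1-8b-w4a16
print(config["evaluation_settings"]["thresholds"])  # per-metric pass/fail thresholds
```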
- Setup Environment:

  # Install dependencies with uv (recommended)
  uv sync

  # Configure your models and MCP servers in .env
  cp .env.example .env
  # Edit .env with your configurations

- Start Llama Stack:

  ./run.sh

- Run Quick Test:

  ./evaluate.sh test

- Execute Full Evaluation:

  ./evaluate.sh run -c scratch/compatibility-full.csv -v

- View Results:

  # Generate and open dashboard
  ./visualize.sh
# Run evaluations with different models
./evaluate.sh run -m llama-3-1-8b-w4a16 -o results_llama.json
./evaluate.sh run -m granite-3-3-8b -o results_granite.json
# Compare results using the visualization module
uv run -m visualize visualize results_llama.json --type detailed
uv run -m visualize visualize results_granite.json --type detailed
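Beyond the dashboards, the two result files can also be compared directly in a few lines of Python, since both share the metric_averages structure shown earlier (a hedged sketch; file names follow the commands above):

```python
import json

def metric_averages(path: str) -> dict:
    """Extract the average score per metric from an evaluation results file."""
    with open(path) as f:
        return {name: m["average_score"] for name, m in json.load(f)["metric_averages"].items()}

llama = metric_averages("results_llama.json")
granite = metric_averages("results_granite.json")
for name in llama:
    print(f"{name:32s} llama={llama[name]:.3f}  granite={granite.get(name, float('nan')):.3f}")
```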
- Create CSV File:

  question,expected_answer,tool_name,tool_parameters,evaluation_criteria,category
  "Your test question","Expected response","tool_name","{\"param\":\"value\"}","Criteria for success","Custom Category"

- Run Evaluation:

  ./evaluate.sh run -c your_tests.csv

- Analyze Results:

  uv run -m visualize your_results.json
import asyncio

# Import the evaluation framework
from evaluate.evaluator import LlamaStackEvaluator
from evaluate.metrics import ToolSelectionMetric
from evaluate.loader import CSVTestCaseLoader

async def main():
    # Initialize evaluator
    evaluator = LlamaStackEvaluator(
        stack_url="http://localhost:8080",
        model_id="llama-3-1-8b-w4a16",
        tool_groups=["mcp::compatibility"]
    )

    # Run evaluation (run_evaluation is a coroutine, hence the await)
    results = await evaluator.run_evaluation(
        csv_file_path="scratch/compatibility-full.csv",
        output_file="my_results.json",
        verbose=True
    )

asyncio.run(main())

# Automated evaluation pipeline
./evaluate.sh validate -c tests/regression_tests.csv
./evaluate.sh run -c tests/regression_tests.csv --output results/ci_results.json
uv run -m visualize visualize results/ci_results.json --type summary

- Llama Stack Connection Failed:

  # Check if container is running
  podman ps | grep llama-stack

  # Check logs
  podman logs llama-stack

  # Test connectivity
  ./evaluate.sh test

- Model Authentication Errors:
  - Verify API tokens in the .env file
  - Check model endpoint URLs are accessible
  - Ensure TLS settings match your endpoints

- MCP Server Connection Issues:
  - Confirm MCP servers are running on specified ports
  - Check firewall settings for localhost connections
  - Verify MCP server URIs in .env

- Python Module Import Errors:

  # Ensure proper installation
  uv sync

  # Check if modules are accessible
  uv run python -c "from evaluate import evaluator; print('OK')"

- Evaluation Failures:

  # Run with verbose logging
  ./evaluate.sh run -v

  # Check specific module
  uv run -m evaluate --help

  # Validate test CSV format
  ./evaluate.sh validate -c your_test_file.csv
- Separated Concerns: Infrastructure (run/), Evaluation (evaluate/), Visualization (visualize/)
- Python Packages: Each component is now a proper Python package with __main__.py entry points
- UV Integration: Full support for modern Python dependency management with UV
- Simplified Entry Points: uv run -m evaluate, uv run -m visualize, uv run -m run
- Enhanced Shell Scripts: Updated evaluate.sh, visualize.sh, and run.sh with better error handling
- Backward Compatibility: Old usage patterns still supported during transition
- Clear Directory Structure: Related files grouped logically
- Legacy Preservation: Old files moved to the old/ directory for reference
- Auto-Generated Outputs: Results and visualizations organized systematically
- Llama Stack Documentation: Official Llama Stack Docs
- DeepEval Framework: DeepEval Documentation
- Model Context Protocol: MCP Specification
- UV Package Manager: UV Documentation
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Update documentation
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
Happy Evaluating!
For questions or issues, please check the troubleshooting section or create an issue in the repository.