Portable evaluation bundles for agents and agent-shaped workflows: bounded, reproducible, regression-aware proof surfaces for quality claims.
Updated May 22, 2026 - Python
Production-path evals for AI agent behavior: persona drift, safety boundaries, and evidence discipline.
Production-minded LLM eval harness for safety, reliability, cost, and latency analysis.