Visualize LLM outputs against datasets, manually annotate results, and run automated evaluations to algorithmically optimize prompts.
An implementation of Anthropic's paper and essay "A statistical approach to model evaluations".
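The essay's central recommendation is to treat an eval score as a sample statistic and report its uncertainty rather than a bare accuracy number. As a minimal sketch of that idea (not this repository's actual API; the `summarize_eval` helper and the toy scores are invented for illustration), the per-question standard error and a CLT-based 95% confidence interval can be computed like this:

```python
import math

def summarize_eval(scores: list[float]) -> dict[str, float]:
    """Mean eval score with a CLT-based 95% confidence interval.

    Assumes `scores` holds one grade per independent eval question
    (e.g. 0.0 or 1.0 for incorrect/correct).
    """
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(variance / n)  # standard error of the mean
    return {
        "mean": mean,
        "sem": sem,
        "ci95_low": mean - 1.96 * sem,
        "ci95_high": mean + 1.96 * sem,
    }

# Toy example: 8 graded questions
print(summarize_eval([1, 0, 1, 1, 0, 1, 1, 1]))
```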
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
A framework for evaluating large language models (LLMs) across a variety of tasks.
Squeeze your model with pressure prompts to see if its behavior leaks.
Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
Evaluation patterns, release gates, and anti-hallucination techniques for developer-focused AI workflows.
Evaluate LLM responses and measure their accuracy.
Local-first LLM evaluation runner with baselines, caching, markdown reports, and CI-friendly quality, latency, and cost gates.
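For illustration only, a CI quality/latency/cost gate of the kind this runner describes can be as simple as comparing run metrics against thresholds and exiting nonzero on failure; the metric names and thresholds below are hypothetical, not the tool's real configuration:

```python
import sys

# Hypothetical gate thresholds; a real runner would read these from its config.
GATES = {"quality_min": 0.85, "latency_p95_max_s": 2.0, "cost_max_usd": 0.50}

def check_gates(metrics: dict[str, float]) -> list[str]:
    """Return human-readable descriptions of any gate failures."""
    failures = []
    if metrics["quality"] < GATES["quality_min"]:
        failures.append(f"quality {metrics['quality']:.2f} < {GATES['quality_min']}")
    if metrics["latency_p95_s"] > GATES["latency_p95_max_s"]:
        failures.append(f"p95 latency {metrics['latency_p95_s']:.2f}s > {GATES['latency_p95_max_s']}s")
    if metrics["cost_usd"] > GATES["cost_max_usd"]:
        failures.append(f"cost ${metrics['cost_usd']:.2f} > ${GATES['cost_max_usd']}")
    return failures

if __name__ == "__main__":
    run = {"quality": 0.88, "latency_p95_s": 1.4, "cost_usd": 0.62}  # example run metrics
    problems = check_gates(run)
    for p in problems:
        print(f"GATE FAILED: {p}")
    sys.exit(1 if problems else 0)  # nonzero exit fails the CI job
```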
A practical workbench for prompt, model, and mocked workflow evaluation with repeatable benchmarks, structured graders, and agent episode traces.
Behavioral evaluation framework for sentience-, emotion-, and welfare-related AI claims, with anti-sandbagging analysis.
Agentic research pipeline with local retrieval, structured evaluation, conditional revision, and traceable outputs using Groq.
Evaluation and reliability harness for agentic LLM systems, with task success, latency, cost, retries, fallback routing, and failure taxonomy.
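A rough sketch of the retry-and-fallback routing such a harness implies is shown below; the model names and the `call_model` stub are invented for illustration, not the project's actual interface:

```python
import random
import time

# Hypothetical routing order; a real harness would load this from config.
FALLBACK_CHAIN = ["primary-model", "secondary-model", "small-cheap-model"]

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real model API call: randomly fails to simulate outages."""
    if random.random() < 0.5:
        raise TimeoutError(f"{model} timed out")
    return f"{model} answered: ..."

def call_with_fallback(prompt: str, retries_per_model: int = 2) -> tuple[str, str]:
    """Try each model in the chain, retrying transient failures with backoff.

    Returns (model_used, response); raises if every route fails.
    """
    last_error: Exception | None = None
    for model in FALLBACK_CHAIN:
        for attempt in range(retries_per_model):
            try:
                return model, call_model(model, prompt)
            except TimeoutError as err:
                last_error = err
                time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"all routes failed, last error: {last_error}")

print(call_with_fallback("Summarize the incident report."))
```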
Lightweight CI-native regression and behavior-aware evaluation toolkit for black-box agent workflows.
Reproducible evaluation suite for LLM behavior research: epistemic pathology, delegated introspection, and temporal consciousness diagnostics
Horizon-Eval: evaluation-integrity framework for portable long-horizon agent benchmarks, with QA gates, trajectory auditing, replayable run bundles, and safety-gap analysis.
CLI release gate for structured AI changes.
Germany-based AI/backend engineer. Shipping production-quality loops, build logs, and practical engineering notes.
Browser-based LLM evaluation tool. Test prompts and models on your own data with multi-model judge council, cost tracking, and ranked results.
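One simple way a multi-model judge council can produce ranked results is to average each judge's score per candidate and sort; the sketch below uses made-up judge and prompt names and is not this tool's actual implementation:

```python
from statistics import mean

def rank_candidates(judge_scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Aggregate per-judge scores into a ranked list.

    `judge_scores` maps candidate (prompt or model) -> judge name -> score;
    averaging across judges is one straightforward way a council can vote.
    """
    averaged = {cand: mean(scores.values()) for cand, scores in judge_scores.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_candidates({
    "prompt-a": {"judge-1": 0.9, "judge-2": 0.7},
    "prompt-b": {"judge-1": 0.6, "judge-2": 0.8},
})
print(ranked)  # prompt-a ranks first (~0.80), then prompt-b (~0.70)
```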