Visualize LLM outputs against datasets, manually annotate results, and run automated evaluations to algorithmically optimize prompts.
An implementation of Anthropic's paper and essay "A statistical approach to model evaluations".
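The essay's central recommendation is to treat an eval score as a sample statistic and report its uncertainty rather than a bare accuracy number. As a minimal sketch of that idea (not this repository's actual API; the `summarize_eval` helper and the toy scores are invented for illustration), the per-question standard error and a CLT-based 95% confidence interval can be computed like this:

```python
import math

def summarize_eval(scores: list[float]) -> dict[str, float]:
    """Mean eval score with a CLT-based 95% confidence interval.

    Assumes `scores` holds one grade per independent eval question
    (e.g. 0.0 or 1.0 for incorrect/correct).
    """
    n = len(scores)
    mean = sum(scores) / n
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    sem = math.sqrt(variance / n)  # standard error of the mean
    return {
        "mean": mean,
        "sem": sem,
        "ci95_low": mean - 1.96 * sem,
        "ci95_high": mean + 1.96 * sem,
    }

# Toy example: 8 graded questions
print(summarize_eval([1, 0, 1, 1, 0, 1, 1, 1]))
```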
Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.
A framework for evaluating large language models (LLMs) across a variety of tasks.
Squeeze your model with pressure prompts to see if its behavior leaks.
Detecting Relational Boundary Erosion in AI systems. A framework for testing whether models maintain honest, calibrated, and appropriate boundaries.
Evaluation patterns, release gates, and anti-hallucination techniques for developer-focused AI workflows.
Evaluate LLM responses and measure their accuracy.
Local-first LLM evaluation runner with baselines, caching, markdown reports, and CI-friendly quality, latency, and cost gates.
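For illustration only, a CI quality/latency/cost gate of the kind this runner describes can be as simple as comparing run metrics against thresholds and exiting nonzero on failure; the metric names and thresholds below are hypothetical, not the tool's real configuration:

```python
import sys

# Hypothetical gate thresholds; a real runner would read these from its config.
GATES = {"quality_min": 0.85, "latency_p95_max_s": 2.0, "cost_max_usd": 0.50}

def check_gates(metrics: dict[str, float]) -> list[str]:
    """Return human-readable descriptions of any gate failures."""
    failures = []
    if metrics["quality"] < GATES["quality_min"]:
        failures.append(f"quality {metrics['quality']:.2f} < {GATES['quality_min']}")
    if metrics["latency_p95_s"] > GATES["latency_p95_max_s"]:
        failures.append(f"p95 latency {metrics['latency_p95_s']:.2f}s > {GATES['latency_p95_max_s']}s")
    if metrics["cost_usd"] > GATES["cost_max_usd"]:
        failures.append(f"cost ${metrics['cost_usd']:.2f} > ${GATES['cost_max_usd']}")
    return failures

if __name__ == "__main__":
    run = {"quality": 0.88, "latency_p95_s": 1.4, "cost_usd": 0.62}  # example run metrics
    problems = check_gates(run)
    for p in problems:
        print(f"GATE FAILED: {p}")
    sys.exit(1 if problems else 0)  # nonzero exit fails the CI job
```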
A practical workbench for prompt, model, and mocked workflow evaluation with repeatable benchmarks, structured graders, and agent episode traces.
Behavioral evaluation framework for sentience-, emotion-, and welfare-related AI claims, with anti-sandbagging analysis.
Agentic research pipeline with local retrieval, structured evaluation, conditional revision, and traceable outputs using Groq.
Evaluation and reliability harness for agentic LLM systems, with task success, latency, cost, retries, fallback routing, and failure taxonomy.
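A rough sketch of the retry-and-fallback routing such a harness implies is shown below; the model names and the `call_model` stub are invented for illustration, not the project's actual interface:

```python
import random
import time

# Hypothetical routing order; a real harness would load this from config.
FALLBACK_CHAIN = ["primary-model", "secondary-model", "small-cheap-model"]

def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real model API call: randomly fails to simulate outages."""
    if random.random() < 0.5:
        raise TimeoutError(f"{model} timed out")
    return f"{model} answered: ..."

def call_with_fallback(prompt: str, retries_per_model: int = 2) -> tuple[str, str]:
    """Try each model in the chain, retrying transient failures with backoff.

    Returns (model_used, response); raises if every route fails.
    """
    last_error: Exception | None = None
    for model in FALLBACK_CHAIN:
        for attempt in range(retries_per_model):
            try:
                return model, call_model(model, prompt)
            except TimeoutError as err:
                last_error = err
                time.sleep(2 ** attempt)  # simple exponential backoff
    raise RuntimeError(f"all routes failed, last error: {last_error}")

print(call_with_fallback("Summarize the incident report."))
```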
Lightweight CI-native regression and behavior-aware evaluation toolkit for black-box agent workflows.
Reproducible evaluation suite for LLM behavior research: epistemic pathology, delegated introspection, and temporal consciousness diagnostics
Horizon-Eval: evaluation-integrity framework for portable long-horizon agent benchmarks, with QA gates, trajectory auditing, replayable run bundles, and safety-gap analysis.
CLI release gate for structured AI changes.
Germany-based AI/backend engineer. Shipping production-quality loops, build logs, and practical engineering notes.
Browser-based LLM evaluation tool. Test prompts and models on your own data with multi-model judge council, cost tracking, and ranked results.
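One simple way a multi-model judge council can produce ranked results is to average each judge's score per candidate and sort; the sketch below uses made-up judge and prompt names and is not this tool's actual implementation:

```python
from statistics import mean

def rank_candidates(judge_scores: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Aggregate per-judge scores into a ranked list.

    `judge_scores` maps candidate (prompt or model) -> judge name -> score;
    averaging across judges is one straightforward way a council can vote.
    """
    averaged = {cand: mean(scores.values()) for cand, scores in judge_scores.items()}
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

ranked = rank_candidates({
    "prompt-a": {"judge-1": 0.9, "judge-2": 0.7},
    "prompt-b": {"judge-1": 0.6, "judge-2": 0.8},
})
print(ranked)  # prompt-a ranks first (~0.80), then prompt-b (~0.70)
```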