
Memory Evaluation

AgentEval.Memory — the comprehensive .NET toolkit for evaluating an AI agent's memory: retention, recall depth, temporal reasoning, fact updates, cross-session persistence, and resistance to noise.

What LongMemEval (ICLR 2025) does for Python research, AgentEval.Memory does for production .NET — plus a curated benchmark suite, an HTML reporting engine, baseline tracking, and a fluent assertion API.

Status: ✅ Available in the current release. Ships in the AgentEval umbrella package as the AgentEval.Memory module.


Why Memory Evaluation Matters

Modern AI agents promise to remember — across turns, across sessions, across days. But conversation history grows, context windows fill, reducers/compactors compress, and AIContextProvider implementations vary. The hard questions are:

  • Does the agent still recall a fact mentioned 30 turns ago?
  • Does it remember what was said yesterday after the session was reset?
  • Does it abstain when it doesn't know — or hallucinate?
  • Does it correctly resolve conflicting facts told minutes apart?
  • Does it survive context-window compaction without losing critical state?

AgentEval.Memory answers these — quantitatively, repeatedly, in CI.


What Ships

🧠 Five Memory Metrics

| Metric | What it measures |
| --- | --- |
| MemoryRetentionMetric | Can the agent recall a planted fact after intervening turns? |
| MemoryReachBackMetric | How far back can it reach as conversation depth grows? |
| MemoryTemporalMetric | Does it reason about when events happened, not just what? |
| MemoryNoiseResilienceMetric | Does it stay accurate when surrounded by distractor / red-herring turns? |
| MemoryReducerFidelityMetric | Does it survive context compaction / summarization without losing facts? |

All metrics use an LLM judge (MemoryJudge) with type-specific prompts (synthesis, counterfactual, correction-chain, specificity-attack, …) calibrated to match the LongMemEval methodology.
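
To exercise a single metric outside the presets, the shape is roughly the sketch below. This is a minimal sketch only: the MemoryJudge constructor argument, the MemoryScenario helper, and EvaluateAsync are assumptions about the API surface, not documented signatures.

// Sketch only: constructor, scenario helper, and method names are assumptions.
var judge  = new MemoryJudge(chatClient);        // the LLM judge described above
var metric = new MemoryRetentionMetric(judge);

// Hypothetical scenario: plant a fact, pad with intervening turns, probe recall.
var scenario = MemoryScenario.Create(
    fact: "the deploy freeze starts Friday",
    interveningTurns: 30,
    probe: "when does the deploy freeze start?");

var score = await metric.EvaluateAsync(agent, scenario);
Console.WriteLine($"Retention: {score:F2}");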

🏆 Curated Benchmark Suite

Run a one-line benchmark and get a weighted score across multiple categories:

var runner = MemoryBenchmarkRunner.Create(chatClient);
var agent = chatClient.AsEvaluableAgent(name: "MemoryAgent", includeHistory: true);

var result = await runner.RunBenchmarkAsync(agent, MemoryBenchmark.Standard);
Console.WriteLine($"Memory score: {result.OverallScore:F1}% ({result.Grade})");

Three tiers plus two diagnostic presets:

| Preset | Categories | Use for |
| --- | --- | --- |
| MemoryBenchmark.Quick | 3 (retention, temporal, noise) | Fast CI feedback |
| MemoryBenchmark.Standard | 8 (adds reach-back, fact-update, multi-topic, abstention, preference) | Daily quality gate |
| MemoryBenchmark.Full | 12 (adds cross-session, reducer fidelity, conflict resolution, multi-session reasoning) | Pre-release validation |
| MemoryBenchmark.Diagnostic | 12 + ~50K-token context pressure | Deep limits analysis |
| MemoryBenchmark.Overflow | 8 + 192K-token haystacks | Long-context stress |
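
In CI, the Quick preset typically sits behind a score gate. A minimal sketch reusing the runner from above; the 75-point threshold is an arbitrary example, not a recommended bar:

// Fail the build if the fast preset drops below a chosen threshold.
var quick = await runner.RunBenchmarkAsync(agent, MemoryBenchmark.Quick);
if (quick.OverallScore < 75)  // example threshold: calibrate against your own baseline
    throw new InvalidOperationException(
        $"Memory gate failed: {quick.OverallScore:F1}% ({quick.Grade})");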

📊 HTML Reporting Engine — Pentagon Comparison

The most spectacular piece. Run your benchmark, save a baseline, then visually compare configurations and models with overlaid pentagon charts:

var store = new JsonFileBaselineStore();
await store.SaveAsync(result.ToBaseline(label: "GPT-4o-mini"));

// Generate an interactive HTML report comparing baselines
await result.ExportHtmlReportAsync("memory-report.html", new MemoryReportingOptions
{
    OverlayBaselines = await store.LoadAllAsync()
});

The report includes:

  • Pentagon overlay charts — see strengths and gaps across categories
  • Per-category scores with grades (A+ → F)
  • Baseline diffs — what changed since the last run
  • Model comparison — overlay GPT-4o-mini, GPT-4o, GPT-4.1 on the same chart
  • Drill-down — failing scenarios, judge explanations, response excerpts

🌍 LongMemEval — First-Class .NET Re-implementation

The biggest research-grade memory benchmark, fully re-implemented in .NET with the official methodology preserved:

var runner = LongMemEvalBenchmarkRunner.Create(chatClient, datasetPath);
var config = new AgentBenchmarkConfig { ConfigurationId = "my-agent", ModelId = "gpt-4o" };

var result = await runner.RunAsync(agent, config, new ExternalBenchmarkOptions
{
    MaxQuestions = 50,
    StratifiedSampling = true,         // proportional across all 6 question types
    PreserveSessionBoundaries = true,  // session markers in history
});

Console.WriteLine($"Score: {result.OverallAccuracy:F1}% (paper: GPT-4o = 57.7%)");

What's preserved from the official benchmark:

  • Stratified sampling across all 6 question types (single-session, multi-session, temporal, knowledge-update, abstention, preference)
  • Type-specific judge prompts matching the official evaluation
  • Session boundary + timestamp preservation in history injection
  • Binary scoring (0/1) comparable to published results
  • 2 LLM calls per question (query + judge) via history injection

Samples G7 (LongMemEval Benchmark) and G10 (LongMemEval Baseline Repro) reproduce the GPT-4o paper baseline.

✍️ Fluent Memory Assertions

result.Should()
    .HaveRetentionAbove(80, because: "agent must recall planted facts")
    .HaveTemporalReasoningAbove(70)
    .HaveNoHallucinations()
    .HavePassedCategory("Cross-Session", because: "long-term memory required");

await agent.CanRememberAsync("My favorite language is C#")
    .Should().BeTrue();

🔌 Production DI Wiring

services.AddAgentEvalAll();          // includes Memory
// or selectively:
services.AddAgentEvalMemory();       // metrics + scenarios + reporting + temporal + evaluators

public class MemoryHealthCheck(IMemoryBenchmarkRunner runner) { /* … */ }
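
A fleshed-out version of that health check might look like the sketch below, assuming ASP.NET Core's IHealthCheck and assuming IMemoryBenchmarkRunner exposes the same RunBenchmarkAsync shape as the concrete runner; both are assumptions to adapt.

using Microsoft.Extensions.Diagnostics.HealthChecks;

// Sketch: IMemoryBenchmarkRunner.RunBenchmarkAsync and the injected agent
// are assumed here; match them to the actual registered services.
public class MemoryHealthCheck(IMemoryBenchmarkRunner runner, IEvaluableAgent agent) : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        var result = await runner.RunBenchmarkAsync(agent, MemoryBenchmark.Quick);
        return result.OverallScore >= 75  // example threshold
            ? HealthCheckResult.Healthy($"Memory score {result.OverallScore:F1}%")
            : HealthCheckResult.Degraded($"Memory score {result.OverallScore:F1}%");
    }
}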

🔗 MAF Pipeline Compatibility

AgentEval.Memory works without modification with MAF 1.3.0's pipeline (ChatHistoryProvider, AIContextProvider, CompactionStrategy). It evaluates behavior, not mechanism. See MAF Memory Integration for the concept-mapping table.


Honest Caveats

We hold ourselves to the same evaluation rigor we ship. A few candid notes:

⚠️ The Native Benchmark Currently Scores High

Our curated Standard benchmark scores roughly 88–93% on GPT-4.1 in our reference runs. Strong models clear it comfortably — which is less useful as a discriminator than we want it to be.

Why: the native scenarios were initially designed to test retrieval (find a fact, return it) more than reasoning (synthesise fragments across sessions, resolve conflicts, infer unstated conclusions). Strong base models retrieve very well.

What we recommend today:

  • Use the native benchmark as a regression gate — track your own delta over time, not the absolute number.
  • Use LongMemEval (Samples G7 / G10) for cross-platform comparable numbers — it's calibrated to the published GPT-4o = 57.7% baseline.
  • Use the Diagnostic and Overflow presets to apply real context pressure.

Current limitation: the native benchmark is better suited to regression tracking than fine-grained model differentiation, especially when stronger models can solve many scenarios through retrieval alone.
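
In practice the regression gate compares against your own stored baseline rather than an absolute bar. A sketch, assuming the loaded baseline exposes Label and OverallScore (property names not confirmed by the API shown above):

// Delta gate sketch: JsonFileBaselineStore / ToBaseline are shown above;
// the Label and OverallScore properties on the loaded baseline are assumptions.
var store = new JsonFileBaselineStore();
var previous = (await store.LoadAllAsync()).LastOrDefault(b => b.Label == "MemoryAgent");

var current = await runner.RunBenchmarkAsync(agent, MemoryBenchmark.Standard);
if (previous is not null && current.OverallScore < previous.OverallScore - 2.0)  // 2-point tolerance, arbitrary
    throw new InvalidOperationException(
        $"Memory regressed: {previous.OverallScore:F1}% -> {current.OverallScore:F1}%");

await store.SaveAsync(current.ToBaseline(label: "MemoryAgent"));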

⚠️ Memory Evaluation Always Calls a Real LLM

The MemoryJudge cannot run in mock mode — there is no shortcut to "did the agent really remember." The G-series samples skip gracefully when credentials are missing.

⚠️ LongMemEval Dataset Is External

The dataset is not redistributed with AgentEval (license / size). Download it from HuggingFace and place it in src/AgentEval.Memory/Data/longmemeval/. Sample G7 prints the download link if the dataset is missing.


Crafting Your Own Memory Evaluation

The benchmark presets are the curated fast path. For domain-specific memory (medical histories, financial preferences, support-ticket continuity, …) build your own scenarios:

  1. Define facts: MemoryFact.Create("user prefers metric units")
  2. Define queries: MemoryQuery.Create("what units should I use?", expectedFact)
  3. Drive a runner: MemoryTestRunner interleaves facts with optional noise turns and asks the queries (assembled in the sketch after this list)
  4. Add a judge variant if your domain needs a custom rubric: MemoryJudge is extensible
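
Assembled, the four steps look roughly like the following. MemoryFact.Create and MemoryQuery.Create are taken from the steps above; the MemoryTestRunner constructor, the RunAsync options, and the outcome properties are assumptions about its shape.

// Domain-specific scenario sketch: runner method and option names are assumed.
var fact  = MemoryFact.Create("user prefers metric units");
var query = MemoryQuery.Create("what units should I use?", fact);

var memoryRunner = new MemoryTestRunner(chatClient);   // constructor shape assumed
var outcome = await memoryRunner.RunAsync(
    agent,
    facts: [fact],
    queries: [query],
    noiseTurns: 10);                                   // optional distractor turns

Console.WriteLine($"Recalled {outcome.PassedCount}/{outcome.TotalCount} queries");  // property names assumed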

See Sample G3 — Memory Scenarios for ReachBackEvaluator + ReducerEvaluator, and Sample G9 — AIContextProvider Memory for MAF-pipeline-native memory.


Samples

| # | Sample | What you'll learn |
| --- | --- | --- |
| G1 | Memory Basics | MemoryJudge, MemoryTestRunner, fluent assertions |
| G2 | Memory Benchmark Demo | Quick / Standard / Full presets with grades |
| G3 | Memory Scenarios | ReachBackEvaluator, ReducerEvaluator |
| G4 | Memory DI | AddAgentEvalMemory(), CanRememberAsync() |
| G5 | Cross-Session Memory | Fact persistence across session resets |
| G6 | Benchmark Reporting | Multi-model HTML pentagon report |
| G7 | LongMemEval Benchmark | Cross-platform research-grade evaluation |
| G8 | Run Single Benchmark | Pick a preset, save a baseline, view report |
| G9 | AIContextProvider Memory | MAF-native pipeline memory |
| G10 | LongMemEval Baseline Repro | Reproduce the GPT-4o paper baseline |

Don't ship memory you can't measure.