Replies: 3 comments
---
This is excellent research! The curated-context vs. flat-log comparison matches what we have seen in production at RevolutionAI (https://revolutionai.io). Our numbers were similar: curated context consistently outperforms naive approaches by 2-3x on accuracy.

One additional thought: would love to see this extended to multi-turn conversations, where context management gets even more critical. Great work!
---
This is huge. The 43% vs 100% gap is dramatic and matches what we see in production.

Why this happens (our hypothesis): attention is not free. Even with a long context window, the model performs soft selection over all tokens, so noise competes for attention weights. More noise means more chances to attend to the wrong information.

Practical implications. The code pattern we use:

```python
def curate_context(full_history: list, max_turns: int = 10) -> list:
    # Keep the system prompt
    system = [m for m in full_history if m.get("role") == "system"]
    # Keep the last N turns
    recent = [m for m in full_history[-max_turns:] if m.get("role") != "system"]
    # Summarize older content if relevant; drop noise explicitly
    # (both steps are application-specific and omitted here)
    curated = system + recent
    return curated
```

Question: did you test intermediate positions, like 50% curation? Curious where the inflection point is.

This should be required reading for anyone building chat/agent systems. We have seen the same pattern at Revolution AI: context quality beats context quantity every time.
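The attention-dilution hypothesis above can be illustrated with a toy softmax calculation. This is a deliberately simplified sketch, not the model's actual attention: one "signal" token with a fixed score competes against a growing number of zero-score distractors, and its share of the probability mass shrinks.

```python
import math

def softmax(scores: list) -> list:
    # Standard softmax: exponentiate, then normalize to sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One signal token (score 2.0) vs. n zero-score distractors.
for n_noise in (0, 10, 100, 1000):
    weights = softmax([2.0] + [0.0] * n_noise)
    # Signal weight falls from 1.0 toward ~0.007 as noise grows.
    print(n_noise, round(weights[0], 3))
```

Even though every distractor individually gets a tiny weight, collectively they drain attention from the signal, which matches the observed degradation on noisy long contexts.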
---
Great empirical validation! 43% → 100% just from context curation is striking.

Why this matters for LlamaIndex users: default retrieval often returns "relevant" chunks that also include noise, and more tokens != better answers. LlamaIndex ships curation strategies for exactly this.

Key insight: we see similar patterns at Revolution AI. Aggressive reranking beats larger context every time. Great experiment!
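The "aggressive reranking beats larger context" point can be sketched generically. This is plain Python with a toy word-overlap scorer standing in for a real cross-encoder reranker; the function names are illustrative, not LlamaIndex API:

```python
def rerank_and_truncate(query: str, chunks: list, score_fn, keep: int = 3) -> list:
    # Score every retrieved chunk against the query, best first,
    # then keep only the top-k: a small, high-precision context.
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:keep]

def word_overlap(query: str, chunk: str) -> int:
    # Toy relevance score; a production system would use a
    # cross-encoder or LLM-based reranker instead.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

docs = [
    "the office coffee machine is broken",
    "Q3 revenue grew 12% year over year",
    "revenue guidance for Q4 was raised",
]
print(rerank_and_truncate("what happened to revenue", docs, word_overlap, keep=2))
```

The design choice is the same one the experiment validates: spend effort shrinking the context to what scores highly, rather than stuffing everything retrieved into the window.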
---
Following up on the design paper posted here earlier — we ran an empirical test of the central claim.
Setup: A 78-turn conversation (~6,400 words) with natural noise accumulation (topic shifts, corrections, abandoned approaches, off-topic tangents) fed to llama3.1 8B via Ollama. 10 fact-retrieval questions, tested under two conditions: full flat-log context vs curated thread-only context.
Result: 43.3% accuracy (full context) vs 100% accuracy (curated context).
The model hallucinated facts, denied information existed in the conversation, and picked up contradicted details from noise. Exactly the failure modes predicted by the paper.
Full results, methodology, and reproducible code: github.com/MikeyBeez/fuzzyOS/discussions/2
The takeaway for retrieval/context systems: a well-curated short context dramatically outperforms a noisy long context, even when the long context is well within the model's window. Context selection dominates context length.
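The two conditions above can be sketched as follows. This is a hypothetical reconstruction using an assumed turn format (`role`/`text`/`thread` keys); the actual harness lives in the linked repository:

```python
def flat_context(turns: list) -> str:
    # Condition A: the entire conversation log, noise and all.
    return "\n".join(f"{t['role']}: {t['text']}" for t in turns)

def curated_context(turns: list, thread: str) -> str:
    # Condition B: only turns tagged as belonging to the active thread.
    kept = [t for t in turns if t["thread"] == thread]
    return "\n".join(f"{t['role']}: {t['text']}" for t in kept)

turns = [
    {"role": "user", "text": "The deploy key is in vault slot 7.", "thread": "deploy"},
    {"role": "user", "text": "Unrelated: lunch at noon?", "thread": "chat"},
    {"role": "user", "text": "Use slot 7 for staging too.", "thread": "deploy"},
]
# The curated variant keeps both deploy turns and drops the tangent.
print(curated_context(turns, "deploy"))
```

Under this framing, both conditions see the same facts; only the curated one excludes the competing noise, which is exactly the variable the 43.3% vs 100% comparison isolates.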