Replies: 3 comments
---
This is excellent research! The curated-context vs. flat-log comparison matches what we have seen in production at RevolutionAI (https://revolutionai.io). Our numbers were similar: curated context consistently outperforms naive approaches by 2-3x on accuracy.

One additional thought: would love to see this extended to multi-turn conversations, where context management gets even more critical. Great work!
---
This is huge. The 43% vs 100% gap is dramatic and matches what we see in production.

Why this happens (our hypothesis): attention is not free. Even with a long context window, the model performs soft selection over all tokens, so noise competes for attention weights. More noise means more chances to attend to the wrong information.

Practical implications. The code pattern we use:

```python
def curate_context(full_history: list, max_turns: int = 10) -> list:
    # Keep the system prompt
    system = [m for m in full_history if m.get("role") == "system"]
    # Keep the last N turns
    recent = [m for m in full_history[-max_turns:] if m.get("role") != "system"]
    # Summarize older content if relevant; drop noise explicitly
    # (both steps are application-specific and omitted here)
    curated = system + recent
    return curated
```

Question: did you test intermediate positions, like 50% curation? Curious where the inflection point is.

This should be required reading for anyone building chat/agent systems. We have seen the same pattern at Revolution AI: context quality beats context quantity every time.
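The attention-dilution hypothesis above can be illustrated with a toy softmax calculation. This is a deliberately simplified sketch, not the model's actual attention: one "signal" token with a fixed score competes against a growing number of zero-score distractors, and its share of the probability mass shrinks.

```python
import math

def softmax(scores: list) -> list:
    # Standard softmax: exponentiate, then normalize to sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# One signal token (score 2.0) vs. n zero-score distractors.
for n_noise in (0, 10, 100, 1000):
    weights = softmax([2.0] + [0.0] * n_noise)
    # Signal weight falls from 1.0 toward ~0.007 as noise grows.
    print(n_noise, round(weights[0], 3))
```

Even though every distractor individually gets a tiny weight, collectively they drain attention from the signal, which matches the observed degradation on noisy long contexts.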
---
Great empirical validation! 43% → 100% just from context curation is striking.

Why this matters for LlamaIndex users: default retrieval often returns "relevant" chunks that also include noise, and more tokens != better answers. LlamaIndex ships curation strategies for exactly this.

Key insight: we see similar patterns at Revolution AI. Aggressive reranking beats larger context every time. Great experiment!
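The "aggressive reranking beats larger context" point can be sketched generically. This is plain Python with a toy word-overlap scorer standing in for a real cross-encoder reranker; the function names are illustrative, not LlamaIndex API:

```python
def rerank_and_truncate(query: str, chunks: list, score_fn, keep: int = 3) -> list:
    # Score every retrieved chunk against the query, best first,
    # then keep only the top-k: a small, high-precision context.
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:keep]

def word_overlap(query: str, chunk: str) -> int:
    # Toy relevance score; a production system would use a
    # cross-encoder or LLM-based reranker instead.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

docs = [
    "the office coffee machine is broken",
    "Q3 revenue grew 12% year over year",
    "revenue guidance for Q4 was raised",
]
print(rerank_and_truncate("what happened to revenue", docs, word_overlap, keep=2))
```

The design choice is the same one the experiment validates: spend effort shrinking the context to what scores highly, rather than stuffing everything retrieved into the window.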
---
Following up on the design paper posted here earlier — we ran an empirical test of the central claim.
Setup: A 78-turn conversation (~6,400 words) with natural noise accumulation (topic shifts, corrections, abandoned approaches, off-topic tangents) fed to llama3.1 8B via Ollama. 10 fact-retrieval questions, tested under two conditions: full flat-log context vs curated thread-only context.
Result: 43.3% accuracy (full context) vs 100% accuracy (curated context).
The model hallucinated facts, denied information existed in the conversation, and picked up contradicted details from noise. Exactly the failure modes predicted by the paper.
Full results, methodology, and reproducible code: github.com/MikeyBeez/fuzzyOS/discussions/2
The takeaway for retrieval/context systems: a well-curated short context dramatically outperforms a noisy long context, even when the long context is well within the model's window. Context selection dominates context length.
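The two conditions above can be sketched as follows. This is a hypothetical reconstruction using an assumed turn format (`role`/`text`/`thread` keys); the actual harness lives in the linked repository:

```python
def flat_context(turns: list) -> str:
    # Condition A: the entire conversation log, noise and all.
    return "\n".join(f"{t['role']}: {t['text']}" for t in turns)

def curated_context(turns: list, thread: str) -> str:
    # Condition B: only turns tagged as belonging to the active thread.
    kept = [t for t in turns if t["thread"] == thread]
    return "\n".join(f"{t['role']}: {t['text']}" for t in kept)

turns = [
    {"role": "user", "text": "The deploy key is in vault slot 7.", "thread": "deploy"},
    {"role": "user", "text": "Unrelated: lunch at noon?", "thread": "chat"},
    {"role": "user", "text": "Use slot 7 for staging too.", "thread": "deploy"},
]
# The curated variant keeps both deploy turns and drops the tangent.
print(curated_context(turns, "deploy"))
```

Under this framing, both conditions see the same facts; only the curated one excludes the competing noise, which is exactly the variable the 43.3% vs 100% comparison isolates.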