• Paper Website • Quick Start • Repository Structure • Results • Contact •
ERGO introduces a paradigm shift in handling multi-turn LLM conversations by treating uncertainty as a first-class signal. When large language models get "lost" in extended conversations, ERGO detects these moments through entropy spikes and strategically resets the context, recovering both accuracy and reliability. This repository contains all code necessary to replicate our experiments and evaluate ERGO’s performance across a suite of models and multi-turn generation tasks.
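
For intuition, here is a minimal sketch of the idea in PyTorch. It is illustrative only and not the repository's API: the function names, the consolidation strategy, and the 0.5 default threshold are assumptions. The gist is to track the mean token-level entropy of each response and, when it spikes above a threshold, restart the conversation from a single consolidated prompt.

```python
import torch
import torch.nn.functional as F

def mean_token_entropy(logits: torch.Tensor) -> float:
    """Mean Shannon entropy (in nats) over the generated tokens.
    `logits` has shape [num_generated_tokens, vocab_size]."""
    log_probs = F.log_softmax(logits, dim=-1)
    per_token = -(log_probs.exp() * log_probs).sum(dim=-1)  # entropy of each token's distribution
    return per_token.mean().item()

def maybe_reset(turns: list[str], last_logits: torch.Tensor, threshold: float = 0.5) -> list[str]:
    """If uncertainty spikes past the threshold, collapse the conversation so far
    into one consolidated prompt; otherwise keep the running context."""
    if mean_token_entropy(last_logits) > threshold:
        return ["\n".join(turns)]  # fresh context: restate everything in a single prompt
    return turns
```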

```bash
# Clone the repository
git clone https://github.com/haziq-exe/ERGO.git
cd ERGO
pip install -r requirements.txt
```
To use OpenAI models, set the `OPENAI_KEY` environment variable to your API key (for example, `export OPENAI_KEY=<your-key>` in your shell).
You will also need to download the sharded dataset from Laban et al. (see the Lost in Conversation repository linked at the bottom of this README).

```python
from experiments.runExperiment import RunExperiment

# Initialize experiment with your chosen model
experiment = RunExperiment(
    model_name="HuggingFaceTB/SmolLM-135M-Instruct",
    device="cpu",
    device_map=None,
    max_new_tokens=1000
)

# Run ERGO on GSM8K dataset
experiment.run_GSM8K(
    dataset_path="sharded_dataset.json",  # path to sharded dataset from Laban et al.
    num_Qs=20,
    num_runs=1,
    threshold=0.5,
    output_path="outputs/gsm8k_example.json"
)
```
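
To probe sensitivity to the `threshold` argument, the same call can be swept over a few values; the specific values and output paths below are illustrative, not prescribed:

```python
# Illustrative threshold sweep, reusing the `experiment` object created above
for t in (0.3, 0.5, 0.7):
    experiment.run_GSM8K(
        dataset_path="sharded_dataset.json",
        num_Qs=20,
        num_runs=1,
        threshold=t,
        output_path=f"outputs/gsm8k_threshold_{t}.json",
    )
```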

Run from the root directory:

```bash
python -m main.example_main
```

```
ERGO/
│
├── evaluation/           # Evaluation metrics and scoring
│   ├── evaluator.py
│   ├── utils.py
│   └── eval.bfcl.py      # Taken from Laban et al.
│
├── core/                 # Core ERGO implementation
│   ├── dataset.py
│   ├── model.py
│   └── utils.py
│
├── experiments/          # Experiment runner
│   └── runExperiment.py
│
├── generation/           # Generate with ERGO
│   └── generator.py
│
└── main/                 # Example scripts
    └── example_main.py
```
ERGO has been rigorously tested across five diverse generation tasks:

| Task | Dataset | Description | Metric |
|---|---|---|---|
| Math | GSM8K | Elementary math word problems | Exact Match |
| Code | LiveCodeBench | Python function generation | Test Suite Pass |
| SQL | Spider | Text-to-SQL query generation | Query Accuracy |
| API Calls | BFCL (Berkeley Function-Calling Leaderboard) | Function calling from instructions | Call Validity |
| Data-to-Text | ToTTo | Table caption generation | BLEU Score |

Performance under three settings: FULL (the complete instruction given in a single turn), SHARDED (the instruction revealed over multiple turns), and ERGO. The last column is ERGO's relative improvement over SHARDED.

| Model | FULL | SHARDED | ERGO | Relative Improvement |
|---|---|---|---|---|
| GPT-4o | 79.2 | 51.4 | 74.1 | +44.2% |
| GPT-4.1 | 83.6 | 56.6 | 77.0 | +36.0% |
| GPT-4o-mini | 73.8 | 44.3 | 71.8 | +62.1% |
| Phi-4 | 64.6 | 36.4 | 59.2 | +62.6% |
| LLaMA-3.1-8B | 46.0 | 28.7 | 50.9 | +77.4% |
If you use ERGO in your research, please cite our paper:

```bibtex
@inproceedings{mohammad-khalid-etal-2025-ergo,
    title = "{ERGO}: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models",
    author = "Mohammad Khalid, Haziq and
      Jeyaganthan, Athikash and
      Do, Timothy and
      Fu, Yicheng and
      Sharma, Vasu and
      O{'}Brien, Sean and
      Zhu, Kevin",
    booktitle = "Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.uncertainlp-main.23/",
    pages = "273--286",
    ISBN = "979-8-89176-349-4"
}
```

Lead Author: Haziq Mohammad Khalid
📧 haziqkhalid04@gmail.com
Co-Author: Timothy Do
📧 tim.do.info@gmail.com
- Lost in Conversation (Laban et al.), code accompanying the paper *LLMs Get Lost in Multi-Turn Conversation*: https://github.com/microsoft/lost_in_conversation
