ERGO: Entropy-guided Resetting for Generation Optimization

ERGO is a model-agnostic inference-time framework that helps LLMs recover from context degradation and maintain high performance across multi-turn conversations.


Overview

ERGO introduces a paradigm shift in handling multi-turn LLM conversations by treating uncertainty as a first-class signal. When large language models get "lost" in extended conversations, ERGO detects these moments through entropy spikes and strategically resets the context, recovering both accuracy and reliability. This repository contains all code necessary to replicate our experiments and evaluate ERGO’s performance across a suite of models and multi-turn generation tasks.
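
In code, the trigger can be sketched roughly as follows. This is a minimal illustration of the idea, not the repository's implementation: the helper names and the log-probability format are hypothetical (ERGO's actual logic lives in core/ and generation/), and the default threshold mirrors the Quick Start example below.

import math

def mean_token_entropy(per_token_logprobs):
    """Average Shannon entropy (in nats) of the model's next-token
    distributions over one generated response.

    per_token_logprobs: for each generated token, a list of log-probabilities
    over the candidate next tokens (hypothetical format).
    """
    entropies = [
        -sum(math.exp(lp) * lp for lp in logprobs)
        for logprobs in per_token_logprobs
    ]
    return sum(entropies) / len(entropies)

def should_reset(per_token_logprobs, threshold=0.5):
    # An entropy spike above the threshold signals that the model is "lost";
    # ERGO then resets the context instead of continuing the degraded thread.
    return mean_token_entropy(per_token_logprobs) > threshold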

Core Results

  • Average performance gain: 56.6%
  • Peak capability increase: 24.7%
  • Decrease in unreliability: 35.3%

Quick Start

Prerequisites

# Clone the repository
git clone https://github.com/haziq-exe/ERGO.git
cd ERGO
# Install dependencies
pip install -r requirements.txt
  • To use OpenAI models, set the OPENAI_KEY environment variable to your API key (see the example below).

  • You will need to download the sharded dataset from Laban et al.
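
For example, to set the key in a POSIX shell:

export OPENAI_KEY="your-api-key-here"   # placeholder; substitute your own key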

Basic Usage

from experiments.runExperiment import RunExperiment

# Initialize experiment with your chosen model
experiment = RunExperiment(
    model_name="HuggingFaceTB/SmolLM-135M-Instruct",
    device="cpu",
    device_map=None,
    max_new_tokens=1000
)

# Run ERGO on GSM8K dataset
experiment.run_GSM8K(
    dataset_path="sharded_dataset.json",      # path to sharded dataset from Laban et al.
    num_Qs=20,                                # number of questions to evaluate
    num_runs=1,                               # independent runs per question
    threshold=0.5,                            # entropy threshold that triggers a context reset
    output_path="outputs/gsm8k_example.json"  # where results are written
)

Alternatively, run the bundled example script from the repository root:

python -m main.example_main
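
The same constructor arguments should carry over to the larger models from the results table. A hedged sketch for a GPU machine, assuming RunExperiment forwards device and device_map to Hugging Face transformers as the CPU example suggests (the model name is illustrative):

from experiments.runExperiment import RunExperiment

# Hypothetical GPU configuration
experiment = RunExperiment(
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    device_map="auto",        # let transformers place model shards automatically
    max_new_tokens=1000
)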

Repository Structure

ERGO/
│
├── evaluation/         # Evaluation metrics and scoring
│   ├── evaluator.py
│   ├── utils.py
│   └── eval.bfcl.py    # Taken from Laban et al.
│
├── core/               # Core ERGO implementation
│   ├── dataset.py
│   ├── model.py
│   └── utils.py
│
├── experiments/        # Experiment runner
│   └── runExperiment.py
│
├── generation/         # Generate with ERGO
│   └── generator.py
│
└── main/               # Example scripts
    └── example_main.py

Evaluated Tasks

ERGO has been rigorously tested across five diverse generation tasks:

Task          Dataset         Description                          Metric
Math          GSM8K           Elementary math word problems        Exact Match
Code          LiveCodeBench   Python function generation           Test Suite Pass
SQL           Spider          Text-to-SQL query generation         Query Accuracy
API Calls     Berkeley FCL    Function calling from instructions   Call Validity
Data-to-Text  ToTTo           Table caption generation             BLEU Score

Key Results

Average Performance Across Models

Model          FULL   SHARDED   ERGO   Relative Improvement
GPT-4o         79.2   51.4      74.1   +44.2%
GPT-4.1        83.6   56.6      77.0   +36.0%
GPT-4o-mini    73.8   44.3      71.8   +62.1%
Phi-4          64.6   36.4      59.2   +62.6%
LLaMA-3.1-8B   46.0   28.7      50.9   +77.4%

Relative improvement is ERGO over the SHARDED setting (e.g., GPT-4o: 74.1 / 51.4 ≈ +44.2%).

Citation

If you use ERGO in your research, please cite our paper:

@inproceedings{mohammad-khalid-etal-2025-ergo,
    title = "{ERGO}: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models",
    author = "Mohammad Khalid, Haziq  and
      Jeyaganthan, Athikash  and
      Do, Timothy  and
      Fu, Yicheng  and
      Sharma, Vasu  and
      O{'}Brien, Sean  and
      Zhu, Kevin",
    booktitle = "Proceedings of the 2nd Workshop on Uncertainty-Aware NLP (UncertaiNLP 2025)",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.uncertainlp-main.23/",
    pages = "273--286",
    ISBN = "979-8-89176-349-4"
}

Contact

Lead Author: Haziq Mohammad Khalid
📧 haziqkhalid04@gmail.com

Co-Author: Timothy Do
📧 tim.do.info@gmail.com

Code References

The BFCL evaluation script (eval.bfcl.py) and the sharded datasets are taken from Laban et al.
