This repository accompanies the paper *Learning from Emptiness: De-biasing Listwise Rerankers with Content-Agnostic Probability Calibration*.
The paper studies a core failure mode of listwise LLM rerankers: even when the candidate set is semantically identical, the model may still prefer some list positions over others. Our method, CapCal, treats this behavior as an explicit prior, estimates it from content-agnostic inputs, and subtracts it during decoding. The public release of this repository keeps the code needed to reproduce the paper's results.
CapCal is a training-free calibration framework for listwise reranking.
Given a query and a list of candidate passages, we run the frozen reranker twice:
- Standard input: the original query and candidate passages.
- Content-agnostic input: the same query and the same passage identifiers, but every passage body is replaced with a placeholder.
The second pass exposes the model's input-agnostic positional prior. Intuitively, if the model still prefers some document indices when the content is empty, that preference is structural bias rather than semantic relevance.
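To make the two passes concrete, here is a minimal sketch of how the paired inputs could be built. The prompt template, identifier format, placeholder string, and the `build_paired_inputs` helper are all illustrative assumptions, not the repository's actual code:

```python
# Illustrative sketch only: the real prompt template, identifier format,
# and placeholder text live in the repository's source, not here.
def build_paired_inputs(query, passages, placeholder="N/A"):
    ids = [f"[{i + 1}]" for i in range(len(passages))]
    header = f"Query: {query}\nRank the following passages by relevance:\n"
    # Standard input: identifiers plus the real passage bodies.
    standard = header + "\n".join(f"{pid} {p}" for pid, p in zip(ids, passages))
    # Content-agnostic input: same identifiers, every body replaced.
    agnostic = header + "\n".join(f"{pid} {placeholder}" for pid in ids)
    return standard, agnostic
```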
CapCal then calibrates the ranking score by subtracting the excessive prior component from the standard probability:
S(d_i) = P(d_i | x) - α · (P(d_i | x_empty) - 1 / |C_k|)
where:

- `P(d_i | x)` is the identifier-level probability under the original prompt,
- `P(d_i | x_empty)` is the identifier-level probability under the placeholder prompt,
- `|C_k|` is the number of remaining candidates at decoding step `k`,
- `α` is the calibration strength.
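Expressed directly in code, the per-step calibration follows the formula above. The sketch below assumes NumPy arrays of identifier probabilities and is not the repository's exact implementation:

```python
import numpy as np

def capcal_score(p_standard: np.ndarray, p_empty: np.ndarray, alpha: float) -> np.ndarray:
    """Apply S(d_i) = P(d_i | x) - alpha * (P(d_i | x_empty) - 1/|C_k|).

    p_standard / p_empty: probabilities over the |C_k| remaining candidate
    identifiers from the standard and placeholder passes at one step.
    """
    uniform = 1.0 / p_standard.shape[0]  # 1 / |C_k|
    # Only the prior's *excess* over uniform is subtracted, so an unbiased
    # placeholder pass (p_empty == uniform) leaves the score unchanged.
    return p_standard - alpha * (p_empty - uniform)
```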
This repository includes the two variants reported in the main paper:
- Fixed calibration: uses a constant `α`.
- Adaptive calibration: adjusts `α` from model uncertainty, using the entropy of the current candidate distribution (sketched after this list).
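One way the entropy-based adjustment could look is sketched below. The exact mapping from entropy to `α` used in the paper may differ; treat the normalized-entropy scaling and `alpha_max` as assumptions:

```python
import numpy as np

def adaptive_alpha(p_candidates: np.ndarray, alpha_max: float = 1.0) -> float:
    """Scale alpha by the normalized entropy of the current candidate
    distribution: high entropy (an uncertain model) yields stronger
    calibration. A plausible sketch, not the paper's exact schedule."""
    p = np.clip(p_candidates, 1e-12, 1.0)
    entropy = -np.sum(p * np.log(p))
    max_entropy = np.log(p.shape[0])  # entropy of the uniform distribution
    return alpha_max * float(entropy / max_entropy)
```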
To make decoding valid for listwise ranking, the implementation also uses constrained decoding, so the model always produces a legal permutation of document identifiers.
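Constrained decoding here means masking the vocabulary at each step so that only identifiers of not-yet-emitted candidates receive probability mass. A generic sketch follows; the helper name and signature are illustrative, not the repository's API:

```python
import torch

def legal_identifier_probs(logits: torch.Tensor,
                           id_token_ids: list[int],
                           already_emitted: set[int]) -> torch.Tensor:
    """Restrict a next-token distribution to the remaining candidate set C_k."""
    mask = torch.full_like(logits, float("-inf"))
    allowed = [t for t in id_token_ids if t not in already_emitted]
    mask[allowed] = 0.0
    # Softmax over the masked logits renormalizes over legal identifiers only,
    # so the decoded sequence is always a permutation of the candidates.
    return torch.softmax(logits + mask, dim=-1)
```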
The release is intentionally narrow: it focuses on the paper's main experimental surface rather than every exploratory branch developed during the project.
.
├── code/
│ ├── dataloader/ # TREC-COVID and TREC-DL loaders
│ ├── main/ # main experiment entry points and BM25 baseline
│ ├── scripts/ # public Qwen experiment wrappers
│ └── src/ # CapCal decoding, calibration, metrics, utilities
├── docs/
│ ├── assets/ # figures used in the documentation
│ └── REPRODUCTION.md # compact reproduction guide
├── data/ # expected data root (not versioned)
├── results/ # generated outputs
└── requirements.txt
The most important files are:
- `code/src/llm_utils.py`: the retained CapCal implementations, `LLMExp_FixedBias` and `LLMExp_AdaptiveBias`.
- `code/main/main.py`: the main runner for reranking experiments.
- `code/main/calculate_bm25_baseline.py`: BM25 baseline computation.
- `code/scripts/run_qwen_suite.sh`: shared shell runner used by all public model wrappers.
- `code/scripts/qwen*_fixed.sh` and `code/scripts/qwen*_adaptive.sh`: ready-to-run scripts for the Qwen models used in the paper.
The public experiment surface now includes only:
- Models: Qwen2.5-7B-Instruct, Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B
- Datasets: TREC-COVID, TREC DL 2019, TREC DL 2020, TREC DL 2021, TREC DL 2022, TREC DL 2023
- Methods: fixed calibration, adaptive calibration, BM25 baseline
We recommend Python 3.10.
conda create -n rerank-bias python=3.10 -y
conda activate rerank-bias
pip install -r requirements.txt

`requirements.txt` is intentionally minimal and contains only the core runtime packages needed for the public release:

- `torch`
- `transformers`
- `numpy`
- `PyYAML`
- `tqdm`
- `beir`
- `rank-bm25`
- `ir-measures`
If you want a compact step-by-step runbook, see docs/REPRODUCTION.md.
All loaders resolve datasets relative to RERANK_BIAS_DATA_ROOT. If the variable is unset, the repository defaults to ./data.
export RERANK_BIAS_DATA_ROOT=/path/to/your/data
export RERANK_BIAS_HF_HOME=$RERANK_BIAS_DATA_ROOT/huggingface

Expected layout:
$RERANK_BIAS_DATA_ROOT/
├── beir/
│ └── trec-covid/
├── trec-dl-2019/
├── trec-dl-2020/
├── trec/
│ ├── trec21/
│ ├── trec22/
│ └── trec23/
└── msmarco_v2_passage/
You can download the BEIR-formatted trec-covid data with:
python code/dataloader/download_datasets.py --datasets trec-covid

TREC DL data is expected locally:

- `dl19` → `$RERANK_BIAS_DATA_ROOT/trec-dl-2019`
- `dl20` → `$RERANK_BIAS_DATA_ROOT/trec-dl-2020`
- `dl21` → `$RERANK_BIAS_DATA_ROOT/trec/trec21`
- `dl22` → `$RERANK_BIAS_DATA_ROOT/trec/trec22`
- `dl23` → `$RERANK_BIAS_DATA_ROOT/trec/trec23`
For dl21, dl22, and dl23, the loader also needs the MSMARCO v2 passage collection:
export RERANK_MSMARCO_V2_PASSAGE_DIR=/path/to/msmarco_v2_passage

If this variable is unset, the loader will look for `msmarco_v2_passage` under `RERANK_BIAS_DATA_ROOT`.
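The fallback logic amounts to a two-step lookup, sketched here; this is an illustration of the resolution order described above, not the loader's literal code:

```python
import os
from pathlib import Path

# Resolution order: explicit env var first, then a default under the data root.
data_root = Path(os.environ.get("RERANK_BIAS_DATA_ROOT", "./data"))
msmarco_dir = Path(os.environ.get(
    "RERANK_MSMARCO_V2_PASSAGE_DIR",
    str(data_root / "msmarco_v2_passage"),
))
```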
python code/main/main.py \
--dataset_type beir \
--dataset_name trec-covid \
--num_queries 10 \
--dry_run

Run the fixed and adaptive CapCal variants for a single model:

bash code/scripts/qwen3_1_7b_fixed.sh
bash code/scripts/qwen3_1_7b_adaptive.sh

Environment variables override the default experiment grid, for example:

NUM_QUERIES=100 \
BIAS_RATES="1.0 1.5 2.0" \
TREC_DL_DATASETS="dl21 dl22 dl23" \
BEIR_DATASETS="trec-covid" \
bash code/scripts/qwen3_4b_fixed.sh

Compute the BM25 baseline for all datasets:

bash code/scripts/calculate_bm25_baseline_all.sh

By default:
- Qwen experiment outputs are written under `results/qwen/`
- BM25 outputs are written under `results/bm25_baseline/`
If you use this repository, please cite:
@misc{lv2026learningemptinessdebiasinglistwise,
title={Learning from Emptiness: De-biasing Listwise Rerankers with Content-Agnostic Probability Calibration},
author={Hang Lv and Hongchao Gu and Ruiqing Yang and Liangyue Li and Zulong Chen and Defu Lian and Hao Wang and Enhong Chen},
year={2026},
eprint={2604.10150},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2604.10150},
}