This is the code base for the EMNLP 2025 main conference paper "Time is Up! An Empirical Study of LLM Reasoning Ability Under Output Length Constraint" [arXiv, demos].
This project is built upon the open-sourced evaluation project Qwen2.5-Math, where you can find more details. We use the datasets, running scripts (such as eval.sh and eval.py), and the accuracy scoring mechanism from Qwen2.5-Math. Many thanks to the authors : )
We follow the same environment setup as Qwen2.5-Math, listed in requirements.
You can also refer to the configurations used in our paper:
# python version
python==3.10.15
# pkgs for LLM inference
datasets==3.1.0
tokenizers==0.20.3
torch==2.4.0
torchvision==0.19.0
tqdm==4.67.1
transformers==4.46.3
triton==3.0.0
vllm==0.6.3.post1 # this version is crucial for the performance
flash-attn==2.7.0.post2 # you should install flash-attn in the last step since it depends on the version of torch
# pkgs for math eval
sympy==1.12
antlr4-python3-runtime==4.11.1 # ! The version needs to be compatible with sympy.
word2number==1.1
Pebble==5.1.0
timeout-decorator==0.5.0
To study the reasoning ability of different LLMs under strict output token constraints, we make the following modifications:
To enable smooth construction of model inputs, we list all prompt styles and chat templates in prompts. Before inference, questions and system messages replace the placeholders in the templates; this is done in utils.py, as sketched below.
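For illustration, here is a minimal sketch of that substitution (the template string and placeholder names below are assumptions for the example, not the exact ones defined in prompts and utils.py):

```python
# Illustrative sketch only: the real templates and placeholder names live in
# prompts/ and utils.py; the template below is an assumption for the example.
QWEN_STYLE_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def build_input(question: str, system: str = "Please reason step by step.") -> str:
    """Fill the chat-template placeholders with the system message and the question."""
    return QWEN_STYLE_TEMPLATE.format(system=system, question=question)

print(build_input("What is 12 * 7?"))
```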
To impose strict output token limits on LLM reasoning with minimal overhead, we adopt a two-stage inference procedure that leverages the autoregressive nature of LLMs:
- Intact Inference. Set the max-new-tokens parameter of the inference API to a large number, e.g., 4096/8192, which should be large enough for the LLMs to complete inference within the limit on the tested datasets, such as GSM8K and MATH. We then run LLM reasoning on the datasets and log the full reasoning traces to local files.
- Length-Constrained Inference. We then truncate each stage-1 reasoning trace to a length of [token budget - 25] tokens and append it to the end of the original model input. The concatenated text is used as the new input message for LLM reasoning, but this time the max-new-tokens parameter of the inference API is set to exactly 25. The final answer is extracted from the output of this second-stage inference, which is at most 25 tokens long.
The logging, truncation, and concatenation of reasoning traces are done in eval.py and utils.py. Since we have formulated the inputs of different prompt styles in prompts.py and utils.py, the two-stage inference is unified and easy to implement; a minimal sketch of the procedure is given below.
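For reference, here is a minimal sketch of the two-stage procedure using vLLM and a Hugging Face tokenizer (the model name, token budget, and prompt below are placeholder assumptions; the actual implementation lives in eval.py and utils.py):

```python
# Minimal sketch of two-stage, length-constrained inference.
# Assumptions: the model name, token budget, and raw prompt are placeholders;
# in the repo the prompt is built from the templates in prompts/ via utils.py.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen2.5-Math-7B-Instruct"  # placeholder model
ANSWER_TOKENS = 25   # tokens reserved for the second-stage answer
BUDGET = 512         # total output token budget to enforce

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)

prompt = "Question: What is 12 * 7?\nAnswer:"  # placeholder input message

# Stage 1: intact inference with a generous limit; log the full reasoning trace.
params_full = SamplingParams(max_tokens=8192, temperature=0.0)
full_trace = llm.generate([prompt], params_full)[0].outputs[0].text

# Stage 2: truncate the trace to (BUDGET - 25) tokens, append it to the original
# input, and let the model finish within exactly 25 new tokens.
trace_ids = tokenizer.encode(full_trace, add_special_tokens=False)
truncated = tokenizer.decode(trace_ids[: BUDGET - ANSWER_TOKENS])
params_tail = SamplingParams(max_tokens=ANSWER_TOKENS, temperature=0.0)
tail = llm.generate([prompt + truncated], params_tail)[0].outputs[0].text

# The length-constrained output is the truncated trace plus the short tail;
# the final answer is extracted from `tail`.
print(truncated + tail)
```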
Apart from the original datasets provided in Qwen2.5-Math data, we also add two datasets:
- mmlu_stem: a subset of STEM subjects (such as astronomy and biology) defined in MMLU.
- ACPBench: contains both single-step and multi-step reasoning tasks for evaluating actions and plans.
The analysis and plotting scripts are provided in eval/. We also upload the running scripts used in the paper to sh/, which can serve as references for launching the experiments.
@misc{sun2025empiricalstudyllmreasoning,
title={An Empirical Study of LLM Reasoning Ability Under Strict Output Length Constraint},
author={Yi Sun and Han Wang and Jiaqiang Li and Jiacheng Liu and Xiangyu Li and Hao Wen and Yizhen Yuan and Huiwen Zheng and Yan Liang and Yuanchun Li and Yunxin Liu},
year={2025},
eprint={2504.14350},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2504.14350},
}