This is the code base for the EMNLP 2025 main conference paper "Time is Up! An Empirical Study of LLM Reasoning Ability Under Output Length Constraint" [arXiv, demos].
This project is built upon the open-sourced evaluation project Qwen2.5-Math, where you can find more details. We use the datasets, running scripts (such as eval.sh and eval.py), and the accuracy scoring mechanism from Qwen2.5-Math. Many thanks to the authors : )
We follow the same environment setup as Qwen2.5-Math, listed in requirements.
You can also refer to the configurations used in our paper:
# python version
python==3.10.15
# pkgs for LLM inference
datasets==3.1.0
tokenizers==0.20.3
torch==2.4.0
torchvision==0.19.0
tqdm==4.67.1
transformers==4.46.3
triton==3.0.0
vllm==0.6.3.post1 # this version is crucial for the performance
flash-attn==2.7.0.post2 # you should install flash-attn in the last step since it depends on the version of torch
# pkgs for math eval
sympy==1.12
antlr4-python3-runtime==4.11.1 # ! The version needs to be compatible with sympy.
word2number==1.1
Pebble==5.1.0
timeout-decorator==0.5.0
To study the reasoning ability of different LLMs under strict output token constraints, we make the following modifications:
To enable smooth construction of model inputs, we list all prompt styles and chat templates in prompts. Before inference, questions and system messages replace the placeholders in the templates; this is done in utils.py, as sketched below.
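For illustration, here is a minimal sketch of that substitution (the template string and placeholder names below are assumptions for the example, not the exact ones defined in prompts and utils.py):

```python
# Illustrative sketch only: the real templates and placeholder names live in
# prompts/ and utils.py; the template below is an assumption for the example.
QWEN_STYLE_TEMPLATE = (
    "<|im_start|>system\n{system}<|im_end|>\n"
    "<|im_start|>user\n{question}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def build_input(question: str, system: str = "Please reason step by step.") -> str:
    """Fill the chat-template placeholders with the system message and the question."""
    return QWEN_STYLE_TEMPLATE.format(system=system, question=question)

print(build_input("What is 12 * 7?"))
```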
To impose strict output token limits on LLM reasoning with minimal overhead, we adopt a two-stage inference procedure that leverages the autoregressive nature of LLMs:
- Intact Inference. Set the max-new-tokens parameter of the inference API to a large number, e.g., 4096/8192, which should be large enough for the LLMs to complete inference within the limit on the tested datasets, such as GSM8K and MATH. We then run LLM reasoning on the datasets and log the full reasoning traces to local files.
- Length-Constrained Inference. We then truncate each stage-1 reasoning trace to a length of [token budget - 25] tokens and append it to the end of the original model input. The concatenated text is used as the new input message for LLM reasoning, but this time the max-new-tokens parameter of the inference API is set to exactly 25. The final answer is extracted from the output of this second-stage inference, which is at most 25 tokens long.
The logging, truncation, and concatenation of reasoning traces are done in eval.py and utils.py. Since we have formulated the inputs of different prompt styles in prompts.py and utils.py, the two-stage inference is unified and easy to implement; a minimal sketch of the procedure is given below.
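For reference, here is a minimal sketch of the two-stage procedure using vLLM and a Hugging Face tokenizer (the model name, token budget, and prompt below are placeholder assumptions; the actual implementation lives in eval.py and utils.py):

```python
# Minimal sketch of two-stage, length-constrained inference.
# Assumptions: the model name, token budget, and raw prompt are placeholders;
# in the repo the prompt is built from the templates in prompts/ via utils.py.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen2.5-Math-7B-Instruct"  # placeholder model
ANSWER_TOKENS = 25   # tokens reserved for the second-stage answer
BUDGET = 512         # total output token budget to enforce

tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL)

prompt = "Question: What is 12 * 7?\nAnswer:"  # placeholder input message

# Stage 1: intact inference with a generous limit; log the full reasoning trace.
params_full = SamplingParams(max_tokens=8192, temperature=0.0)
full_trace = llm.generate([prompt], params_full)[0].outputs[0].text

# Stage 2: truncate the trace to (BUDGET - 25) tokens, append it to the original
# input, and let the model finish within exactly 25 new tokens.
trace_ids = tokenizer.encode(full_trace, add_special_tokens=False)
truncated = tokenizer.decode(trace_ids[: BUDGET - ANSWER_TOKENS])
params_tail = SamplingParams(max_tokens=ANSWER_TOKENS, temperature=0.0)
tail = llm.generate([prompt + truncated], params_tail)[0].outputs[0].text

# The length-constrained output is the truncated trace plus the short tail;
# the final answer is extracted from `tail`.
print(truncated + tail)
```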
Apart from the original datasets provided in Qwen2.5-Math data, we also add two datasets:
- mmlu_stem: a subset of STEM subjects (such as astronomy and biology) defined in MMLU.
- ACPBench: contains both single-step and multi-step reasoning tasks for evaluating actions and plans.
The analysis and plotting scripts are provided in eval/. We also upload the running scripts used in the paper to sh/, which can serve as references for launching the experiments.
@misc{sun2025empiricalstudyllmreasoning,
title={An Empirical Study of LLM Reasoning Ability Under Strict Output Length Constraint},
author={Yi Sun and Han Wang and Jiaqiang Li and Jiacheng Liu and Xiangyu Li and Hao Wen and Yizhen Yuan and Huiwen Zheng and Yan Liang and Yuanchun Li and Yunxin Liu},
year={2025},
eprint={2504.14350},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2504.14350},
}