I’m using this command to evaluate my fine-tuned 7B model on AIME24 “no figures”:
lm_eval --model vllm \
  --model_args pretrained=../../train/ckpts/s1-20250509_212555,dtype=float32,tensor_parallel_size=1 \
  --tasks aime24_nofigures \
  --batch_size auto \
  --apply_chat_template \
  --output_path s1.1forcingauto \
  --log_samples \
  --gen_kwargs "max_gen_toks=32768,max_tokens_thinking=auto"
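For context, this is how I'm checking what dtype the checkpoint was serialized in (a minimal sketch, assuming a standard Hugging Face layout with a config.json in the checkpoint directory):

import json
from pathlib import Path

# Local checkpoint directory from the command above; adjust as needed.
ckpt = Path("../../train/ckpts/s1-20250509_212555")

# torch_dtype records the dtype the weights were saved in.
cfg = json.loads((ckpt / "config.json").read_text())
print(cfg.get("torch_dtype"))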
The model loads in about 7 seconds, but the run then takes roughly 20 minutes to process just the first example; evaluating the original Qwen-7B under the same settings finishes in a couple of minutes. Switching to dtype=float16 improves throughput dramatically, but with noticeable accuracy degradation. Is there something in my configuration or checkpoint structure that could explain this drastic slowdown under float32?
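In case config.json is out of sync with the shards, I also spot-check the dtype of the actual weight tensors (sketch below, assuming the checkpoint is stored as .safetensors files; uses the safetensors library):

import glob
from safetensors import safe_open

# Inspect the first shard; all shards should share the same dtype.
shard = sorted(glob.glob("../../train/ckpts/s1-20250509_212555/*.safetensors"))[0]
with safe_open(shard, framework="pt") as f:
    first_key = next(iter(f.keys()))
    print(first_key, f.get_tensor(first_key).dtype)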