
Commit 1f25d60

[Fix] Cap max tokens to prevent potential OOM (#3720)
### What this PR does / why we need it?

Caps the calculated maximum number of tokens at 512. This prevents allocating an excessively large buffer when a cudagraph capture size is not specified, mitigating the risk of out-of-memory errors.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

None.

- vLLM version: v0.11.0rc3
- vLLM main: vllm-project/vllm@17c540a

Signed-off-by: Yizhou Liu <[email protected]>
1 parent: 63c363d · commit: 1f25d60

File tree

1 file changed: +3 −1 lines changed

vllm_ascend/worker/model_runner_v1.py

Lines changed: 3 additions & 1 deletion
@@ -586,7 +586,9 @@ def _init_mc2_tokens_capacity(self):
         if self.compilation_config.cudagraph_capture_sizes:
             max_num_tokens = self.compilation_config.cudagraph_capture_sizes[0]
         else:
-            max_num_tokens = self.max_num_reqs * self.uniform_decode_query_len
+            # NOTE: To save memory, we cap the max number of tokens to 512.
+            max_num_tokens = min(
+                self.max_num_reqs * self.uniform_decode_query_len, 512)
         tp_size = self.parallel_config.tensor_parallel_size
         # Use integer arithmetic for ceiling division.
         num_tokens_per_tp_rank = (max_num_tokens + tp_size - 1) // tp_size
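
For context, the sizing logic after this patch amounts to the following. This is a minimal, self-contained sketch that mirrors the attribute names shown in the diff (cudagraph_capture_sizes, max_num_reqs, uniform_decode_query_len, tensor-parallel size); the standalone helper and the constant name are illustrative assumptions, not vllm-ascend API.

```python
MC2_MAX_TOKENS_CAP = 512  # cap introduced by this patch to bound the buffer


def mc2_tokens_per_tp_rank(cudagraph_capture_sizes, max_num_reqs,
                           uniform_decode_query_len, tp_size):
    if cudagraph_capture_sizes:
        # When capture sizes are configured, the first entry sizes the buffer.
        max_num_tokens = cudagraph_capture_sizes[0]
    else:
        # Otherwise cap the fallback estimate so a large max_num_reqs cannot
        # force an oversized (potentially OOM-triggering) allocation.
        max_num_tokens = min(max_num_reqs * uniform_decode_query_len,
                             MC2_MAX_TOKENS_CAP)
    # Ceiling division with integer arithmetic, as in the original code.
    return (max_num_tokens + tp_size - 1) // tp_size


# Example: with no capture sizes, 1024 reqs * 4 decode tokens would mean 4096
# tokens; the cap reduces it to 512, i.e. 64 tokens per rank at tp_size=8.
print(mc2_tokens_per_tp_rank([], max_num_reqs=1024,
                             uniform_decode_query_len=4, tp_size=8))  # -> 64
```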
