Description
Your current environment
The output of python collect_env.py
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.17.0rc1.dev168+gdc6b57846.d20260309 (git sha: dc6b57846, date: 20260309)
vLLM Build Flags:
CUDA Archs: 12.1a; ROCm: Disabled
GPU Topology:
GPU0 NIC0 NIC1 NIC2 NIC3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NODE NODE NODE NODE 0-19 0 N/A
NIC0 NODE X PIX NODE NODE
NIC1 NODE PIX X NODE NODE
NIC2 NODE NODE NODE X PIX
NIC3 NODE NODE NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: rocep1s0f0
NIC1: rocep1s0f1
NIC2: roceP2p1s0f0
NIC3: roceP2p1s0f1
🐛 Describe the bug
When serving Qwen-3.5-397B-A17B with --max-model-len 65536, everything works well. But when serving with --max-model-len 262144, startup fails with KeyError: 'language_model.model.layers.20.linear_attn'.
Interestingly, --max-model-len 65536 no longer works after attempting the 262144 run; I have to reboot my entire cluster before it serves again.
I am running on 3x DGX Spark with GB10 (360 GB total VRAM). vLLM runs via a modified spark-vllm-docker image (this is the image I am using), with TP=1 and PP=3.
I can believe that the modifications I made to NCCL, that docker image, and vLLM's Distributed Executor Backend (see Issue #35848) are causing something to break, but I find it strange that it works at first and then stops.
I don't believe it's a memory issue, since vLLM reports a maximum concurrency of about 19x for 262,144 tokens per request.
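To back up the memory point: the concurrency figure vLLM logs is roughly the total KV-cache token capacity it allocated divided by max_model_len, so a reported 19.13x means the cache was sized successfully before the crash. A rough sanity-check of that arithmetic (the two input numbers come from the log line below; the rest is illustrative, not vLLM internals):

```python
# Back-of-the-envelope check of vLLM's reported max concurrency.
# max_model_len and the 19.13x figure are taken from the log;
# the derived capacity is just their product, for illustration.
max_model_len = 262_144          # --max-model-len
reported_concurrency = 19.13     # from the kv_cache_utils log line

# Total KV-cache token capacity implied by the report:
kv_token_capacity = reported_concurrency * max_model_len
print(f"~{kv_token_capacity:,.0f} KV tokens of capacity")

# ~5M tokens of KV cache were allocated, so the engine got past
# memory profiling -- the failure happens later, during attention
# backend initialization.
```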
The error message is pasted below:
(EngineCore_DP0 pid=1346) INFO 03-10 10:41:15 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 19.13x
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] EngineCore failed to start.
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] Traceback (most recent call last):
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1085, in run_engine_core
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] return func(*args, **kwargs)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 843, in __init__
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] super().__init__(
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] return func(*args, **kwargs)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 281, in _initialize_kv_caches
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 117, in initialize_from_config
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] self.collective_rpc("initialize_from_config", args=(kv_cache_configs,))
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_executor.py", line 510, in collective_rpc
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] return ray.get(ray_worker_outputs, timeout=timeout)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] return fn(*args, **kwargs)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] return func(*args, **kwargs)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 2981, in get
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] values, debugger_breakpoint = worker.get_objects(
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/ray/_private/worker.py", line 1012, in get_objects
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] raise value.as_instanceof_cause()
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ray.exceptions.RayTaskError(KeyError): ray::RayWorkerWrapper.execute_method() (pid=535, ip=192.168.3.107, actor_id=f2e01c90d280ed747d33ab2201000000, repr=<vllm.v1.executor.ray_utils.RayWorkerWrapper object at 0xeb7a0da3cef0>)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 75, in execute_method
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] raise e
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/ray_utils.py", line 65, in execute_method
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] return run_method(self, method, args, kwargs)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] return func(*args, **kwargs)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 310, in initialize_from_config
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] self.worker.initialize_from_config(kv_cache_config) # type: ignore
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] return func(*args, **kwargs)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 557, in initialize_from_config
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] self.model_runner.initialize_kv_cache(kv_cache_config)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 6435, in initialize_kv_cache
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] self.initialize_attn_backend(kv_cache_config)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5872, in initialize_attn_backend
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] attn_backends = get_attn_backends_for_group(kv_cache_group_spec)
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5831, in get_attn_backends_for_group
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] attn_backend = layers[layer_name].get_attn_backend()
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] ~~~~~~^^^^^^^^^^^^
(EngineCore_DP0 pid=1346) ERROR 03-10 10:41:15 [core.py:1111] KeyError: 'language_model.model.layers.20.linear_attn'
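For what it's worth, the traceback suggests the KV cache group spec references a layer name that is missing from the current rank's layers dict, which could plausibly happen under PP if layer 20 lives on a different pipeline stage than the one resolving the group. A toy sketch of that failure mode (all layer names, ownership ranges, and counts here are invented for illustration, not taken from the actual model or from vLLM's code):

```python
# Toy illustration of how the KeyError in get_attn_backends_for_group
# could arise: the group spec lists a layer name that the current
# pipeline-parallel rank does not own. Everything here is hypothetical.
this_rank_layers = {
    f"language_model.model.layers.{i}.linear_attn": object()
    for i in range(21, 42)  # pretend this PP rank owns layers 21-41
}
group_layer_names = [
    f"language_model.model.layers.{i}.linear_attn"
    for i in range(20, 42)  # spec also includes layer 20, owned elsewhere
]

for name in group_layer_names:
    try:
        backend = this_rank_layers[name]  # mirrors layers[layer_name]
    except KeyError as exc:
        print(f"KeyError: {exc}")  # same shape as the error above
        break
```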
Happy to attach more information.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.