Description
Your current environment
Details
GPU: RTX 5090 (32 GB)
vLLM image: vllm/vllm-openai:cu130-nightly
Observed engine version from logs: v0.16.1rc1.dev206+g097eb544e
CUDA in container: cu130 image
Quantization: AWQ (awq_marlin)
Tensor parallel size: 1
Mode: OpenAI-compatible server via Docker
🐛 Describe the bug
When serving Qwen3.5 AWQ models with vLLM on an RTX 5090 (32 GB VRAM), the model loads successfully but crashes as soon as the first inference request is sent.
The server starts normally and finishes loading weights, but the moment a request hits /v1/chat/completions, the engine crashes with:
RuntimeError: Triton Error [CUDA]: out of memory
Important observations:
- Model loads successfully
- Crash only happens during inference
- Happens with very small prompts
- Happens with small models as well (9B AWQ)
Command used:
docker run --gpus all \
-p 8888:8888 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:cu130-nightly \
QuantTrio/Qwen3.5-27B-AWQ \
--port 8888 \
--tensor-parallel-size 1 \
--reasoning-parser qwen3 \
--max-model-len 4096 \
--max-num-batched-tokens 4096 \
--enforce-eager
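Once the container is up, it can help to confirm the server is actually accepting connections before sending the first request. A minimal readiness probe against the /health route the server exposes (host and port assumed from the docker command above):

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout: float = 120.0) -> bool:
    """Poll the vLLM /health endpoint until it responds 200 or timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)  # server still starting up; retry
    return False
```

With the command above, `wait_for_server("http://localhost:8888/health")` returns True once the engine has finished loading; the crash then happens on the very next completion request, not during startup.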
Example request:
{
"model": "QuantTrio/Qwen3.5-27B-AWQ",
"messages": [
{"role": "user", "content": "Hello"}
],
"max_tokens": 64
}
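The same request can be reproduced with a short stdlib-only Python client (model name, port, and path taken from the command and request body above; adjust if your setup differs):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8888"  # port from the docker command above


def build_chat_request(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Build the /v1/chat/completions payload that triggers the crash."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send(payload: dict) -> dict:
    """POST the payload to the running server and return the parsed response."""
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)
```

Calling `send(build_chat_request("QuantTrio/Qwen3.5-27B-AWQ", "Hello"))` against the running container is enough to reproduce the engine crash described below.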
Models tested:
- QuantTrio/Qwen3.5-27B-AWQ
- QuantTrio/Qwen3.5-9B-AWQ
- ykarout/Qwen3.5-9b-nvfp4
- cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit
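For context, none of these checkpoints should be weight-bound on 32 GB. A back-of-envelope sketch (parameter counts inferred from the model names, 4-bit weights at 0.5 bytes/param, ignoring quantization scales and activation overhead; these are assumptions, not measured values):

```python
def weight_gib(params: float, bits: int = 4) -> float:
    """Approximate weight memory in GiB for a quantized checkpoint."""
    return params * bits / 8 / 2**30


# Rough footprints for the models listed above (params taken from the names):
for name, params in [
    ("Qwen3.5-27B-AWQ", 27e9),
    ("Qwen3.5-9B-AWQ", 9e9),
    ("Qwen3.5-35B-A3B-AWQ-4bit", 35e9),
]:
    print(f"{name}: ~{weight_gib(params):.1f} GiB of 4-bit weights")
```

All of these land well under 32 GB (the log below confirms the 9B model loaded in 11.19 GiB with 15.04 GiB left for KV cache), which is consistent with the OOM happening only at inference time.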
Log:
/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e'
"(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e'
"(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
(APIServer pid=1) INFO 03-09 05:00:16 [utils.py:302] vLLM
(APIServer pid=1) INFO 03-09 05:00:16 [utils.py:302] version 0.16.1rc1.dev206+g097eb544e
(APIServer pid=1) INFO 03-09 05:00:16 [utils.py:302] model ykarout/Qwen3.5-9b-nvfp4
(APIServer pid=1) INFO 03-09 05:00:16 [utils.py:238] non-default args: {'model_tag': 'ykarout/Qwen3.5-9b-nvfp4', 'port': 8888, 'model': 'ykarout/Qwen3.5-9b-nvfp4', 'max_model_len': 4096, 'enforce_eager': True, 'reasoning_parser': 'qwen3', 'max_num_batched_tokens': 4096, 'max_num_seqs': 1}
(APIServer pid=1) INFO 03-09 05:00:39 [model.py:530] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) INFO 03-09 05:00:39 [model.py:1553] Using max model len 4096
(APIServer pid=1) INFO 03-09 05:00:40 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=1) INFO 03-09 05:00:40 [config.py:544] Setting attention block size to 528 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 03-09 05:00:40 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) WARNING 03-09 05:00:40 [modelopt.py:984] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=1) INFO 03-09 05:00:40 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 03-09 05:00:40 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 03-09 05:00:40 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 03-09 05:00:40 [vllm.py:957] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=162) INFO 03-09 05:01:36 [core.py:101] Initializing a V1 LLM engine (v0.16.1rc1.dev206+g097eb544e) with config: model='ykarout/Qwen3.5-9b-nvfp4', speculative_config=None, tokenizer='ykarout/Qwen3.5-9b-nvfp4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ykarout/Qwen3.5-9b-nvfp4, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 
'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=162) INFO 03-09 05:01:54 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.17.0.4:56547 backend=nccl
(EngineCore_DP0 pid=162) INFO 03-09 05:01:55 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=162) INFO 03-09 05:02:15 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=162) INFO 03-09 05:02:15 [gpu_model_runner.py:4255] Starting to load model ykarout/Qwen3.5-9b-nvfp4...
(EngineCore_DP0 pid=162) INFO 03-09 05:02:16 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=162) INFO 03-09 05:02:16 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=162) INFO 03-09 05:02:16 [nvfp4_utils.py:85] Using NvFp4LinearBackend.FLASHINFER_CUTLASS for NVFP4 GEMM
(EngineCore_DP0 pid=162) INFO 03-09 05:02:16 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=162) INFO 03-09 05:02:16 [flash_attn.py:587] Using FlashAttention version 2
(EngineCore_DP0 pid=162) INFO 03-09 05:04:45 [weight_utils.py:561] Time spent downloading weights for ykarout/Qwen3.5-9b-nvfp4: 147.726298 seconds
(EngineCore_DP0 pid=162) INFO 03-09 05:04:45 [weight_utils.py:601] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.04s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.04s/it]
(EngineCore_DP0 pid=162)
(EngineCore_DP0 pid=162) INFO 03-09 05:04:46 [default_loader.py:293] Loading weights took 1.05 seconds
(EngineCore_DP0 pid=162) INFO 03-09 05:04:47 [gpu_model_runner.py:4338] Model loading took 11.19 GiB memory and 150.913677 seconds
(EngineCore_DP0 pid=162) INFO 03-09 05:04:47 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=162) INFO 03-09 05:05:01 [gpu_worker.py:424] Available KV cache memory: 15.04 GiB
(EngineCore_DP0 pid=162) INFO 03-09 05:05:01 [kv_cache_utils.py:1314] GPU KV cache size: 123,024 tokens
(EngineCore_DP0 pid=162) INFO 03-09 05:05:01 [kv_cache_utils.py:1319] Maximum concurrency for 4,096 tokens per request: 84.82x
(EngineCore_DP0 pid=162) 2026-03-09 05:05:01,697 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=162) 2026-03-09 05:05:02,350 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(EngineCore_DP0 pid=162) INFO 03-09 05:05:03 [core.py:282] init engine (profile, create kv cache, warmup model) took 15.82 seconds
(EngineCore_DP0 pid=162) INFO 03-09 05:05:03 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=162) WARNING 03-09 05:05:03 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP0 pid=162) WARNING 03-09 05:05:03 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=162) INFO 03-09 05:05:03 [vllm.py:957] Cudagraph is disabled under eager mode
(APIServer pid=1) INFO 03-09 05:05:03 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=1) INFO 03-09 05:05:05 [serving.py:185] Warming up chat template processing...
(APIServer pid=1) INFO 03-09 05:05:18 [hf.py:318] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this.
(APIServer pid=1) INFO 03-09 05:05:18 [serving.py:210] Chat template warmup completed in 13327.2ms
(APIServer pid=1) INFO 03-09 05:05:20 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8888
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /docs, Methods: HEAD, GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 03-09 05:05:20 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO: Started server process [1]
(APIServer pid=1) INFO: Waiting for application startup.
(APIServer pid=1) INFO: Application startup complete.
(EngineCore_DP0 pid=162) /usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py:1181: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (14) < num_heads (16). This may indicate the inputs were passed in head-first format [B, H, T, ...] Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=162) return fn(*args, **kwargs)
(EngineCore_DP0 pid=162) /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py:113: UserWarning: Input tensor shape suggests potential format mismatch: seq_len (14) < num_heads (32). This may indicate the inputs were passed in head-first format [B, H, T, ...] when head_first=False was specified. Please verify your input tensor format matches the expected shape [B, T, H, ...].
(EngineCore_DP0 pid=162) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.16.1rc1.dev206+g097eb544e) with config: model='ykarout/Qwen3.5-9b-nvfp4', speculative_config=None, tokenizer='ykarout/Qwen3.5-9b-nvfp4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ykarout/Qwen3.5-9b-nvfp4, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 
'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []},
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-8edce0cec1e94e59-817b01bd,prompt_token_ids_len=14,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4082, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-8edce0cec1e94e59-817b01bd: 14}, total_num_scheduled_tokens=14, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0042918454935622075, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] engine_core.run_busy_loop()
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] self._process_engine_step()
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 501, in step_with_batch_queue
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] exec_model_fut.result()
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self.__get_result()
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] raise self._exception
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return func(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self.worker.execute_model(scheduler_output)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return func(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 720, in execute_model
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] output = self.model_runner.execute_model(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return func(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3613, in execute_model
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] model_output = self._model_forward(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3126, in _model_forward
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self.model(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 738, in forward
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] hidden_states = self.language_model.model(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 389, in call
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self.forward(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1151, in forward
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] hidden_states, residual = layer(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1045, in forward
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] self.linear_attn(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 186, in forward
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] torch.ops.vllm.gdn_attention_core(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in call
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self._op(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1451, in gdn_attention_core
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] self._forward_core(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 780, in _forward_core
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ) = self.chunk_gated_delta_rule(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py", line 129, in forward
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self._forward_method(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 200, in forward_native
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return fla_chunk_gated_delta_rule(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return fn(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/chunk.py", line 207, in chunk_gated_delta_rule
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] o, final_state = ChunkGatedDeltaRuleFunction.apply(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 583, in apply
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return super().apply(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py", line 113, in wrapper
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/torch/amp/autocast_mode.py", line 477, in decorate_fwd
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return fwd(*args, **kwargs) # pyrefly: ignore [not-callable]
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/chunk.py", line 94, in forward
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] g, o, A, final_state, w, h, v_new = chunk_gated_delta_rule_fwd(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/chunk.py", line 40, in chunk_gated_delta_rule_fwd
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py", line 113, in wrapper
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/solve_tril.py", line 545, in solve_tril
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] merge_fn[NT, B * H](
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 370, in <lambda>
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 459, in run
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self.fn.run(*args, **kwargs)
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 240, in run
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] benchmark()
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 229, in benchmark
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 164, in _bench
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/triton/testing.py", line 149, in do_bench
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] fn()
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 150, in kernel_call
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] self.fn.run(
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 744, in run
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/driver.py", line 713, in __call__
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, self.launch_pdl,
(EngineCore_DP0 pid=162) ERROR 03-09 05:07:01 [core.py:1102] RuntimeError: Triton Error [CUDA]: out of memory
(EngineCore_DP0 pid=162) Process EngineCore_DP0:
(EngineCore_DP0 pid=162) Traceback (most recent call last):
(EngineCore_DP0 pid=162) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=162) self.run()
(EngineCore_DP0 pid=162) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=162) self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=162) raise e
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=162) engine_core.run_busy_loop()
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=162) self._process_engine_step()
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=162) outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 501, in step_with_batch_queue
(EngineCore_DP0 pid=162) exec_model_fut.result()
(EngineCore_DP0 pid=162) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=162) return self.__get_result()
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=162) raise self._exception
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore_DP0 pid=162) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=162) return func(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model
(EngineCore_DP0 pid=162) return self.worker.execute_model(scheduler_output)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=162) return func(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 720, in execute_model
(EngineCore_DP0 pid=162) output = self.model_runner.execute_model(
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=162) return func(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3613, in execute_model
(EngineCore_DP0 pid=162) model_output = self._model_forward(
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3126, in _model_forward
(EngineCore_DP0 pid=162) return self.model(
(EngineCore_DP0 pid=162) ^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=162) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=162) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 738, in forward
(EngineCore_DP0 pid=162) hidden_states = self.language_model.model(
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 389, in __call__
(EngineCore_DP0 pid=162) return self.forward(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-09 05:07:01 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-09 05:07:01 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-09 05:07:01 [async_llm.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1) ERROR 03-09 05:07:01 [async_llm.py:708] outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-09 05:07:01 [async_llm.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-09 05:07:01 [async_llm.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1) ERROR 03-09 05:07:01 [async_llm.py:708] raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 03-09 05:07:01 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1151, in forward
(EngineCore_DP0 pid=162) hidden_states, residual = layer(
(EngineCore_DP0 pid=162) ^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=162) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=162) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1045, in forward
(EngineCore_DP0 pid=162) self.linear_attn(
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=162) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=162) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_5.py", line 186, in forward
(EngineCore_DP0 pid=162) torch.ops.vllm.gdn_attention_core(
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__
(EngineCore_DP0 pid=162) return self._op(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 1451, in gdn_attention_core
(EngineCore_DP0 pid=162) self._forward_core(
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 780, in _forward_core
(EngineCore_DP0 pid=162) ) = self.chunk_gated_delta_rule(
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore_DP0 pid=162) return self._call_impl(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore_DP0 pid=162) return forward_call(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/custom_op.py", line 129, in forward
(EngineCore_DP0 pid=162) return self._forward_method(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/qwen3_next.py", line 200, in forward_native
(EngineCore_DP0 pid=162) return fla_chunk_gated_delta_rule(
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1181, in _fn
(EngineCore_DP0 pid=162) return fn(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/chunk.py", line 207, in chunk_gated_delta_rule
(EngineCore_DP0 pid=162) o, final_state = ChunkGatedDeltaRuleFunction.apply(
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/autograd/function.py", line 583, in apply
(EngineCore_DP0 pid=162) return super().apply(*args, **kwargs) # type: ignore[misc]
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py", line 113, in wrapper
(EngineCore_DP0 pid=162) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/torch/amp/autocast_mode.py", line 477, in decorate_fwd
(EngineCore_DP0 pid=162) return fwd(*args, **kwargs) # pyrefly: ignore [not-callable]
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/chunk.py", line 94, in forward
(EngineCore_DP0 pid=162) g, o, A, final_state, w, h, v_new = chunk_gated_delta_rule_fwd(
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/chunk.py", line 40, in chunk_gated_delta_rule_fwd
(EngineCore_DP0 pid=162) A = solve_tril(A=A, cu_seqlens=cu_seqlens, output_dtype=k.dtype)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/utils.py", line 113, in wrapper
(EngineCore_DP0 pid=162) return fn(*contiguous_args, **contiguous_kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fla/ops/solve_tril.py", line 545, in solve_tril
(EngineCore_DP0 pid=162) merge_fn[NT, B * H](
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 370, in <lambda>
(EngineCore_DP0 pid=162) return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 459, in run
(EngineCore_DP0 pid=162) return self.fn.run(*args, **kwargs)
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 240, in run
(EngineCore_DP0 pid=162) benchmark()
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 229, in benchmark
(EngineCore_DP0 pid=162) timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 164, in _bench
(EngineCore_DP0 pid=162) return self.do_bench(kernel_call, quantiles=(0.5, 0.2, 0.8))
(EngineCore_DP0 pid=162) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/triton/testing.py", line 149, in do_bench
(EngineCore_DP0 pid=162) fn()
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/autotuner.py", line 150, in kernel_call
(EngineCore_DP0 pid=162) self.fn.run(
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 744, in run
(EngineCore_DP0 pid=162) kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
(EngineCore_DP0 pid=162) File "/usr/local/lib/python3.12/dist-packages/triton/backends/nvidia/driver.py", line 713, in __call__
(EngineCore_DP0 pid=162) self.launch(gridX, gridY, gridZ, stream, function, self.launch_cooperative_grid, self.launch_pdl,
(EngineCore_DP0 pid=162) RuntimeError: Triton Error [CUDA]: out of memory
(APIServer pid=1) INFO: 192.168.1.64:47888 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
[rank0]:[W309 05:07:02.567626189 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
(APIServer pid=1) INFO: Finished server process [1]