
[Bug]: Qwen3.5-35B-A3B-FP8 inference output terminates unexpectedly; logs look normal but the request hangs #36736

@RagnarokChan

Description


Environment:

  • vLLM version: 0.17+ (CUDA 13.0)
  • Model: Qwen/Qwen3.5-35B-A3B-FP8
  • GPU: RTX 5090D × 2
  • Open WebUI version: 0.8.10
  • Launch command:
python3 -m vllm.entrypoints.openai.api_server \
  --model /home/ragnarokchan/models/Qwen3.5-35B-A3B-FP8 \
  --served-model-name Qwen3.5-35B-A3B-FP8 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --enable-chunked-prefill \
  --max-num-seqs 16 \
  --max-model-len 65536 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --calculate-kv-scales \
  --reasoning-parser qwen3
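
One low-cost triage step (my suggestion; VLLM_LOGGING_LEVEL is vLLM's documented logging switch) is to relaunch the server with debug logging enabled, so per-request lifecycle events show up alongside the throughput stats in the logs below:

# A documented vLLM troubleshooting switch: request lifecycle events
# (enqueue/finish/abort) are logged at DEBUG, which the INFO lines omit.
export VLLM_LOGGING_LEVEL=DEBUG
# ...then relaunch the server with the exact launch command above.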

Bug Description:
When using Open WebUI to call vLLM for inference, generation suddenly terminates partway through. The logs look normal and the request is recorded as 200 OK, but the client hangs and never receives the complete output.

The vLLM service itself does not crash. Re-sending the prompt (with priority) or opening a new chat lets inference continue, but the same issue soon recurs.
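
A useful isolation step (a suggestion, not verified against this exact setup): call the server directly with curl, bypassing Open WebUI. vLLM's OpenAI-compatible endpoint terminates a healthy SSE stream with a finish_reason chunk followed by the "data: [DONE]" sentinel; if that sentinel never arrives, the truncation is server-side, and if it arrives while Open WebUI still hangs, the problem is on the client/proxy side.

# Stream directly from vLLM, bypassing Open WebUI; -N disables buffering.
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.5-35B-A3B-FP8",
    "messages": [{"role": "user", "content": "Write a long story."}],
    "stream": true,
    "max_tokens": 2048
  }'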

Steps to Reproduce:

  1. Start vLLM service (configuration as above)
  2. Send a chat request via Open WebUI
  3. The model starts generating output but stops midway
  4. The client never receives the complete response; the request appears successful but the content is truncated (a scripted version of these steps follows this list)
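
Since the failure recurs quickly, steps 2-4 can be scripted to quantify it. A minimal sketch, assuming the host, port, and model name from the launch command above ("data: [DONE]" is the standard SSE termination sentinel of the OpenAI-compatible API; --max-time keeps a hung stream from blocking the loop):

# repro_loop.sh - fire several streamed requests in sequence and count
# how many fail to deliver the "data: [DONE]" sentinel.
for i in $(seq 1 10); do
  out=$(curl -sN --max-time 300 http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "Qwen3.5-35B-A3B-FP8",
         "messages": [{"role": "user", "content": "Write a long story."}],
         "stream": true, "max_tokens": 1024}' | tail -c 64)
  case "$out" in
    *"[DONE]"*) echo "request $i: completed normally" ;;
    *)          echo "request $i: truncated or hung (no [DONE] sentinel)" ;;
  esac
done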

Logs:

(APIServer pid=58580) INFO 03-11 11:13:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 269.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:13:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 182.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:48 [loggers.py:259] Engine 000: Avg prompt throughput: 425.2 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 148.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.4%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56267 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:14:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%

Question:
Any suggestions for workarounds or fixes for this issue?
