Description
Environment:
- vLLM version: 0.17+ (CUDA 130)
- Model: Qwen/Qwen3.5-35B-A3B-FP8
- GPU: RTX 5090D × 2
- Open WebUI version: 0.8.10
- Launch command:
python3 -m vllm.entrypoints.openai.api_server \
--model /home/ragnarokchan/models/Qwen3.5-35B-A3B-FP8 \
--served-model-name Qwen3.5-35B-A3B-FP8 \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 2 \
--enable-chunked-prefill \
--max-num-seqs 16 \
--max-model-len 65536 \
--tool-call-parser qwen3_coder \
--enable-auto-tool-choice \
--calculate-kv-scales \
--reasoning-parser qwen3

Bug Description:
When using Open WebUI to call vLLM for inference, the output suddenly terminates mid-generation. The logs look normal and the request returns 200 OK, but the client hangs and never receives the complete output.
The vLLM service itself does not crash. Re-sending the prompt (with priority) or opening a new chat lets inference continue, but the same issue recurs quickly.
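One way to confirm where the truncation happens is to capture the raw SSE events of a streamed /v1/chat/completions response and check whether a `finish_reason` chunk and the terminal `data: [DONE]` event ever arrive. A minimal sketch (the helper name `audit_sse_stream` is mine, not part of vLLM or Open WebUI):

```python
import json

def audit_sse_stream(lines):
    """Scan raw SSE lines from a streamed /v1/chat/completions response.

    Returns (text, finish_reason, saw_done). If finish_reason is None or
    saw_done is False, the stream was cut off before completing normally.
    """
    text_parts = []
    finish_reason = None
    saw_done = False
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore SSE comments and keep-alive lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            saw_done = True
            continue
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                text_parts.append(delta["content"])
            if choice.get("finish_reason") is not None:
                finish_reason = choice["finish_reason"]
    return "".join(text_parts), finish_reason, saw_done
```

If the stream ends without either marker, the server (or a proxy in between) dropped the connection mid-generation rather than finishing the request.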
Steps to Reproduce:
- Start vLLM service (configuration as above)
- Send a chat request via Open WebUI
- Model starts generating output, but stops mid-way
- Client cannot get the complete response; the request appears successful but the content is truncated
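The same request can be replayed directly against vLLM, bypassing Open WebUI, to check whether the stream ends cleanly at the source. A sketch assuming the server from the launch command above is reachable on localhost:8000 (the endpoint path and served model name come from that command; the prompt and token limit are illustrative):

```python
import json
import urllib.request

def build_request(model, prompt, max_tokens=512):
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": max_tokens,
    }

if __name__ == "__main__":
    body = json.dumps(
        build_request("Qwen3.5-35B-A3B-FP8", "Write a long essay about networking.")
    ).encode()
    req = urllib.request.Request(
        "http://localhost:8000/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for raw in resp:
            line = raw.decode().strip()
            if line:
                # a healthy stream ends with a finish_reason chunk, then 'data: [DONE]'
                print(line)
```

If the direct replay completes but the same prompt truncates through Open WebUI, the problem is likely in the client or a proxy timeout rather than in vLLM itself.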
Logs:
(APIServer pid=58580) INFO 03-11 11:13:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 269.0 tokens/s, Running: 2 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56056 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:13:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 182.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:48 [loggers.py:259] Engine 000: Avg prompt throughput: 425.2 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:13:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 148.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 147.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.4%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO: 192.168.100.152:56267 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=58580) INFO 03-11 11:14:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 77.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=58580) INFO 03-11 11:14:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Question:
Any suggestions for workarounds or fixes for this issue?