Proposal to improve performance
No response
Report of performance regression
Performance observations were conducted for vLLM v0.15.0 in comparison with the ROCm-forked vLLM v0.14.0 (now deprecated). Testing was executed on a server equipped with 8× AMD Instinct MI300X GPUs.
Benchmarking was performed using the vLLM bench utility across eight Docker-based vLLM serving instances of the model Qwen3-30B-A3B-Thinking-2507. Traffic distribution across the serving instances was handled through nginx load balancing.
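The report does not include the nginx configuration itself, but distributing traffic across eight serving instances is typically done with an `upstream` block. A minimal sketch under assumed names and ports (the `vllm0`–`vllm7` hostnames and listen port are illustrative, not taken from the actual setup):

```nginx
upstream vllm_backends {
    # Eight vLLM serving containers behind one endpoint.
    # least_conn spreads long-lived completion requests more evenly
    # than the default round-robin under high concurrency.
    least_conn;
    server vllm0:8000;
    server vllm1:8000;
    server vllm2:8000;
    server vllm3:8000;
    server vllm4:8000;
    server vllm5:8000;
    server vllm6:8000;
    server vllm7:8000;
}

server {
    listen 8000;
    location / {
        proxy_pass http://vllm_backends;
        proxy_read_timeout 600s;  # streamed completions can run long
    }
}
```

With `--max-concurrency 4096` on the client side, each backend would see roughly 512 concurrent requests under an even spread.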
The command executed is shown below:

```shell
HF_HOME=/mnt/models HUGGINGFACE_HUB_CACHE=/mnt/models/hub TRANSFORMERS_CACHE=/mnt/models/hub \
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-30B-A3B-Thinking-2507 \
  --tokenizer /mnt/models/hub/models--Qwen--Qwen3-30B-A3B-Thinking-2507/snapshots/144afc2f379b542fdd4e85a1fcd5e1f79112d95d \
  --host localhost --port 8000 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --output-len 128 \
  --num-prompts 4096 \
  --max-concurrency 4096 2>&1 | tee "$LOG"
```
Benchmark results are attached along with the Docker Compose configuration used for testing. The configuration was identical across both runs; the only change was the container image name.
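Since the attachment is not reproduced here, a hypothetical sketch of what one such per-GPU service entry might look like on ROCm (service name, image tag, command, and paths are all assumptions, not the attached config):

```yaml
# Sketch only: each of the eight services pins one MI300X by passing
# a distinct /dev/dri render node; vllm1..vllm7 differ only in that node.
services:
  vllm0:
    image: vllm/vllm-openai:v0.15.0   # the only field changed between runs
    command: vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --port 8000
    devices:
      - /dev/kfd
      - /dev/dri/renderD128
    volumes:
      - /mnt/models:/mnt/models
    environment:
      HF_HOME: /mnt/models
      HUGGINGFACE_HUB_CACHE: /mnt/models/hub
```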
Misc discussion on performance
Benchmark results indicate a performance difference between vLLM v0.15.0 and the ROCm forked v0.14.0. Under the tested configuration, v0.14.0 demonstrates higher throughput, achieving approximately 1.34× greater request throughput and output token throughput compared to v0.15.0.
However, v0.15.0 shows slightly lower (i.e., better) time-to-first-token (TTFT) than v0.14.0.
The throughput gap widens further when v0.15.0 is compared against even older releases of rocm/vllm.
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.