Proposal to improve performance
No response
Report of performance regression
Performance observations were conducted for vLLM v0.15.0 in comparison with the ROCm-forked vLLM v0.14.0 (now deprecated). Testing was executed on a server equipped with 8× AMD Instinct MI300X GPUs.
Benchmarking was performed using the vLLM bench utility across eight Docker-based vLLM serving instances of the model Qwen3-30B-A3B-Thinking-2507. Traffic distribution across the serving instances was handled through nginx load balancing.
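The report does not include the nginx configuration itself, but distributing traffic across eight serving instances is typically done with an `upstream` block. A minimal sketch under assumed names and ports (the `vllm0`–`vllm7` hostnames and listen port are illustrative, not taken from the actual setup):

```nginx
upstream vllm_backends {
    # Eight vLLM serving containers behind one endpoint.
    # least_conn spreads long-lived completion requests more evenly
    # than the default round-robin under high concurrency.
    least_conn;
    server vllm0:8000;
    server vllm1:8000;
    server vllm2:8000;
    server vllm3:8000;
    server vllm4:8000;
    server vllm5:8000;
    server vllm6:8000;
    server vllm7:8000;
}

server {
    listen 8000;
    location / {
        proxy_pass http://vllm_backends;
        proxy_read_timeout 600s;  # streamed completions can run long
    }
}
```

With `--max-concurrency 4096` on the client side, each backend would see roughly 512 concurrent requests under an even spread.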
The command executed is shown below:

```shell
HF_HOME=/mnt/models HUGGINGFACE_HUB_CACHE=/mnt/models/hub TRANSFORMERS_CACHE=/mnt/models/hub \
vllm bench serve \
  --backend vllm \
  --model Qwen/Qwen3-30B-A3B-Thinking-2507 \
  --tokenizer /mnt/models/hub/models--Qwen--Qwen3-30B-A3B-Thinking-2507/snapshots/144afc2f379b542fdd4e85a1fcd5e1f79112d95d \
  --host localhost --port 8000 \
  --endpoint /v1/completions \
  --dataset-name sharegpt \
  --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
  --output-len 128 \
  --num-prompts 4096 \
  --max-concurrency 4096 2>&1 | tee "$LOG"
```
Benchmark results are attached along with the Docker Compose configuration used for testing. The configuration was identical across both runs; the only change was the container image name.
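Since the attachment is not reproduced here, a hypothetical sketch of what one such per-GPU service entry might look like on ROCm (service name, image tag, command, and paths are all assumptions, not the attached config):

```yaml
# Sketch only: each of the eight services pins one MI300X by passing
# a distinct /dev/dri render node; vllm1..vllm7 differ only in that node.
services:
  vllm0:
    image: vllm/vllm-openai:v0.15.0   # the only field changed between runs
    command: vllm serve Qwen/Qwen3-30B-A3B-Thinking-2507 --port 8000
    devices:
      - /dev/kfd
      - /dev/dri/renderD128
    volumes:
      - /mnt/models:/mnt/models
    environment:
      HF_HOME: /mnt/models
      HUGGINGFACE_HUB_CACHE: /mnt/models/hub
```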
Misc discussion on performance
Benchmark results indicate a performance difference between vLLM v0.15.0 and the ROCm forked v0.14.0. Under the tested configuration, v0.14.0 demonstrates higher throughput, achieving approximately 1.34× greater request throughput and output token throughput compared to v0.15.0.
However, v0.15.0 shows slightly lower (i.e., better) time-to-first-token (TTFT) than v0.14.0.
The throughput gap widens further when v0.15.0 is compared against even older releases of rocm/vllm.
Your current environment (if you think it is necessary)
The output of `python collect_env.py`
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.