Conversation

@ganyi1996ppo commented Oct 20, 2025

Purpose

The AITERMLABackend currently supports only the block-size=1 scenario for inference. This constraint can introduce serious host overhead when allocating or freeing cache blocks for long-context requests, since there may be a large number of blocks to operate on.

In this PR, we remap the block_table to the block-size=1 layout at every step in AITERMLAMetadataBuilder, which alleviates the host overhead of allocating and deallocating blocks. This change also lets AITERMLABackend support a wider range of block sizes, aligning the backend on the ROCm platform with vLLM's official usage and making it more flexible.
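As a rough illustration of the idea, here is a minimal sketch of such a remapping in torch, assuming a `block_table` of shape `[num_reqs, max_num_blocks]` and a per-request `seq_lens` tensor (the function name and exact shapes are illustrative, not the PR's actual code):

```python
import torch

def remap_to_token_granularity(block_table: torch.Tensor,
                               seq_lens: torch.Tensor,
                               block_size: int) -> torch.Tensor:
    """Expand a [num_reqs, max_num_blocks] block table whose entries are
    block IDs at `block_size` granularity into flat per-token slot indices,
    i.e. the layout the kernel expects for block-size=1."""
    num_reqs, max_num_blocks = block_table.shape
    # Offsets of each token inside its block: [0, 1, ..., block_size - 1].
    offsets = torch.arange(block_size, device=block_table.device)
    # [num_reqs, max_num_blocks, block_size] -> [num_reqs, max_tokens].
    token_table = (block_table.unsqueeze(-1) * block_size + offsets) \
        .reshape(num_reqs, max_num_blocks * block_size)
    # Keep only the slots that fall inside each request's sequence.
    positions = torch.arange(max_num_blocks * block_size,
                             device=block_table.device)
    mask = positions.unsqueeze(0) < seq_lens.unsqueeze(1)
    return token_table[mask]  # per-token indices, concatenated per request
```

Since this is a handful of vectorized tensor ops recomputed per step, the host no longer has to manage cache blocks at single-token granularity.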

Test Plan

Verified accuracy on gsm8k; performance improvements will be attached later.
Test script:


```bash
export VLLM_USE_V1=1
export SAFETENSORS_FAST_GPU=1
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_USE_TRITON_FLASH_ATTN=0
export NCCL_DEBUG=WARN
export VLLM_RPC_TIMEOUT=1800000
export VLLM_ROCM_USE_AITER_ASMMOE=1
export VLLM_ROCM_USE_AITER_MHA=0
export VLLM_ROCM_USE_TRITON_ROPE=1

model_path="deepseek-r1-FP8-Dynamic"
vllm serve $model_path \
  --tensor-parallel-size 8 \
  --max-num-batched-tokens 32768 \
  --trust-remote-code \
  --no-enable-prefix-caching \
  --disable-log-requests \
  --gpu_memory_utilization 0.9 \
  --block-size 128 \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
```

Test Result

```
# gsm8k test
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9507|±  |0.0060|
|     |       |strict-match    |     5|exact_match|↑  |0.9484|±  |0.0061|
```

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@ganyi1996ppo ganyi1996ppo requested a review from gshtras as a code owner October 20, 2025 20:30
@mergify mergify bot added rocm Related to AMD ROCm v1 labels Oct 20, 2025
Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request enables support for block sizes greater than 1 for the AITER MLA backend on ROCm, which was previously a limitation. The approach of remapping the block table to token-level indices to match the expectation of the underlying AITER kernel is sound. The implementation is largely correct, but I've identified a potential issue with a hardcoded device string that could lead to runtime errors in multi-GPU environments. Addressing this will improve the robustness of the change.
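For illustration, the kind of fix this review is pointing at usually looks like the following; the function and tensor names here are hypothetical, not the PR's actual code:

```python
import torch

def build_token_positions(block_table: torch.Tensor,
                          max_tokens: int) -> torch.Tensor:
    # Fragile: a hardcoded device breaks on workers bound to cuda:1, cuda:2,
    # ... under tensor parallelism.
    # positions = torch.arange(max_tokens, device="cuda:0")

    # Robust: inherit the device from a tensor that already lives on the
    # correct GPU for this worker.
    return torch.arange(max_tokens, device=block_table.device)
```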

Signed-off-by: ganyi <[email protected]>
@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


Excerpt from the diff under review:

```python
).unsqueeze(0) < seq_lens_device.unsqueeze(1)
paged_kv_indices = block_table_tensor[mask]

paged_kv_last_page_len = seq_lens_device % page_size
```


P1: Recompute last page lengths after token-level remapping

After expanding each block table entry into per-token indices, the code still derives paged_kv_last_page_len from the original page_size (seq_lens % page_size, falling back to page_size). Once the remapping is done, each entry represents a single token, so the last-page length for any non-empty request should always be 1. Keeping the old computation causes the decode kernel to believe that the final page contains page_size tokens (e.g. 128) and it will read that many elements starting from the last token’s index, potentially stepping past the valid token range when block_size > 1. This defeats the goal of supporting larger block sizes and can lead to out-of-bounds accesses or garbage attention results for any request longer than one token.
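A sketch of the corrected computation this review is asking for, with hypothetical helper names: at token granularity the last-page length is constant, whereas the original block-granular form uses the remainder with a full-page fallback.

```python
import torch

def last_page_lens_token_granularity(seq_lens: torch.Tensor) -> torch.Tensor:
    # After remapping, each page holds exactly one token, so the last page
    # of any non-empty request contains exactly one element.
    return torch.ones_like(seq_lens)

def last_page_lens_block_granularity(seq_lens: torch.Tensor,
                                     page_size: int) -> torch.Tensor:
    # Pre-remapping form: seq_len modulo page_size, with a full page when
    # the sequence length is an exact multiple of page_size.
    lens = seq_lens % page_size
    return torch.where(lens == 0, torch.full_like(lens, page_size), lens)
```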


@ganyi1996ppo (Author)

@HAIAI Please kindly help to review this PR.

@tjtanaa (Contributor) commented Oct 22, 2025

Just sharing the performance metrics of this amazing optimization PR. There is an improvement even in the originally supported block-size=1 case. 🚀

Here's a comparison table of the benchmark results on DeepSeek-R1 PTPC FP8:

General Metrics

| Metric | Before PR (Block-size 1) | After PR (Block-size 1) | After PR (Block-size 16) | Best Performance |
|---|---:|---:|---:|---|
| Successful requests | 320 | 320 | 320 | All equal |
| Benchmark duration (s) | 359.51 | 354.92 | 360.31 | Block-size 1 |
| Total generated tokens | 298,762 | 294,597 | 300,636 | Block-size 16 |
| Request throughput (req/s) | 0.89 | 0.90 | 0.89 | Block-size 1 |
| Output token throughput (tok/s) | 831.03 | 830.05 | 834.37 | Block-size 16 |
| Peak output token throughput (tok/s) | 1,056.00 | 1,088.00 | 1,088.00 | Block-size 1 & 16 |
| Total token throughput (tok/s) | 4,020.30 | 4,060.55 | 4,016.48 | Block-size 1 |

Latency Metrics

| Latency Metric | Before PR | After PR (Block-size 1) | After PR (Block-size 16) | Best Performance |
|---|---:|---:|---:|---|
| Mean TTFT (ms) | 1,923.96 | 1,522.36 | 1,742.48 | After PR (Block-size 1) |
| Median TTFT (ms) | 1,686.80 | 1,411.06 | 1,655.13 | After PR (Block-size 1) |
| P99 TTFT (ms) | 5,553.48 | 5,530.39 | 5,531.85 | After PR (Block-size 1) |
| Mean TPOT (ms) | 57.56 | 53.10 | 58.82 | After PR (Block-size 1) |
| Median TPOT (ms) | 35.44 | 36.28 | 35.15 | After PR (Block-size 16) |
| P99 TPOT (ms) | 721.81 | 805.45 | 647.23 | After PR (Block-size 16) |
| Mean ITL (ms) | 35.03 | 35.79 | 34.86 | After PR (Block-size 16) |
| Median ITL (ms) | 31.29 | 31.43 | 31.02 | After PR (Block-size 16) |
| P99 ITL (ms) | 209.77 | 211.47 | 210.17 | Before PR |

Workload

```bash
#!/bin/bash
PORT=8000
SEED=0
CONCURRENCY=32
NREQUESTS=$(($CONCURRENCY * 10))
ISL=3584
OSL=1024
vllm bench serve --backend vllm \
  --model EmbeddedLLM/deepseek-r1-FP8-Dynamic \
  --dataset-name random \
  --num-prompts ${NREQUESTS} \
  --random-input ${ISL} \
  --random-output ${OSL} \
  --seed ${SEED} \
  --max-concurrency ${CONCURRENCY} --port ${PORT}
```

@wuhuikx mentioned this pull request Oct 22, 2025
@ganyi1996ppo (Author)

> Just sharing the performance metrics of this amazing optimization PR. There is an improvement even in the originally supported block-size=1 case. 🚀

@tjtanaa Thanks for the benchmark metrics you shared! I'm actually quite surprised that block-size=1 gets a performance boost; we tested this PR with block-size=128 and saw a slight improvement over the previous block-size=1 results.

@tjtanaa (Contributor) commented Oct 22, 2025

@ganyi1996ppo Can you share the performance numbers from your experiment for block-size=128 vs block-size=1?
