
[V1] port xformers backend to v1 #21342

Open

TheEpicDolphin wants to merge 1 commit into main from xformers_attention_v1

Conversation

@TheEpicDolphin commented Jul 22, 2025

Purpose

Port over the xformers backend to the v1 engine. There are several benefits to using XFormers, including:

  1. A built-in heuristic that determines which attention implementation is best suited for the given inputs (see the sketch after this list).
  2. AMD kernel support.
  3. Well suited for certain Meta models.
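
For context on item 1, the heuristic lives behind xformers' memory_efficient_attention entry point, which inspects the inputs and dispatches to whichever of its registered operators it considers best for them. A minimal standalone sketch of that call, separate from this PR's backend code (shapes and scale are illustrative):

```python
import torch
import xformers.ops as xops

# Illustrative shapes: [batch, seq_len, num_heads, head_dim].
q = torch.randn(2, 1024, 8, 128, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 128, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 128, device="cuda", dtype=torch.float16)

# memory_efficient_attention picks a concrete kernel for these
# shapes/dtype/device; LowerTriangularMask gives causal (decoder) masking.
out = xops.memory_efficient_attention(
    q, k, v,
    attn_bias=xops.LowerTriangularMask(),
    scale=1.0 / 128**0.5,
)
```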

Test Plan

Added a test case to test_attention_backends which verifies the correctness of the xformers v1 backend's attention output.

(py312conda) bash-5.1$ pytest tests/v1/attention/test_attention_backends.py -k test_backend_correctness
================================================================================ test session starts ================================================================================
platform linux -- Python 3.12.9, pytest-8.4.1, pluggy-1.6.0
rootdir: /data/users/gdelfin/gitrepos/vllm
configfile: pyproject.toml
plugins: anyio-4.9.0
collected 6 items                                                                                                                                                                   

tests/v1/attention/test_attention_backends.py ......                                                                                                                          [100%]

================================================================================= warnings summary ==================================================================================
tests/v1/attention/test_attention_backends.py::test_backend_correctness[meta-llama/Meta-Llama-3-8B-small_decode]
tests/v1/attention/test_attention_backends.py::test_backend_correctness[meta-llama/Meta-Llama-3-8B-small_decode]
tests/v1/attention/test_attention_backends.py::test_backend_correctness[meta-llama/Meta-Llama-3-8B-small_decode]
tests/v1/attention/test_attention_backends.py::test_backend_correctness[meta-llama/Meta-Llama-3-8B-small_decode]
  /home/gdelfin/.conda/envs/py312conda/lib/python3.12/site-packages/triton/runtime/autotuner.py:108: DeprecationWarning: warmup, rep, and use_cuda_graph parameters are deprecated. See https://github.com/triton-lang/triton/pull/4496 for details.
    warnings.warn(("warmup, rep, and use_cuda_graph parameters are deprecated. See "

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================== 6 passed, 4 warnings in 18.72s ===========================================================================
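
The core of the correctness check is comparing the backend's output against a simple reference implementation; a minimal sketch of that idea (not the actual test code, and the tolerances and tensor layout here are assumptions):

```python
import torch
import torch.nn.functional as F

def reference_attention(q, k, v, scale):
    # Plain causal SDPA serves as the ground truth.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True, scale=scale)

def check_backend_output(backend_out, q, k, v, scale, atol=2e-2, rtol=2e-2):
    # q, k, v laid out as [batch, num_heads, seq_len, head_dim] for SDPA.
    ref = reference_attention(q, k, v, scale)
    torch.testing.assert_close(backend_out, ref, atol=atol, rtol=rtol)
```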

Benchmark

In addition, I used the following commands to run the LLM service and benchmark the XFormers backend against the FlashAttention 3 and Triton backends:
Server

export VLLM_TORCH_PROFILER_DIR=~/traces/vllm
export LLAMA_MODEL=meta-llama/Llama-3.1-8B-Instruct
export DRAFT_MODEL=yuhuili/EAGLE-LLaMA3.1-Instruct-8B
export VLLM_USE_V1=1
export VLLM_ATTENTION_BACKEND=<backend>
python -m vllm.entrypoints.openai.api_server --model $LLAMA_MODEL --disable-log-requests --tensor-parallel-size=1 --max-num-seqs=64 --max-model-len=32768 --block-size=128 --no-enable-prefix-caching --speculative-config="$SPEC_DEC_CONFIG" 2>&1 | tee ~/server_logs/vllm_server.log

Client

export LLAMA_MODEL=meta-llama/Llama-3.1-8B-Instruct
python benchmarks/benchmark_serving.py --model $LLAMA_MODEL --tokenizer $LLAMA_MODEL --host 0.0.0.0 --dataset-name random --ignore-eos --request-rate inf --random-input-len 1000 --random-output-len 300 --max-concurrency 64 --num-prompts 128

Results

| Serving Benchmark Result | Flash Attention 3 | Triton Attention | XFormers Attention |
|---|---|---|---|
| Successful requests | 128 | 128 | 128 |
| Benchmark duration (s) | 15.94 | 13.23 | 13.22 |
| Total input tokens | 127731 | 127731 | 127731 |
| Total generated tokens | 38400 | 38400 | 38400 |
| Request throughput (req/s) | 8.03 | 9.68 | 9.68 |
| Output token throughput (tok/s) | 2408.88 | 2903.54 | 2905.24 |
| Total token throughput (tok/s) | 10421.59 | 12561.66 | 12569.01 |
| Time to First Token | | | |
| Mean TTFT (ms) | 894.77 | 920.44 | 929.93 |
| Median TTFT (ms) | 856.32 | 769.44 | 776.85 |
| P99 TTFT (ms) | 1817.60 | 2063.83 | 2080.56 |
| Time per Output Token (excl. 1st token) | | | |
| Mean TPOT (ms) | 23.49 | 18.94 | 18.87 |
| Median TPOT (ms) | 22.67 | 19.34 | 19.24 |
| P99 TPOT (ms) | 30.50 | 21.57 | 21.51 |
| Inter-token Latency | | | |
| Mean ITL (ms) | 23.49 | 18.94 | 18.87 |
| Median ITL (ms) | 15.14 | 15.48 | 15.35 |
| P99 ITL (ms) | 222.39 | 252.56 | 256.01 |

The v1 XFormers backend performs better than FA3, and is on par with Triton split-k.
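
As a quick sanity check, the throughput rows follow directly from the request/token counts and the benchmark duration; for the XFormers column:

```python
# XFormers column: 128 requests, 127731 input and 38400 output tokens in 13.22 s.
requests, input_tokens, output_tokens, duration_s = 128, 127731, 38400, 13.22

req_throughput = requests / duration_s                              # ~9.68 req/s
output_tok_throughput = output_tokens / duration_s                  # ~2905 tok/s
total_tok_throughput = (input_tokens + output_tokens) / duration_s  # ~12566 tok/s
```

The small differences from the table (e.g. 12569.01 vs ~12566) come from the duration being rounded to two decimals here.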

mergify bot added the v1 label Jul 22, 2025

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request ports the xformers attention backend to the v1 engine, including the implementation, tests, and wiring it into the system. The implementation correctly splits logic for prefill and decode phases for optimization. My review identified two high-severity issues: one is a naming inconsistency for the new backend that would prevent it from being selected correctly, and the other is an inadequate test case in the new test file that only covers the decode path, leaving the prefill path untested. I've provided suggestions to fix both issues.
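
On the first issue (the naming inconsistency), backend selection is driven by matching the string in VLLM_ATTENTION_BACKEND against the name the backend is registered under, so the two must agree exactly. A simplified, generic sketch of that failure mode, not vLLM's actual selector code:

```python
import os

# Hypothetical registry: env-var string -> backend class name.
_BACKEND_REGISTRY = {
    "FLASH_ATTN": "FlashAttentionBackend",
    "XFORMERS": "XFormersAttentionBackend",
}

def resolve_backend() -> str:
    name = os.environ.get("VLLM_ATTENTION_BACKEND", "FLASH_ATTN")
    if name not in _BACKEND_REGISTRY:
        # If the backend registers under a different string than users pass in
        # the env var, it can never be selected; this is the bug class flagged above.
        raise ValueError(f"Unknown attention backend: {name}")
    return _BACKEND_REGISTRY[name]
```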

@TheEpicDolphin force-pushed the xformers_attention_v1 branch 2 times, most recently from 82f774a to 137c8c1 on July 22, 2025 at 01:22

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: Giancarlo Delfin <[email protected]>
Comment on lines +399 to +417
unified_attention(
q=query[num_decode_tokens:num_actual_tokens],
k=key_cache,
v=value_cache,
out=output[num_decode_tokens:num_actual_tokens],
cu_seqlens_q=prefill_meta.query_start_loc,
max_seqlen_q=prefill_meta.max_query_len,
seqused_k=prefill_meta.seq_lens,
max_seqlen_k=prefill_meta.max_seq_len,
softmax_scale=self.scale,
causal=True,
alibi_slopes=self.alibi_slopes,
window_size=self.sliding_window,
block_table=prefill_meta.block_table,
softcap=self.logits_soft_cap,
q_descale=None, # Not supported
k_descale=layer._k_scale.expand(descale_shape),
v_descale=layer._v_scale.expand(descale_shape),
)
@WoosukKwon (Collaborator) commented Jul 23, 2025

QQ: Why does it fall back to the Triton kernel? IIRC, the Triton kernel here is not very well optimized.

@TheEpicDolphin (Author) replied:

Thanks for the info. Would you recommend using FA3 instead?

self._num_decode_tokens = 0

def reorder_batch(self, input_batch: "InputBatch",
scheduler_output: "SchedulerOutput") -> bool:
A collaborator commented:

Is there a reason we can't use reorder_batch_to_split_decodes_and_prefills in vllm/v1/attention/backends/utils.py here, like in FlashInfer:

return reorder_batch_to_split_decodes_and_prefills(input_batch,

@TheEpicDolphin (Author) replied:

This must have been added after I started working on this PR. Thanks, I will use this.
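
For context, the swap would roughly amount to delegating reorder_batch to the shared utility, as the FlashInfer backend does. A hedged sketch, assuming the helper takes the input batch and scheduler output as in the FlashInfer line quoted above (its exact signature may differ):

```python
from vllm.v1.attention.backends.utils import (
    reorder_batch_to_split_decodes_and_prefills)

def reorder_batch(self, input_batch: "InputBatch",
                  scheduler_output: "SchedulerOutput") -> bool:
    # Let the shared helper group decode requests ahead of prefills instead of
    # re-implementing the reordering inside the XFormers metadata builder.
    return reorder_batch_to_split_decodes_and_prefills(input_batch,
                                                       scheduler_output)
```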


@staticmethod
def get_supported_head_sizes() -> list[int]:
return [32, 64, 96, 128, 160, 192, 224, 256]
A collaborator commented:

Does xFormers support more head sizes than this? It might be a nice option as an alternative for head size 80 (which currently falls back to FlexAttention).

@TheEpicDolphin (Author) replied:

Thanks for catching this; it turns out xformers supports many more head sizes.
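
One hedged way to check empirically whether xformers handles a given head size is to attempt a tiny memory_efficient_attention call and catch the failure if no operator supports it (a probing sketch, not what this PR does):

```python
import torch
import xformers.ops as xops

def xformers_supports_head_size(head_dim: int) -> bool:
    """Probe support by running a tiny attention call with the given head_dim."""
    q = torch.randn(1, 8, 1, head_dim, device="cuda", dtype=torch.float16)
    k = torch.randn(1, 8, 1, head_dim, device="cuda", dtype=torch.float16)
    v = torch.randn(1, 8, 1, head_dim, device="cuda", dtype=torch.float16)
    try:
        xops.memory_efficient_attention(q, k, v)
        return True
    except (NotImplementedError, ValueError, RuntimeError):
        return False

# e.g. head size 80, the FlexAttention fallback case mentioned above:
# print(xformers_supports_head_size(80))
```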
