[V1] port xformers backend to v1 #21342
Conversation
Code Review
This pull request ports the xformers attention backend to the v1 engine, covering the implementation, tests, and the wiring needed to select it. The implementation correctly splits the prefill and decode phases for optimization. My review identified two high-severity issues: a naming inconsistency for the new backend that would prevent it from being selected correctly, and an inadequate test case in the new test file that only covers the decode path, leaving the prefill path untested. I've provided suggestions to fix both issues.
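To make the naming concern concrete, here is a minimal sketch of the constraint being described; the class name, method, and identifier below are illustrative assumptions, not code from this PR:

# Hypothetical sketch: the string a backend reports as its name must match
# the identifier the attention-backend selector looks up, otherwise the
# backend can never be selected.
class XFormersAttentionBackend:

    @staticmethod
    def get_name() -> str:
        # Must exactly match the registry/enum entry used during backend
        # selection; the value below is assumed for illustration.
        return "XFORMERS_VLLM_V1"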
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
Signed-off-by: Giancarlo Delfin <[email protected]>
unified_attention(
    q=query[num_decode_tokens:num_actual_tokens],
    k=key_cache,
    v=value_cache,
    out=output[num_decode_tokens:num_actual_tokens],
    cu_seqlens_q=prefill_meta.query_start_loc,
    max_seqlen_q=prefill_meta.max_query_len,
    seqused_k=prefill_meta.seq_lens,
    max_seqlen_k=prefill_meta.max_seq_len,
    softmax_scale=self.scale,
    causal=True,
    alibi_slopes=self.alibi_slopes,
    window_size=self.sliding_window,
    block_table=prefill_meta.block_table,
    softcap=self.logits_soft_cap,
    q_descale=None,  # Not supported
    k_descale=layer._k_scale.expand(descale_shape),
    v_descale=layer._v_scale.expand(descale_shape),
)
QQ: Why does it fall back to the Triton kernel? IIRC, the Triton kernel here is not very well optimized.
Thanks for the info. Would you recommend using FA3 instead?
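For reference, a rough sketch of what switching the prefill path to FlashAttention might look like, reusing the same arguments as the unified_attention call quoted above. The wrapper import and the fa_version keyword are assumptions based on other vLLM backends, not code from this PR:

# Hypothetical alternative to the Triton unified_attention call for the
# prefill tokens; argument names mirror the call quoted above.
from vllm.vllm_flash_attn import flash_attn_varlen_func

flash_attn_varlen_func(
    q=query[num_decode_tokens:num_actual_tokens],
    k=key_cache,
    v=value_cache,
    out=output[num_decode_tokens:num_actual_tokens],
    cu_seqlens_q=prefill_meta.query_start_loc,
    max_seqlen_q=prefill_meta.max_query_len,
    seqused_k=prefill_meta.seq_lens,
    max_seqlen_k=prefill_meta.max_seq_len,
    softmax_scale=self.scale,
    causal=True,
    alibi_slopes=self.alibi_slopes,
    window_size=self.sliding_window,
    block_table=prefill_meta.block_table,
    softcap=self.logits_soft_cap,
    fa_version=3,  # assumption: FA3 is available on the target hardware
    k_descale=layer._k_scale.expand(descale_shape),
    v_descale=layer._v_scale.expand(descale_shape),
)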
        self._num_decode_tokens = 0

    def reorder_batch(self, input_batch: "InputBatch",
                      scheduler_output: "SchedulerOutput") -> bool:
Is there a reason we can't use reorder_batch_to_split_decodes_and_prefills from vllm/v1/attention/backends/utils.py here, like in FlashInfer?
vllm/v1/attention/backends/flashinfer.py, line 244 in b77c7d3:
    return reorder_batch_to_split_decodes_and_prefills(input_batch,
This must have been added after I started working on this PR. Thanks, I will use this.
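A minimal sketch of that delegation, mirroring the FlashInfer line quoted above; the decode_threshold value is an assumption:

    def reorder_batch(self, input_batch: "InputBatch",
                      scheduler_output: "SchedulerOutput") -> bool:
        # Delegate to the shared helper in vllm/v1/attention/backends/utils.py
        # instead of reimplementing the decode/prefill split in this backend.
        return reorder_batch_to_split_decodes_and_prefills(
            input_batch, scheduler_output, decode_threshold=1)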
    @staticmethod
    def get_supported_head_sizes() -> list[int]:
        return [32, 64, 96, 128, 160, 192, 224, 256]
Does xFormers support more head sizes than this? It might be a nice option as an alternative for head size 80 (which currently falls back to FlexAttention).
Thanks for catching this. It turns out xFormers supports a lot of head sizes.
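For illustration, a small sketch of how such a list is typically enforced on the backend class; the method name and error message here are assumptions, not code from this PR:

    @classmethod
    def validate_head_size(cls, head_size: int) -> None:
        # Hypothetical check: reject head sizes that are not in the
        # backend's supported list.
        supported = cls.get_supported_head_sizes()
        if supported and head_size not in supported:
            raise ValueError(
                f"Head size {head_size} is not supported by {cls.__name__}. "
                f"Supported head sizes: {supported}.")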
Purpose
Port over the xformers backend to the v1 engine. There are several benefits to using XFormers, including:
Test Plan
Added a test case to test_attention_backends which verifies the correctness of the xformers v1 backend's attention output.
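A rough sketch of how the new backend could be folded into that test's parameterization; the constant and enum names below are assumptions, not the PR's actual test code:

# Hypothetical: extend the set of backends exercised by
# test_attention_backends so the xFormers v1 backend is checked against the
# reference output alongside the existing backends.
BACKENDS_TO_TEST = [
    _Backend.FLASH_ATTN_VLLM_V1,
    _Backend.FLASHINFER_VLLM_V1,
    _Backend.XFORMERS_VLLM_V1,  # name assumed for the new backend
]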
Benchmark
In addition, I used the following commands to run the LLM service and benchmark the XFormers backend against FlashAttentionBackend:
Server
Client
Results
The v1 XFormers backend performs better than FA3, and is on par with Triton split-k.