Reduce block sizes to not run out of shared memory for flex + fp32 #23853
Conversation
Signed-off-by: drisspg <[email protected]>
Force-pushed from 69eb2b6 to 8c682cb
Code Review
This pull request aims to fix an out-of-shared-memory error when using FlexAttention with fp32 on GPUs with limited shared memory. The change replaces a hardcoded block size with a relative reduction, which is a good improvement. My review includes a suggestion to make this logic more robust by dynamically calculating the shared memory requirement instead of relying on a magic number. This will prevent future errors and improve performance across different models.
kernel_options["BLOCK_M"] = kernel_options["BLOCK_M"] // 2 | ||
kernel_options["BLOCK_N"] = kernel_options["BLOCK_N"] // 2 |
While halving the block size is a good improvement over the previous hardcoded value, the condition that triggers this reduction is based on a magic number (144 * 1024). This appears to be a heuristic for a worst-case scenario (e.g., head_size=512), which makes the logic brittle.

This can lead to two problems:
- Correctness: If a model with a head_size larger than anticipated is used, it could still lead to an out-of-shared-memory error.
- Performance: For models with smaller head_size, the block size might be reduced unnecessarily, leading to suboptimal performance.

A more robust approach would be to dynamically calculate the estimated shared memory requirement based on the actual head_size and dtype of the query tensor. This would make the logic more resilient and performant across different model architectures.

Here is a suggested implementation that would replace lines 790-792:
if torch.cuda.is_available():
    device_props = torch.cuda.get_device_properties(query.device)
    max_shared_memory = device_props.shared_memory_per_block_optin
    head_size = query.shape[-1]
    dtype_size = query.element_size()
    block_m = kernel_options["BLOCK_M"]
    block_n = kernel_options["BLOCK_N"]
    # Estimate shared memory for Q, K, and softmax stats (m, l).
    # This is based on common Triton flash attention implementations.
    required_smem = (block_m + block_n) * head_size * dtype_size + (block_m * block_n * 4)
    if required_smem > max_shared_memory:
        kernel_options["BLOCK_M"] //= 2
        kernel_options["BLOCK_N"] //= 2
Signed-off-by: Huy Do <[email protected]>
Please fix pre-commit
Thanks @drisspg for the fix here. This has been merged via #20358 after @youkaichao merged the PR earlier. #20358 has your change to test it out with 2.8.0. I guess we can close this PR.
This pull request has merge conflicts that must be resolved before it can be merged.
Purpose
Seeing out-of-shared-memory (smem) errors with fp32 on L4 CI.
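For context, here is a minimal sketch (not part of this PR) for checking the opt-in shared memory available per block on the current GPU, which is the budget the fp32 FlexAttention tiles must fit into; the roughly 100 KB figure mentioned for L4 is an assumption based on its sm_89 architecture, not a value taken from this PR:

import torch

# Minimal sketch: report the opt-in shared-memory-per-block limit of the
# current CUDA device. The ~100 KB figure for L4 is an assumption, not a
# value taken from this PR.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(torch.cuda.current_device())
    smem_kib = props.shared_memory_per_block_optin / 1024
    print(f"{props.name}: {smem_kib:.0f} KiB shared memory per block (opt-in)")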
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
- supported_models.md and examples for a new model.