Conversation

@drisspg (Contributor) commented Aug 28, 2025

Purpose

Seeing out-of-shared-memory (smem) failures with fp32 on L4 GPUs in CI.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as the test command to run.
  • The test results, such as a before/after comparison or e2e results.
  • (Optional) Any necessary documentation updates, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft in the Google Doc.

@drisspg drisspg force-pushed the reduce-block-size-on-l4 branch from 69eb2b6 to 8c682cb on August 28, 2025 18:49
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to fix an out-of-shared-memory error when using FlexAttention with fp32 on GPUs with limited shared memory. The change replaces a hardcoded block size with a relative reduction, which is a good improvement. My review includes a suggestion to make this logic more robust by dynamically calculating the shared memory requirement instead of relying on a magic number. This will prevent future errors and improve performance across different models.

Comment on lines +791 to +792
kernel_options["BLOCK_M"] = kernel_options["BLOCK_M"] // 2
kernel_options["BLOCK_N"] = kernel_options["BLOCK_N"] // 2

Severity: high

While halving the block size is a good improvement over the previous hardcoded value, the condition that triggers this reduction is based on a magic number (144 * 1024). This appears to be a heuristic for a worst-case scenario (e.g., head_size=512), which makes the logic brittle.
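
For readers without the surrounding diff, the guard in question presumably has roughly this shape (a hypothetical reconstruction; max_shared_memory and the comparison direction are assumptions, only the 144 * 1024 constant and the halving come from this thread):

    # Hypothetical reconstruction of the guard at lines 790-792; the
    # exact condition is not visible in this thread.
    if max_shared_memory < 144 * 1024:
        kernel_options["BLOCK_M"] = kernel_options["BLOCK_M"] // 2
        kernel_options["BLOCK_N"] = kernel_options["BLOCK_N"] // 2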

This can lead to two problems:

  1. Correctness: If a model with a head_size larger than anticipated is used, it could still lead to an out-of-shared-memory error.
  2. Performance: For models with smaller head_size, the block size might be reduced unnecessarily, leading to suboptimal performance.

A more robust approach would be to dynamically calculate the estimated shared memory requirement based on the actual head_size and dtype of the query tensor. This would make the logic more resilient and performant across different model architectures.

Here is a suggested implementation that would replace lines 790-792:

        if torch.cuda.is_available():
            device_props = torch.cuda.get_device_properties(
                torch.cuda.current_device())
            max_shared_memory = device_props.shared_memory_per_block_optin

            head_size = query.shape[-1]
            dtype_size = query.element_size()

            block_m = kernel_options["BLOCK_M"]
            block_n = kernel_options["BLOCK_N"]

            # Estimate shared memory: the Q and K tiles plus an fp32
            # (4-byte) score buffer of shape (BLOCK_M, BLOCK_N), as in
            # common Triton flash-attention implementations.
            required_smem = (block_m + block_n) * head_size * dtype_size + (
                block_m * block_n * 4)

            if required_smem > max_shared_memory:
                kernel_options["BLOCK_M"] //= 2
                kernel_options["BLOCK_N"] //= 2

huydhn added a commit to huydhn/vllm that referenced this pull request Aug 28, 2025
@WoosukKwon WoosukKwon added the ready label (ONLY add when PR is ready to merge/full CI is needed) Aug 28, 2025
@WoosukKwon WoosukKwon enabled auto-merge (squash) August 28, 2025 21:33
@DarkLight1337 (Member) commented

Please fix pre-commit

@huydhn huydhn mentioned this pull request Aug 29, 2025
@huydhn (Contributor) commented Aug 29, 2025

Thanks @drisspg for the fix here. This has been merged via #20358, which @youkaichao merged earlier; #20358 includes your change so it can be tested with 2.8.0. I guess we can close this PR.

mergify bot commented Aug 30, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @drisspg.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 30, 2025
@drisspg drisspg closed this Aug 30, 2025
auto-merge was automatically disabled August 30, 2025 16:36

Pull request was closed

Labels: needs-rebase, ready, v1