[AMD][BugFix] Fix omission of wvSplitK kernel for small batch sizes (1-4) due to torch.compile #21350

Conversation
Signed-off-by: Randall Smith <[email protected]>
Code Review

This pull request addresses an issue with torch.compile on ROCm platforms where a specific kernel call (wvSplitK) was being omitted. The fix involves refactoring rocm_unquantized_gemm into a custom PyTorch operator, which is a sound approach to working around compiler issues. My review identified a critical issue in the 'fake' implementation of this new custom operator, which could lead to incorrect shape inference for input tensors with more than two dimensions. I've provided a suggestion to fix this.
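For reference, a minimal sketch of a 'fake' (meta) implementation that keeps shape inference correct for inputs with any number of leading dimensions; the name and signature here mirror the PR's helper but are illustrative, not the merged code:

from typing import Optional
import torch

def rocm_unquantized_gemm_impl_fake(
        x: torch.Tensor,
        weight: torch.Tensor,
        bias: Optional[torch.Tensor] = None) -> torch.Tensor:
    # Preserve every leading (batch) dim of x; only the last dim changes
    # from in_features to out_features (weight is [out_features, in_features]).
    return x.new_empty((*x.shape[:-1], weight.shape[0]))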
Signed-off-by: Randall Smith <[email protected]>
How does this bug manifest on main? It sounds like we just get a wrong answer when the dispatching logic decides to run skinny_gemm in V1? Do you have a minimal reproducible example?
vllm/model_executor/layers/utils.py (outdated)

def rocm_unquantized_gemm(layer: torch.nn.Module,
                          x: torch.Tensor,
                          weight: torch.Tensor,
                          bias: Optional[torch.Tensor] = None):
Nit: Let's add the return type here.
Done
Signed-off-by: Randall Smith <[email protected]>
I'm not sure what the root cause is, or whether it relates to how torch.compile or vLLM's internal compilation logic is implemented, but you can see the bug by running almost any model; I used Llama-3.1-8B-Instruct for testing. Then just run the profiler and see how many times the function is called.
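A minimal profiling sketch along these lines shows the kernel call counts; run_decode_steps() is a hypothetical stand-in for driving generation at batch sizes 1-4 with the model under test:

from torch.profiler import profile, ProfilerActivity

# Count wvSplitK launches in the kernel summary; on ROCm, GPU (HIP)
# kernels are reported under the CUDA activity type.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    run_decode_steps()  # hypothetical: generate tokens at batch sizes 1-4
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))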
Ok, so if I'm understanding correctly, the problem is that even when all of the conditions for skinny gemm are true, the compiled graph never actually calls it. Before I accept, can you just quickly verify that this issue persists even after you clear out both torch.compile caches?
I think this is a reasonable fix for now, but it would be good to understand what's going on here.
@@ -97,6 +98,29 @@ def rocm_unquantized_gemm(layer: torch.nn.Module,
     return torch.nn.functional.linear(x, weight, bias)


def rocm_unquantized_gemm_impl_fake(
I mentioned this offline, but it is generally better to put only the minimal logic needed into the custom op. In this situation I think that's the following (plus some dependencies):
if m > 8 and 0 < n <= 4:
    out = ops.wvSplitK(weight, x_view, cu_count)
    return out.view(*x.shape[:-1], weight.shape[0])
elif m % 4 == 0 and n == 1 and k <= 8192:
    out = ops.LLMM1(weight, x_view, 4)
    return out.view(*x.shape[:-1], weight.shape[0])
else:
    return torch.nn.functional.linear(x, weight, bias)
The reason is that torch.compile may be able to optimize the nn.Linear calls. For example, it can select different matmul kernels, or fuse nearby operations into them (if there are fusable operations nearby).
That being said, it's not clear to me how much torch.compile is able to do for matmuls on ROCm, so feel free to ship this as-is.
To answer the overall question, @SageMoore: the way vLLM uses torch.compile is to capture one single graph that works for all batch sizes. If there are branches on the shape in the graph, then graph capture will pick one of the branches (the one taken for the batch size vLLM uses to perform the initial graph capture), and the resulting graph will only be correct under that branch's condition. In this case the condition is batch_size > 4. The workarounds generally involve keeping shape-dependent branching out of the captured graph.
In this case the branching is for kernel selection, so hiding the logic in a custom operator seems reasonable (see the sketch below).
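To illustrate the workaround, here is a minimal, hypothetical sketch using torch.library.custom_op (PyTorch 2.4+); the PR itself uses vLLM's direct_register_custom_op helper instead, and the op name and bodies below are assumptions. Because the dispatch lives inside the op, graph capture records a single opaque call and the branch is evaluated on every invocation:

import torch

@torch.library.custom_op("mylib::dispatching_gemm", mutates_args=())
def dispatching_gemm(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # The branch runs at call time; torch.compile sees one opaque op,
    # so neither side gets baked into the captured graph.
    if x.shape[0] <= 4:
        # small-batch path; the real code would call the wvSplitK kernel here
        return torch.matmul(x, weight.t())
    return torch.nn.functional.linear(x, weight)

@dispatching_gemm.register_fake
def _(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Shape-only implementation; must be valid for every runtime branch.
    return x.new_empty((*x.shape[:-1], weight.shape[0]))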
It turns out that torch.compile was omitting calls to wvSplitK, our skinny GEMM kernel: after compilation, the kernel was never invoked at all.
Several attempts were made to fix this, such as lifting the conditional logic in rocm_unquantized_gemm into its caller, converting wvSplitK itself into a custom op, and adjusting the logic inside rocm_unquantized_gemm.
What fixes the problem is converting rocm_unquantized_gemm into a custom op via direct_register_custom_op with register_fake. The fix was confirmed with profiler runs.
Note: this affects small batch sizes only (batch sizes 1-4).
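For readers unfamiliar with the helper, the registration follows roughly this shape. This is a hedged sketch: the argument names of vLLM's direct_register_custom_op and the op name here are assumptions, not copied from the diff.

from vllm.utils import direct_register_custom_op

# Sketch: register the eager ROCm dispatch logic together with a fake
# (shape-only) implementation, so torch.compile traces a single opaque
# op instead of baking one branch into the captured graph.
direct_register_custom_op(
    op_name="rocm_unquantized_gemm_impl",       # assumed op name
    op_func=rocm_unquantized_gemm_impl,         # eager impl with the branching
    mutates_args=[],
    fake_impl=rocm_unquantized_gemm_impl_fake,  # shape inference for tracing
)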
Pertinent output from the profiler runs is below; the call count is the last field on each line.

*** Without direct_register_custom_op ***
void wvSplitK_hf_sml_<__hip_bfloat16, 64, 2, 16, 8, ...  0.00%  0.000us  0.00%  0.000us  0.000us  57.828ms   4.10%  57.828ms  225.892us  256

*** With direct_register_custom_op ***
void wvSplitK_hf_sml_<__hip_bfloat16, 64, 2, 16, 8, ...  0.00%  0.000us  0.00%  0.000us  0.000us  948.566ms  70.23%  948.566ms  28.835us   32896