@gbyu-amd gbyu-amd commented Oct 30, 2025

Purpose

Two MHA kernel implementations are available on ROCm GPUs: the Triton kernel and the AITER kernels (CK and asm, dispatched inside AITER). We benchmarked both and observed performance gaps in various cases. Based on those results, this PR adds a simple dispatch logic.
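For illustration only, here is a minimal sketch of what a sequence-length-based dispatch between the two backends could look like. The names `MhaBackend`, `AITER_MIN_SEQ_LEN`, and `select_mha_backend`, as well as the 4096 threshold, are assumptions made for this example and are not taken from the PR's code; the actual rule in this PR may differ.

```python
from enum import Enum


class MhaBackend(Enum):
    """Hypothetical labels for the two MHA kernel families."""
    TRITON = "triton"
    AITER = "aiter"


# Assumed threshold for this sketch; not taken from the PR.
AITER_MIN_SEQ_LEN = 4096


def select_mha_backend(max_seq_len: int, aiter_available: bool = True) -> MhaBackend:
    """Pick a backend from the longest sequence length in the batch.

    Falls back to Triton when AITER is unavailable or the sequence is short.
    """
    if aiter_available and max_seq_len >= AITER_MIN_SEQ_LEN:
        return MhaBackend.AITER
    return MhaBackend.TRITON
```

A rule like this deliberately ignores batch size, so it trades some accuracy for simplicity.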

Benchmark result

For now we benchmark only the DeepSeek-V3 shapes in a TP8 scenario on MI355, i.e., num_heads=16, qk_head_dim=192, v_head_dim=128. seq_len ranges from 1k to 64k and batch size (BS) from 1 to 64.
Here the AITER kernel corresponds to the FA3 asm kernel. The asm kernel is faster for seq_len of 4k and above (1.06x to 2.76x), with the exception of 4k at BS=4, while the Triton kernel is faster at a short seq_len such as 1k for batch sizes 4 to 32.

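| q_seqlen/kv_seqlen | BS | triton time (ms) | triton TFLOPS | aiter time (ms) | aiter TFLOPS | aiter vs. triton time speedup |
|---|---|---|---|---|---|---|
| 1k/1k | 1 | 0.077 | 68.81 | 0.024 | 219.89 | 3.21 ↗️ |
| 1k/1k | 4 | 0.091 | 230.37 | 0.114 | 184.22 | 0.80 ↘️ |
| 1k/1k | 8 | 0.165 | 255.03 | 0.345 | 122.48 | 0.48 ↘️ |
| 1k/1k | 16 | 0.269 | 316.36 | 0.688 | 122.59 | 0.39 ↘️ |
| 1k/1k | 32 | 0.472 | 361.22 | 1.554 | 108.59 | 0.30 ↘️ |
| 1k/1k | 64 | 0.885 | 380.87 | 0.737 | 455.66 | 1.20 ↗️ |
| 4k/4k | 1 | 0.251 | 337.87 | 0.091 | 940.63 | 2.76 ↗️ |
| 4k/4k | 4 | 0.631 | 543.77 | 1.009 | 338.89 | 0.63 ↘️ |
| 4k/4k | 8 | 1.117 | 615.03 | 0.925 | 739.65 | 1.21 ↗️ |
| 4k/4k | 16 | 2.107 | 649.27 | 1.894 | 722.04 | 1.11 ↗️ |
| 4k/4k | 32 | 4.1 | 634.36 | 3.855 | 709.46 | 1.06 ↗️ |
| 4k/4k | 64 | 8.145 | 671.30 | 7.703 | 709.98 | 1.06 ↗️ |
| 8k/8k | 1 | 0.686 | 499.63 | 0.289 | 1182.65 | 2.37 ↗️ |
| 8k/8k | 4 | 1.965 | 697.49 | 1.187 | 1155.99 | 1.66 ↗️ |
| 8k/8k | 8 | 4.453 | 715.24 | 2.394 | 1144.85 | 1.86 ↗️ |
| 8k/8k | 16 | 8.715 | 633.78 | 4.838 | 1133.41 | 1.80 ↗️ |
| 8k/8k | 32 | 14.811 | 653.70 | 9.895 | 1108.36 | 1.50 ↗️ |
| 8k/8k | 64 | 29.418 | 743.13 | 20.526 | 1069.04 | 1.43 ↗️ |
| 64k/64k | 1 | 31.981 | 680.53 | 17.612 | 1248.16 | 1.82 ↗️ |
| 64k/64k | 4 | 120.059 | 716.06 | 70.283 | 1251.04 | 1.71 ↗️ |
| 64k/64k | 8 | 250.338 | 700.49 | 140.201 | 1254.40 | 1.79 ↗️ |
| 64k/64k | 16 | 498.031 | 767.19 | 280.264 | 1254.98 | 1.78 ↗️ |
| 64k/64k | 32 | 997.827 | 706.07 | 560.843 | 1254.31 | 1.78 ↗️ |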
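The speedup column is the Triton time divided by the AITER time. As a rough sanity check on the TFLOPS columns, the sketch below also estimates attention throughput from the benchmarked shapes. The PR does not state its FLOP formula, so the formula here (two matmuls, halved for causal masking) is an assumption; the helper names are hypothetical. Under that assumption, the first row comes out at about 69.7 TFLOPS against the reported 68.81, which is consistent given that the times are rounded.

```python
def attention_flops(batch: int, num_heads: int, q_len: int, kv_len: int,
                    qk_head_dim: int, v_head_dim: int, causal: bool = True) -> int:
    """Estimate forward-pass FLOPs for attention.

    Counts QK^T (qk_head_dim) and PV (v_head_dim) matmuls at 2 FLOPs per
    multiply-add. Halving for causal masking is an assumption of this sketch.
    """
    flops = 2 * batch * num_heads * q_len * kv_len * (qk_head_dim + v_head_dim)
    return flops // 2 if causal else flops


def tflops(flops: int, time_ms: float) -> float:
    """Convert a FLOP count and a time in milliseconds to TFLOPS."""
    return flops / (time_ms * 1e-3) / 1e12


def speedup(triton_ms: float, aiter_ms: float) -> float:
    """AITER vs. Triton speedup; values above 1 mean AITER is faster."""
    return triton_ms / aiter_ms


# First table row: 1k/1k, BS=1, DeepSeek-V3 TP8 shapes.
row_flops = attention_flops(1, 16, 1024, 1024, 192, 128)
print(round(speedup(0.077, 0.024), 2))
print(round(tflops(row_flops, 0.077), 2))
```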

Test Result



Signed-off-by: guanbao <[email protected]>
@gbyu-amd gbyu-amd force-pushed the guanbao/mha_dispatch branch from 7ab33a4 to d08ac9f Compare October 30, 2025 03:08