[MHA] add mha dispatch logic #776
Open
Purpose
Two different MHA kernel implementations are available on ROCm GPUs: the Triton kernel and the AITER kernels (including CK and ASM, dispatched inside AITER). We benchmarked their performance and observed a gap in various cases; based on these results, this PR adds a simple dispatch logic.
Benchmark result
We benchmark the shapes from DeepSeek-V3 in the TP8 scenario on MI355 for now, i.e., num_heads=16, qk_head_dim=192, v_head_dim=128. The seq_len ranges from 1k to 64k and the batch size from 1 to 64.
Here the aiter kernel corresponds to the FA3 ASM kernel. Overall, the ASM kernel demonstrates superior performance for seq_len of 4k and above, while the Triton kernel performs better at relatively short seq_len such as 1k.
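For illustration, below is a minimal sketch of what a seq_len-based dispatch could look like. The function name, threshold constant, and backend strings are hypothetical and only reflect the benchmark observations above; they are not the actual implementation in this PR.

```python
# Hypothetical sketch of the seq_len-based dispatch described above.
# The 4k cutoff reflects the benchmark observation on MI355 that the
# AITER FA3 asm kernel wins at seq_len >= 4k, while the Triton kernel
# wins at shorter sequence lengths (e.g. ~1k).

SEQ_LEN_THRESHOLD = 4096  # assumed cutoff between triton and aiter kernels


def select_mha_backend(seq_len: int) -> str:
    """Pick an MHA backend based on sequence length (illustrative helper)."""
    if seq_len >= SEQ_LEN_THRESHOLD:
        # Long sequences: the AITER FA3 asm kernel was faster in benchmarks.
        return "aiter"
    # Short sequences: the Triton kernel performed better.
    return "triton"


if __name__ == "__main__":
    print(select_mha_backend(1024))  # -> "triton"
    print(select_mha_backend(8192))  # -> "aiter"
```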
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.