@qjia7 commented Oct 17, 2025

This pull request updates the FlashAttention WebGPU implementation to better support indirect dispatch. When indirect dispatch is used, the shader now reads the actual workgroup dimensions from an input buffer instead of relying on built-in variables, which avoids the duplication overhead in Dawn/WebGPU. See https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/ComputePassEncoder.cpp;l=275.
This fixes the issue where indirect dispatch was slower than normal dispatch for the same program.
With this change, phi4 with graph capture enabled improves from 125 tps to 145 tps on an NV 5080.
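
A minimal sketch of the idea, not the code in this PR: a small C++ helper that emits WGSL for either dispatch mode. The helper name `BuildShader`, the binding name `indirect_args`, and the buffer layout (three `u32` values x, y, z) are all hypothetical. It also assumes the built-in being avoided is `num_workgroups` (suggested by the branch name) and that the buffer holding the dispatch arguments is additionally bound as a read-only storage buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not an ONNX Runtime API): emits a WGSL compute shader
// whose workgroup counts come either from the num_workgroups built-in
// (direct dispatch) or from a storage buffer (indirect dispatch).
std::string BuildShader(bool use_indirect_dispatch, uint32_t workgroup_size = 64) {
  assert(workgroup_size > 0);
  const std::string size = std::to_string(workgroup_size);
  std::string src;
  if (use_indirect_dispatch) {
    // Assumed layout: the same three u32 values (x, y, z) that are passed as
    // the indirect dispatch arguments, bound here as read-only storage.
    src += "@group(0) @binding(0) var<storage, read> indirect_args : array<u32, 3>;\n";
    src += "@compute @workgroup_size(" + size + ")\n";
    src += "fn main(@builtin(workgroup_id) wg_id : vec3<u32>) {\n";
    src += "  let num_wg = vec3<u32>(indirect_args[0], indirect_args[1], indirect_args[2]);\n";
  } else {
    // Direct dispatch: the built-in is used as usual.
    src += "@compute @workgroup_size(" + size + ")\n";
    src += "fn main(@builtin(workgroup_id) wg_id : vec3<u32>,\n";
    src += "        @builtin(num_workgroups) num_wg : vec3<u32>) {\n";
  }
  // Shared body: flatten the 3D workgroup id using the workgroup counts.
  src += "  let flat_wg = wg_id.x + wg_id.y * num_wg.x + wg_id.z * num_wg.x * num_wg.y;\n";
  src += "  _ = flat_wg;\n";
  src += "}\n";
  return src;
}
```

With `BuildShader(true)`, the generated shader never references the built-in, so under the assumptions above the runtime would have no reason to duplicate the dispatch parameters on its behalf; `BuildShader(false)` keeps the original direct-dispatch form.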

@qjia7 marked this pull request as ready for review October 17, 2025 05:48
@qjia7 requested review from fs-eire and guschmue October 17, 2025 05:48
@qjia7 requested a review from fs-eire October 17, 2025 08:17
@guschmue added the ep:WebGPU (ort-web webgpu provider) label Oct 20, 2025
@fs-eire merged commit 8ab27d9 into main Oct 20, 2025
106 of 117 checks passed
@fs-eire deleted the num_workgroups branch October 20, 2025 18:06