[webgpu] Fuse GeneratePositionIDs into FusedQKRotaryEmbedding #26335

xiaofeihan1 · 2025-10-17T05:52:40Z

This PR integrates GeneratePositionIDs into FusedQKRotaryEmbedding, improving overall performance.

Motivation and Context

Previously, for GQA, the processing flow was:
SplitPackedQKVProgram -> GeneratePositionIDs -> FusedQKRotaryEmbedding -> FlashAttention

After this change, the pipeline becomes:
SplitPackedQKVProgram -> FusedQKRotaryEmbedding -> FlashAttention

By fusing GeneratePositionIDs into FusedQKRotaryEmbedding, we remove one kernel launch from the GQA path, along with the memory traffic for the intermediate position IDs.
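To illustrate the idea, here is a minimal CPU-side sketch in Python, not the actual WGSL shader from this PR. It assumes that the position of token `s` is simply `past_len + s`, that the rotary layout is non-interleaved (half-split), and that the frequency base is 10000. All function names are hypothetical. The unfused path materializes a position-ID buffer in a separate pass, while the fused path derives the position inline.

```python
import math


def rope_row(row, pos, base=10000.0):
    """Rotate one head vector by position `pos` (half-split layout, assumed)."""
    half = len(row) // 2
    out = list(row)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / len(row))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = row[i], row[i + half]
        out[i] = x0 * c - x1 * s
        out[i + half] = x0 * s + x1 * c
    return out


def generate_position_ids(past_len, seq_len):
    """Separate pass: stands in for the standalone GeneratePositionIDs step."""
    return [past_len + s for s in range(seq_len)]


def rope_unfused(x, past_len):
    """Two passes: build the position-ID buffer, then read it back."""
    position_ids = generate_position_ids(past_len, len(x))
    return [rope_row(row, position_ids[s]) for s, row in enumerate(x)]


def rope_fused(x, past_len):
    """One pass: compute the position inline, no intermediate buffer."""
    return [rope_row(row, past_len + s) for s, row in enumerate(x)]


if __name__ == "__main__":
    q = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0]]
    print(rope_unfused(q, past_len=3) == rope_fused(q, past_len=3))
```

Both paths produce the same values. The fused version only skips the extra pass and the buffer, which is where the savings in kernel launches and memory operations would come from.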

On NV5080, generation throughput improves by ~4% (from 135.6 to 141.2 tokens per second).

xiaofeihan1 added the ep:WebGPU (ort-web webgpu provider) label on Oct 17, 2025

implement

ec3292b

xiaofeihan1 force-pushed the xiaofeihan/opt_generate_id branch from 0fb1cd4 to ec3292b on October 17, 2025 05:57
