xiaofeihan1
Contributor

@xiaofeihan1 commented Oct 17, 2025

Description

This PR fuses GeneratePositionIDs into FusedQKRotaryEmbedding, so the GQA path no longer runs position ID generation as a separate program. This improves generation throughput.

Motivation and Context

Previously, for GQA, the processing flow was:
SplitPackedQKVProgram -> GeneratePositionIDs -> FusedQKRotaryEmbedding -> FlashAttention

After this change, the pipeline becomes:
SplitPackedQKVProgram -> FusedQKRotaryEmbedding -> FlashAttention

Fusing GeneratePositionIDs into FusedQKRotaryEmbedding removes one program from the GQA pipeline, which reduces kernel launches and the memory operations that went with the separate step.
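
For readers unfamiliar with the fusion, here is a minimal Python sketch of the idea. It is not the WebGPU kernel code from this PR. The function names (`apply_rope`, `rotary_unfused`, `rotary_fused`), the single-head list layout, and the rule `position = past_len + token_index` are all assumptions made for illustration. The unfused version materializes a position ID buffer in one pass and reads it in a second pass; the fused version derives the position inline.

```python
import math


def apply_rope(vec, pos, base=10000.0):
    """Rotate one head vector to position `pos` (half-split pair layout, assumed)."""
    half = len(vec) // 2
    out = list(vec)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / len(vec))
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + half]
        out[i] = x0 * c - x1 * s
        out[i + half] = x0 * s + x1 * c
    return out


def rotary_unfused(q, k, past_len):
    # Pass 1 (stands in for GeneratePositionIDs): write an intermediate buffer.
    position_ids = [past_len + t for t in range(len(q))]
    # Pass 2 (stands in for FusedQKRotaryEmbedding): read the buffer back.
    q_out = [apply_rope(v, p) for v, p in zip(q, position_ids)]
    k_out = [apply_rope(v, p) for v, p in zip(k, position_ids)]
    return q_out, k_out


def rotary_fused(q, k, past_len):
    # Single pass: the position is computed inline, so no buffer is needed.
    q_out, k_out = [], []
    for t in range(len(q)):
        pos = past_len + t  # hypothetical position rule for this sketch
        q_out.append(apply_rope(q[t], pos))
        k_out.append(apply_rope(k[t], pos))
    return q_out, k_out
```

Both functions produce the same rotated Q and K. The only difference is whether the position IDs pass through an intermediate buffer, which is the step the fused pipeline drops.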

On NV5080, generation throughput improves by ~4% (from 135.6 tps to 141.2 tps).

@xiaofeihan1 added the ep:WebGPU (ort-web webgpu provider) label Oct 17, 2025
@xiaofeihan1 force-pushed the xiaofeihan/opt_generate_id branch from 0fb1cd4 to ec3292b October 17, 2025 05:57