
Conversation

@daijh (Contributor) commented Jul 15, 2025

Description

This PR enhances unidirectional `FlashAttention` by applying causal masking inside the main loop. The optimization eliminates unnecessary memory loads by skipping future entries in the KV cache.

Testing on Lunar Lake shows up to a 20% prefill performance improvement for `phi-4-mini-accuracy4` with a 4096-token prompt. Similar gains were observed for other models, including `Qwen3-0.6B-accuracy4`.

This PR also switches from the `is_gqa` flag to the more readable `unidirectional` attribute to control causal masking.
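
A minimal sketch of the idea, assuming a simplified single-head prefill in Python/NumPy (illustrative only, not the actual onnxruntime WebGPU shader this PR modifies; the function name `flash_prefill_causal` and the block size are made up): for a causal query block ending at row `qe`, the KV loop stops at `qe`, so K/V blocks beyond the query positions are never loaded, and only the diagonal block needs element-wise masking.

```python
# Sketch of causal masking applied inside the main KV loop of a
# flash-attention-style prefill. Names and block sizes are illustrative.
import numpy as np

def flash_prefill_causal(q, k, v, block=64):
    """Causal (unidirectional) attention for prefill, computed block-wise."""
    seq_len, head_dim = q.shape
    scale = 1.0 / np.sqrt(head_dim)
    out = np.zeros_like(q)
    for qs in range(0, seq_len, block):
        qe = min(qs + block, seq_len)
        q_blk = q[qs:qe] * scale
        m = np.full(qe - qs, -np.inf)        # running row max
        l = np.zeros(qe - qs)                # running softmax denominator
        acc = np.zeros((qe - qs, head_dim))  # running weighted sum of V
        # The KV loop stops at qe: future keys/values are never loaded.
        for ks in range(0, qe, block):
            ke = min(ks + block, qe)
            s = q_blk @ k[ks:ke].T
            if ke > qs:  # only the diagonal block needs masking
                rows = np.arange(qs, qe)[:, None]
                cols = np.arange(ks, ke)[None, :]
                s = np.where(cols <= rows, s, -np.inf)
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            corr = np.exp(m - m_new)
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ v[ks:ke]
            m = m_new
        out[qs:qe] = acc / l[:, None]
    return out
```

Compared with iterating over the full sequence and masking afterwards, roughly half of the KV entries fall above the causal diagonal during prefill, which is consistent with the larger gains reported below for longer prompts.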

Motivation and Context

See above.

@daijh (Contributor, Author) commented Jul 15, 2025

Lunar Lake, Phi-4-mini-accuracy4:

| Prompt | Default Prefill Speed (tps) | Opt Prefill Speed (tps) | Improvement |
|---|---|---|---|
| Prompt-1024 | 561.40 | 593.76 | 5.76% |
| Prompt-2048 | 498.60 | 549.59 | 10.23% |
| Prompt-3072 | 465.98 | 537.64 | 15.38% |
| Prompt-4096 | 430.61 | 513.40 | 19.23% |
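
(Reading of the table: the improvement column is the relative prefill-speed gain, e.g. for Prompt-4096: (513.40 − 430.61) / 430.61 ≈ 19.23%.)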

@daijh (Contributor, Author) commented Jul 15, 2025

@sushraja-msft @qjia7 please take a look.

cc @jchen10 @xhcao

qjia7 previously approved these changes Jul 15, 2025

@qjia7 (Contributor) left a comment:

LGTM with nits.

@daijh (Contributor, Author) commented Jul 16, 2025

@guschmue @fs-eire
Please take a look.

@fs-eire (Contributor) commented Jul 22, 2025

/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline


Azure Pipelines successfully started running 5 pipeline(s).

@guschmue added the ep:WebGPU (ort-web webgpu provider) label Jul 29, 2025
@guschmue merged commit 2bd00ec into microsoft:main Jul 29, 2025
89 of 92 checks passed
@daijh deleted the optimize-flash-attention-for-prefill branch July 30, 2025 01:05
sanketkaleoss pushed a commit to sanketkaleoss/onnxruntime that referenced this pull request Aug 11, 2025
