
Flash Attention Buffer Compute Shader for Vulkan Backend Delegate #12654


Merged: 1 commit merged into pytorch:main on Jul 24, 2025

Conversation

@leafs1 (Contributor) commented on Jul 18, 2025

Summary: Adds a flash attention compute shader for the Vulkan backend delegate. The current implementation supports only buffer storage and is not fully optimized, but it is functional. The shader should speed up SDPA (scaled dot-product attention) in the attention blocks during transformer inference, since the previous implementation relied on many separate I/O operations. It includes proper multi-query attention support for models such as LLaMA, uses tiled block processing to reduce memory usage, and replaces multiple separate operations (matmul, softmax, masking) with a single fused compute shader.

Differential Revision: D78586517

cc @SS-JIA @manuelcandales @cbilgin
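
Editor's note: for readers unfamiliar with the flash-attention formulation, the sketch below illustrates the idea behind the tiled processing and fused masking/softmax described in the summary: keys and values are streamed in blocks, and the softmax is computed online with running row maxima and sums, so the full score matrix is never materialized. This is a minimal CPU-side reference in Python/NumPy, not the GLSL shader added in this PR; the function name `flash_sdpa`, the block size, and the causal-mask handling are illustrative assumptions.

```python
# Minimal CPU reference for tiled ("flash") scaled dot-product attention.
# NOT the GLSL shader from this PR: flash_sdpa, block_k, and the causal
# masking here are illustrative assumptions, chosen to show the algorithm only.
import numpy as np


def flash_sdpa(q, k, v, block_k=32, causal=True):
    """q: [Tq, D]; k, v: [Tk, D]. Returns the attention output, shape [Tq, D]."""
    Tq, D = q.shape
    Tk = k.shape[0]
    scale = 1.0 / np.sqrt(D)

    out = np.zeros((Tq, D))
    row_max = np.full(Tq, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(Tq)           # running softmax denominator per query row

    # Stream over key/value blocks: only a [Tq, block_k] tile of scores is live
    # at any time, never the full [Tq, Tk] score matrix.
    for start in range(0, Tk, block_k):
        end = min(start + block_k, Tk)
        s = (q @ k[start:end].T) * scale               # scores for this tile
        if causal:
            # mask keys that lie in the future relative to each query position
            cols = np.arange(start, end)[None, :]
            rows = np.arange(Tq)[:, None]
            s = np.where(cols > rows, -np.inf, s)

        # online softmax: rescale previous accumulators to the new running max
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max[:, None])               # exp(-inf) -> 0 for masked

        out = out * correction[:, None] + p @ v[start:end]
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]


if __name__ == "__main__":
    # quick check against a straightforward (non-tiled) masked softmax(QK^T)V
    rng = np.random.default_rng(0)
    T, D = 128, 64
    q, k, v = (rng.standard_normal((T, D)) for _ in range(3))

    s = (q @ k.T) / np.sqrt(D)
    s = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, s)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    ref = (p / p.sum(axis=1, keepdims=True)) @ v

    assert np.allclose(flash_sdpa(q, k, v), ref, atol=1e-6)
```

In the shader, this loop would presumably run per workgroup over buffer-backed tensors; the sketch is only meant to show why the fused, tiled formulation avoids the intermediate matmul/softmax/mask passes and their I/O round trips.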

@leafs1 requested a review from SS-JIA as a code owner on July 18, 2025 21:37
@pytorch-bot added the `module: vulkan` label (issues related to the Vulkan delegate and code under backends/vulkan/) on Jul 18, 2025
@pytorch-bot commented on Jul 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12654

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Pending, 1 Unrelated Failure

As of commit 2144c46 with merge base 154994c:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the `CLA Signed` label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jul 18, 2025
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

@leafs1 force-pushed the export-D78586517 branch from f2adda1 to 52f5541 on July 21, 2025 21:33
leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 52f5541 to 85ee101 on July 21, 2025 21:36
leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 85ee101 to 4fd4ce0 on July 21, 2025 21:37
leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 4fd4ce0 to 8efd477 on July 21, 2025 21:38
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 8efd477 to 1d93c72 on July 21, 2025 21:42
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 1d93c72 to c448552 on July 21, 2025 21:49
leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 23, 2025 (…torch#12654; same summary as above; Reviewed By: SS-JIA; Differential Revision: D78586517).
@leafs1 force-pushed the export-D78586517 branch from c448552 to 275ec3d on July 23, 2025 21:45
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

@leafs1 (Contributor, Author) commented on Jul 23, 2025

@pytorchbot label "release notes: none"

@pytorch-bot added the `release notes: none` label (do not include this in the release notes) on Jul 23, 2025
@leafs1 force-pushed the export-D78586517 branch from 275ec3d to 2144c46 on July 23, 2025 22:37
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

@leafs1 merged commit 0b8d99f into pytorch:main on Jul 24, 2025
163 of 176 checks passed
Conarnar pushed a commit to Conarnar/executorch that referenced this pull request on Jul 25, 2025 (…torch#12654; same summary and Differential Revision as above).
Labels
CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), fb-exported, module: vulkan (issues related to the Vulkan delegate and code under backends/vulkan/), release notes: none (do not include this in the release notes)
Projects: none yet
3 participants