
Flash Attention Buffer Compute Shader for Vulkan Backend Delegate #12654


Merged: 1 commit merged into pytorch:main on Jul 24, 2025

Conversation

@leafs1 (Contributor) commented on Jul 18, 2025

Summary: Adds a flash attention compute shader for the Vulkan backend delegate. The current implementation supports only buffer storage and is not fully optimized, but it is functional. The shader should speed up SDPA (scaled dot-product attention) in the attention blocks during transformer inference, since the previous implementation relied on many separate I/O operations. It includes proper multi-query attention support for models such as LLaMA, uses tiled block processing to reduce memory usage, and replaces multiple separate operations (matmul, softmax, masking) with a single fused compute shader.

Differential Revision: D78586517

cc @SS-JIA @manuelcandales @cbilgin
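
Editor's note: for readers unfamiliar with the flash-attention formulation, the sketch below illustrates the idea behind the tiled processing and fused masking/softmax described in the summary: keys and values are streamed in blocks, and the softmax is computed online with running row maxima and sums, so the full score matrix is never materialized. This is a minimal CPU-side reference in Python/NumPy, not the GLSL shader added in this PR; the function name `flash_sdpa`, the block size, and the causal-mask handling are illustrative assumptions.

```python
# Minimal CPU reference for tiled ("flash") scaled dot-product attention.
# NOT the GLSL shader from this PR: flash_sdpa, block_k, and the causal
# masking here are illustrative assumptions, chosen to show the algorithm only.
import numpy as np


def flash_sdpa(q, k, v, block_k=32, causal=True):
    """q: [Tq, D]; k, v: [Tk, D]. Returns the attention output, shape [Tq, D]."""
    Tq, D = q.shape
    Tk = k.shape[0]
    scale = 1.0 / np.sqrt(D)

    out = np.zeros((Tq, D))
    row_max = np.full(Tq, -np.inf)   # running max of scores per query row
    row_sum = np.zeros(Tq)           # running softmax denominator per query row

    # Stream over key/value blocks: only a [Tq, block_k] tile of scores is live
    # at any time, never the full [Tq, Tk] score matrix.
    for start in range(0, Tk, block_k):
        end = min(start + block_k, Tk)
        s = (q @ k[start:end].T) * scale               # scores for this tile
        if causal:
            # mask keys that lie in the future relative to each query position
            cols = np.arange(start, end)[None, :]
            rows = np.arange(Tq)[:, None]
            s = np.where(cols > rows, -np.inf, s)

        # online softmax: rescale previous accumulators to the new running max
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)
        p = np.exp(s - new_max[:, None])               # exp(-inf) -> 0 for masked

        out = out * correction[:, None] + p @ v[start:end]
        row_sum = row_sum * correction + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]


if __name__ == "__main__":
    # quick check against a straightforward (non-tiled) masked softmax(QK^T)V
    rng = np.random.default_rng(0)
    T, D = 128, 64
    q, k, v = (rng.standard_normal((T, D)) for _ in range(3))

    s = (q @ k.T) / np.sqrt(D)
    s = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, s)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    ref = (p / p.sum(axis=1, keepdims=True)) @ v

    assert np.allclose(flash_sdpa(q, k, v), ref, atol=1e-6)
```

In the shader, this loop would presumably run per workgroup over buffer-backed tensors; the sketch is only meant to show why the fused, tiled formulation avoids the intermediate matmul/softmax/mask passes and their I/O round trips.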

@leafs1 requested a review from SS-JIA as a code owner on July 18, 2025 21:37
@pytorch-bot added the `module: vulkan` label (issues related to the Vulkan delegate and code under backends/vulkan/) on Jul 18, 2025
@pytorch-bot commented on Jul 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/12654

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 3 Pending, 1 Unrelated Failure

As of commit 2144c46 with merge base 154994c:

NEW FAILURES - The following jobs have failed:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the `CLA Signed` label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jul 18, 2025
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

@leafs1 force-pushed the export-D78586517 branch from f2adda1 to 52f5541 on July 21, 2025 21:33
leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 52f5541 to 85ee101 on July 21, 2025 21:36
leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 85ee101 to 4fd4ce0 on July 21, 2025 21:37
leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 4fd4ce0 to 8efd477 on July 21, 2025 21:38
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 8efd477 to 1d93c72 on July 21, 2025 21:42
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 21, 2025 (…torch#12654; same summary and Differential Revision as above).
@leafs1 force-pushed the export-D78586517 branch from 1d93c72 to c448552 on July 21, 2025 21:49
leafs1 added a commit to leafs1/executorch that referenced this pull request on Jul 23, 2025 (…torch#12654; same summary as above; Reviewed By: SS-JIA; Differential Revision: D78586517).
@leafs1 force-pushed the export-D78586517 branch from c448552 to 275ec3d on July 23, 2025 21:45
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

@leafs1 (Contributor, Author) commented on Jul 23, 2025

@pytorchbot label "release notes: none"

@pytorch-bot added the `release notes: none` label (do not include this in the release notes) on Jul 23, 2025
@leafs1 force-pushed the export-D78586517 branch from 275ec3d to 2144c46 on July 23, 2025 22:37
@facebook-github-bot (Contributor) commented: This pull request was exported from Phabricator. Differential Revision: D78586517

@leafs1 merged commit 0b8d99f into pytorch:main on Jul 24, 2025
163 of 176 checks passed
Conarnar pushed a commit to Conarnar/executorch that referenced this pull request on Jul 25, 2025 (…torch#12654; same summary and Differential Revision as above).
Labels
CLA Signed (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed), fb-exported, module: vulkan (issues related to the Vulkan delegate and code under backends/vulkan/), release notes: none (do not include this in the release notes)
Projects: none yet
3 participants