Simplify Flash Attention Decode benchmarks generation #437

muhammad-tanvir-1211 · 2025-06-19T14:35:30Z

This PR removes much of the redundant boilerplate code required to generate the benchmarks for Flash Attention Decode.

…ntel#426)

Under certain conditions, a nightly tag for DPCPP may exist without providing the binary packages for Linux. This artifact can be found in the API description for the release tag, under the `assets` section.

This PR filters release tags and avoids the ones without the binary package `sycl_linux.tar.gz`. It also reverts the workaround that was added to avoid this failure on the day it was discovered.

---------

Co-authored-by: Alejandro Acosta <[email protected]>
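The filtering described above can be sketched as follows. This is only an illustration, not the script used by the CI: it assumes the release metadata has the shape returned by the GitHub REST API (a list of release objects with a `tag_name` and an `assets` list whose entries carry a `name`). The function name `tags_with_asset` and the sample tag names are hypothetical.

```python
def tags_with_asset(releases, asset_name="sycl_linux.tar.gz"):
    """Return the tag names of releases that ship an asset called asset_name.

    `releases` is assumed to be a list of dicts shaped like GitHub REST API
    release objects: {"tag_name": str, "assets": [{"name": str}, ...]}.
    """
    return [
        release["tag_name"]
        for release in releases
        if any(asset.get("name") == asset_name
               for asset in release.get("assets", []))
    ]


# Hypothetical sample data: only the middle release has the Linux package.
sample_releases = [
    {"tag_name": "nightly-2025-06-03",
     "assets": [{"name": "sycl_windows.tar.gz"}]},
    {"tag_name": "nightly-2025-06-02",
     "assets": [{"name": "sycl_linux.tar.gz"},
                {"name": "sycl_windows.tar.gz"}]},
    {"tag_name": "nightly-2025-06-01", "assets": []},
]

usable = tags_with_asset(sample_releases)
print(usable)  # ['nightly-2025-06-02']
```

Because the input order is preserved, a caller that receives releases newest-first could take the first element of the result as the most recent usable nightly.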

@aacostadiaz

**Summary**

This PR mainly adds the `BF16BF16FP32` CUTE example on `BMG`. It supports the `TT`, `NT` and `TN` input formats.

**Test Plan**

```
./examples/cute/tutorial/cute_tutorial_bmg 1024 128256 4096 'T' 'T'
./examples/cute/tutorial/cute_tutorial_bmg 1024 128256 4096 'N' 'T'
./examples/cute/tutorial/cute_tutorial_bmg 1024 128256 4096 'T' 'N'
```

Co-authored with @aacostadiaz @jiyang1011

---------

Co-authored-by: Alejandro Acosta <[email protected]>

joeatodd

Looks good, but I haven't had time yet for an in-depth review. I'll just drop these CMake suggestions for now 👍

joeatodd · 2025-06-25T13:06:36Z

.github/workflows/intel_test.yml

@@ -95,7 +95,8 @@ jobs:
          cmake -G Ninja  \
            -DCUTLASS_ENABLE_SYCL=ON \
            -DDPCPP_SYCL_TARGET=${{ matrix.sycl_target }} \
-            -DCUTLASS_SYCL_RUNNING_CI=ON
+            -DCUTLASS_SYCL_RUNNING_CI=ON \


CUTLASS_SYCL_RUNNING_CI doesn't seem to do anything as far as I can tell?

joeatodd · 2025-06-25T14:41:40Z

test/unit/flash_attention/flash_attention_decode/CMakeLists.txt

-  cutlass_test_unit_flash_attention_decode_h96_xe
-  cutlass_test_unit_flash_attention_decode_h128_xe
-  cutlass_test_unit_flash_attention_decode_h192_xe
+  cutlass_test_unit_flash_attention_decode_bf16_fp32_fp32_h64_512_xe


You can avoid having to specify all of these (and ease future maintenance) by constructing two lists of dependencies for `cutlass_test_unit_flash_attention_decode` and `test_unit_flash_attention_decode`, as follows.

At the top of the file, define e.g.:

    set(TEST_EXES "")
    set(TEST_RUNS "")

Then, as the last thing in the nested loop:

    list(APPEND TEST_EXES ${out_exe})
    string(REGEX REPLACE cutlass_ "" out_exe_snipped ${out_exe}) # borrowed from cutlass_test_unit_add_executable
    list(APPEND TEST_RUNS ${out_exe_snipped})

Then change the custom targets to:

    add_custom_target(
      cutlass_test_unit_flash_attention_decode
      DEPENDS ${TEST_EXES}
    )
    add_custom_target(
      test_unit_flash_attention_decode
      DEPENDS ${TEST_RUNS}
    )

muhammad-tanvir-1211 · 2025-07-08

Addressed this change in #408.

muhammad-tanvir-1211 and others added 21 commits June 5, 2025 16:03

Add more tests and benchmark configurations

2a5d95c

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

beabe4d

…tlass-fork into flash_decode_separate_out_configs

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

55030f0

…tlass-fork into flash_decode_separate_out_configs

Merge branch 'sycl-develop' into flash_decode_separate_out_configs

df9a1e1

Fix license year

d30c6be

Workaround to skip today's DPCPP nightly on CI (intel#425)

338d7fe

This is needed because today's release does not include the binaries used for CI

Split example for prefill attention with cachedkv (intel#409)

a1811a4

This pull splits the example for prefill attention with cachedkv into separate executables. --------- Co-authored-by: Muhammad Tanvir <[email protected]>

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

580e8c8

…tlass-fork into flash_decode_separate_out_configs

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

2ad93de

…tlass-fork into flash_decode_separate_out_configs

Simplify test generation

ea4376f

Merge branch 'sycl-develop' into flash_decode_separate_out_configs

540084a

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

ae03894

…tlass-fork into flash_decode_separate_out_configs

Fix benchmark api

0975c01

Merge branch 'flash_decode_separate_out_configs' of https://github.co…

188fdce

…m/muhammad-tanvir-1211/cutlass-fork into flash_decode_separate_out_configs

Fix benchmark names

ff198f5

* Reduce cpu threads used for build

Change intel workflow

4d446bb

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

30c3a79

…tlass-fork into flash_decode_separate_out_configs

Simplify benchmark generation

7f45907

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

90d7637

…tlass-fork into flash_decode_simplify_benchmarks

muhammad-tanvir-1211 requested a review from a team June 19, 2025 14:35

t4c1 approved these changes Jun 20, 2025

muhammad-tanvir-1211 added 5 commits June 20, 2025 09:32

Increase timeout

4327493

Added check for head_size_vo

a859948

Fix the CI

a9173c0

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

e4f8462

…tlass-fork into flash_decode_simplify_benchmarks

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

c64c66f

…tlass-fork into flash_decode_simplify_benchmarks

joeatodd reviewed Jun 25, 2025

Merge branch 'sycl-develop' of https://github.com/codeplaysoftware/cu…

8568330

…tlass-fork into flash_decode_simplify_benchmarks

muhammad-tanvir-1211 added 2 commits July 8, 2025 09:22

Remove test changes, hardcode head_size_vo

287b5af

Merge branch 'sycl-develop' into flash_decode_simplify_benchmarks

96ba5a7
