@muhammad-tanvir-1211 muhammad-tanvir-1211 commented Jun 19, 2025

This PR removes much of the redundant boilerplate code required to generate the benchmarks for Flash Attention Decode.

muhammad-tanvir-1211 and others added 21 commits June 5, 2025 16:03
This is needed because today's release does not include the binaries
used for CI
This PR splits the example for prefill attention with cached KV into
separate executables.

---------

Co-authored-by: Muhammad Tanvir <[email protected]>
…ntel#426)

Under certain conditions, a nightly tag for DPCPP may exist without
providing the binary packages for Linux. These packages are listed in
the API description for the release tag, under the `assets` section.

This PR filters release tags and avoids the ones without the binary
package `sycl_linux.tar.gz`

It also reverts the workaround that was added to avoid this failure on
the day it was discovered.
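
The filtering described above can be sketched in a few lines. This is a minimal illustration, not the script used in the CI: it assumes the release metadata has already been fetched as a list of dictionaries shaped like the GitHub releases API response (`tag_name`, plus `assets` entries with a `name`), and that the list is ordered newest first. The function names and the sample tags are hypothetical.

```python
from typing import Dict, List, Optional

# Asset name taken from the commit message above.
REQUIRED_ASSET = "sycl_linux.tar.gz"


def has_linux_binary(release: Dict) -> bool:
    """Return True if the release lists the Linux binary package among its assets."""
    return any(
        asset.get("name") == REQUIRED_ASSET
        for asset in release.get("assets", [])
    )


def latest_usable_tag(releases: List[Dict]) -> Optional[str]:
    """Return the first tag whose release ships the binary package.

    Assumes `releases` is ordered newest first; returns None if no
    release qualifies.
    """
    for release in releases:
        if has_linux_binary(release):
            return release["tag_name"]
    return None


# Hypothetical sample data: the newest nightly was tagged without binaries.
SAMPLE_RELEASES = [
    {"tag_name": "nightly-2025-06-05", "assets": []},
    {"tag_name": "nightly-2025-06-04", "assets": [{"name": "sycl_linux.tar.gz"}]},
]

if __name__ == "__main__":
    print(latest_usable_tag(SAMPLE_RELEASES))
```

With the sample data, the newest tag is skipped because its `assets` list is empty, and the previous nightly is selected instead.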

---------

Co-authored-by: Alejandro Acosta <[email protected]>
**Summary**
This PR mainly adds the `BF16BF16FP32` CUTE example on `BMG`. It
supports the `TT`, `NT`, and `TN` input layouts.

**Test Plan**
```
./examples/cute/tutorial/cute_tutorial_bmg 1024 128256 4096 'T' 'T'
./examples/cute/tutorial/cute_tutorial_bmg 1024 128256 4096 'N' 'T'
./examples/cute/tutorial/cute_tutorial_bmg 1024 128256 4096 'T' 'N'
```

Co-authored with @aacostadiaz @jiyang1011

---------

Co-authored-by: Alejandro Acosta <[email protected]>
* Reduce CPU threads used for build
@muhammad-tanvir-1211 muhammad-tanvir-1211 requested a review from a team June 19, 2025 14:35
@joeatodd joeatodd left a comment

Looks good, but I haven't had time yet for an in-depth review. I'll just drop these CMake suggestions for now 👍

@@ -95,7 +95,8 @@ jobs:
cmake -G Ninja \
-DCUTLASS_ENABLE_SYCL=ON \
-DDPCPP_SYCL_TARGET=${{ matrix.sycl_target }} \
--DCUTLASS_SYCL_RUNNING_CI=ON
+-DCUTLASS_SYCL_RUNNING_CI=ON \

CUTLASS_SYCL_RUNNING_CI doesn't seem to do anything as far as I can tell?

cutlass_test_unit_flash_attention_decode_h96_xe
cutlass_test_unit_flash_attention_decode_h128_xe
cutlass_test_unit_flash_attention_decode_h192_xe
cutlass_test_unit_flash_attention_decode_bf16_fp32_fp32_h64_512_xe

You can avoid having to specify all these (& ease maintainability in future) by constructing two lists of dependencies for cutlass_test_unit_flash_attention_decode and test_unit_flash_attention_decode as follows:

At the top of the file define e.g.:

set(TEST_EXES "")
set(TEST_RUNS "")

Then last thing in the nested loop:

      list(APPEND TEST_EXES ${out_exe})
      string(REGEX REPLACE "cutlass_" "" out_exe_snipped "${out_exe}") # borrowed from cutlass_test_unit_add_executable
      list(APPEND TEST_RUNS ${out_exe_snipped})

Then change the add custom targets to:

add_custom_target(
  cutlass_test_unit_flash_attention_decode
  DEPENDS
  ${TEST_EXES}
)

add_custom_target(
  test_unit_flash_attention_decode
  DEPENDS
  ${TEST_RUNS}
)

Addressed this change in #408.

6 participants