Simplify Flash Attention Decode benchmarks generation #437
…tlass-fork into flash_decode_separate_out_configs
…tlass-fork into flash_decode_separate_out_configs
This is needed because today's release does not include the binaries used for CI
This pull splits the example for prefill attention with cachedkv into separate executables. --------- Co-authored-by: Muhammad Tanvir <[email protected]>
…ntel#426) Under certain conditions, a nightly tag for DPCPP may exist without providing the binary packages for Linux. This artifact can be found in the API description for the release tag, under the `assets` section. This PR filters release tags and skips the ones without the binary package `sycl_linux.tar.gz`. It also reverts the workaround that was added to avoid this failure on the day it was discovered. --------- Co-authored-by: Alejandro Acosta <[email protected]>
**Summary**
This PR mainly adds the `BF16BF16FP32` CUTE Example on `BMG`. It supports input format with `TT`, `NT` and `TN`.

**Test Plan**
```
./examples/cute/tutorial/cute_tutorial_bmg 1024 128256 4096 'T' 'T'
./examples/cute/tutorial/cute_tutorial_bmg 1024 128256 4096 'N' 'T'
./examples/cute/tutorial/cute_tutorial_bmg 1024 128256 4096 'T' 'N'
```
Co-author with @aacostadiaz @jiyang1011
---------
Co-authored-by: Alejandro Acosta <[email protected]>
…tlass-fork into flash_decode_separate_out_configs
…tlass-fork into flash_decode_separate_out_configs
…tlass-fork into flash_decode_separate_out_configs
…m/muhammad-tanvir-1211/cutlass-fork into flash_decode_separate_out_configs
* Reduce cpu threads used for build
…tlass-fork into flash_decode_separate_out_configs
…tlass-fork into flash_decode_simplify_benchmarks
…tlass-fork into flash_decode_simplify_benchmarks
…tlass-fork into flash_decode_simplify_benchmarks
Looks good, but I haven't had time yet for an in-depth review. I'll just drop these CMake suggestions for now 👍
@@ -95,7 +95,8 @@ jobs:
 cmake -G Ninja \
 -DCUTLASS_ENABLE_SYCL=ON \
 -DDPCPP_SYCL_TARGET=${{ matrix.sycl_target }} \
--DCUTLASS_SYCL_RUNNING_CI=ON
+-DCUTLASS_SYCL_RUNNING_CI=ON \
`CUTLASS_SYCL_RUNNING_CI` doesn't seem to do anything as far as I can tell?
cutlass_test_unit_flash_attention_decode_h96_xe
cutlass_test_unit_flash_attention_decode_h128_xe
cutlass_test_unit_flash_attention_decode_h192_xe
cutlass_test_unit_flash_attention_decode_bf16_fp32_fp32_h64_512_xe
You can avoid having to specify all these (and ease future maintenance) by constructing two lists of dependencies for `cutlass_test_unit_flash_attention_decode` and `test_unit_flash_attention_decode` as follows:
At the top of the file define e.g.:
set(TEST_EXES "")
set(TEST_RUNS "")
Then last thing in the nested loop:
list(APPEND TEST_EXES ${out_exe})
string(REGEX REPLACE cutlass_ "" out_exe_snipped ${out_exe}) #borrowed from cutlass_test_unit_add_executable
list(APPEND TEST_RUNS ${out_exe_snipped})
Then change the custom targets to:
add_custom_target(
cutlass_test_unit_flash_attention_decode
DEPENDS
${TEST_EXES}
)
add_custom_target(
test_unit_flash_attention_decode
DEPENDS
${TEST_RUNS}
)
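For reference, the whole pattern can be put together as a small standalone sketch. It is only an illustration under stated assumptions: the loop variables `HEAD_DIMS` and `KV_TILES`, their values, and the generated target names are hypothetical, and plain `add_custom_target` calls stand in for whatever `cutlass_test_unit_add_executable` creates in the real file. The list names `TEST_EXES` and `TEST_RUNS` are the ones proposed above.

```cmake
cmake_minimum_required(VERSION 3.18)
project(flash_decode_targets_sketch LANGUAGES NONE)

# Hypothetical configuration axes; the real nested loop may iterate over other variables.
set(HEAD_DIMS h64 h96 h128 h192)
set(KV_TILES 512 1024)

# Accumulators for the per-configuration build targets and run targets.
set(TEST_EXES "")
set(TEST_RUNS "")

foreach(head_dim IN LISTS HEAD_DIMS)
  foreach(kv_tile IN LISTS KV_TILES)
    set(out_exe "cutlass_test_unit_flash_attention_decode_${head_dim}_${kv_tile}_xe")

    # Placeholder for the real executable target (assumption: the actual file
    # creates it through cutlass_test_unit_add_executable).
    add_custom_target(${out_exe})

    # Derive the run-target name by dropping the "cutlass_" prefix.
    string(REGEX REPLACE "cutlass_" "" out_exe_snipped "${out_exe}")

    # Placeholder for the matching run target.
    add_custom_target(${out_exe_snipped})
    add_dependencies(${out_exe_snipped} ${out_exe})

    # Last step in the nested loop: record both names.
    list(APPEND TEST_EXES ${out_exe})
    list(APPEND TEST_RUNS ${out_exe_snipped})
  endforeach()
endforeach()

# Aggregate targets no longer need a hand-maintained list of names.
add_custom_target(cutlass_test_unit_flash_attention_decode)
add_dependencies(cutlass_test_unit_flash_attention_decode ${TEST_EXES})

add_custom_target(test_unit_flash_attention_decode)
add_dependencies(test_unit_flash_attention_decode ${TEST_RUNS})
```

The sketch uses `add_dependencies` for the aggregate targets because that is the documented way to express target-to-target dependencies. The `DEPENDS` form from the suggestion above would take the same two lists.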
Addressed this change in #408.
…tlass-fork into flash_decode_simplify_benchmarks
This PR removes a lot of redundant boilerplate code required to generate the benchmarks for Flash Attention Decode.