Conversation

wuxun-zhang

In the current implementation, if users want a bf16-in, bf16-out GEMM, the MMA atom XE_8x16x16_BF16BF16BF16BF16_TT is used and accumulation happens in bf16. To preserve accuracy, fp32 accumulation is needed. This PR supports that by adding a dtype conversion in the epilogue. It adds the assumption that C and D have the same dtype.

@wuxun-zhang
Author

@rolandschulz @tdeng5 Could you please review this? Thanks.

@tdeng5 tdeng5 requested a review from taozha2 August 29, 2025 05:26
@taozha2 taozha2 requested a review from jiyang1011 August 29, 2025 05:44
@wuxun-zhang
Author

@jiyang1011 Could you please review this? Thanks.

@jiyang1011

https://github.com/intel/cutlass-sycl/blob/main/test/unit/gemm/device/xe_gemm_fp16_fp16_fp16_tensor_op_fp16.cpp

This is the unit test for a 16-bit-accumulator MMA. If ElementAccumulator is hard-coded to float there, that is clearly not reasonable.

Previously, a bf16-in, bf16-out GEMM used a bf16 accumulator; after this patch,
an fp32 accumulator is used for better accuracy.
There is an assumption that C and D have the same dtype.

Signed-off-by: Wuxun Zhang <[email protected]>
@wuxun-zhang
Author

https://github.com/intel/cutlass-sycl/blob/main/test/unit/gemm/device/xe_gemm_fp16_fp16_fp16_tensor_op_fp16.cpp

This is the unit test for a 16-bit-accumulator MMA. If ElementAccumulator is hard-coded to float there, that is clearly not reasonable.

A different accumulator dtype is now supported. Please check the latest commits.

@wuxun-zhang
Author

@jiyang1011 @taozha2 Could you please help trigger the CI tests here?

@taozha2 taozha2 requested a review from rolandschulz September 4, 2025 00:47

  using CollectiveEpilogue = typename Gemm::CollectiveEpilogue;
- using ElementC = typename Gemm::ElementC;
+ using ElementC = typename CollectiveEpilogue::ElementOutput;


I think this change is wrong. Why change this? Did you check that it is correct when the dtypes of C and the output differ?


I think this issue can be detected by the pre-CI checks.

Author

It assumes C and D have the same dtype here. I also think we need to support more dtype combinations.


Why change it if you assume C and D are the same? We should add a static_assert if it only works under that assumption.

@sanchitintel

sanchitintel commented Sep 12, 2025

if users want a bf16-in, bf16-out GEMM, the MMA atom XE_8x16x16_BF16BF16BF16BF16_TT is used and accumulation happens in bf16. To preserve accuracy, fp32 accumulation is needed. This PR supports that by adding a dtype conversion in the epilogue.

XE_8x16x16_F32BF16BF16F32_TT already supports FP32 accumulation.
Support for changing output dtype to BF16 in epilogue is already present, except for GroupedGEMM.
I opened #505 & #506 to illustrate (you may have to manually compare the code with a diff tool such as BeyondCompare).

It adds the assumption that C and D have the same dtype.

I see now that this is the new feature.

Thank you!

Comment on lines 93 to 94
using ElementC = typename FusionCallbacks::ElementSource;
using ElementAccumulator = ElementC_;

@sanchitintel sanchitintel Sep 12, 2025


This config did not work with the latest commit of this branch. Is it currently supported? Thanks!

[screenshot omitted]


using TiledMma = typename CollectiveMainloop::TiledMma;

using EpilogueOp = epilogue::fusion::LinearCombination<float, float>;

@sanchitintel sanchitintel Sep 12, 2025


Currently, the BF16 A/B with FP32 C and BF16 D/output case is supported in the main branch, but the epilogue::fusion::LinearCombination usage at this line is non-intuitive: the main branch uses the API in a hacky way that deviates from its intended/documented use, since the first template parameter is meant to be the output dtype.

Currently, the unwritten/implicit contract for this code in the current main branch seems to be:

  1. intuitively thinking of it as computing D = alpha * Accum + beta * C in Float,
  2. and then setting the correct ElementD parameter in the cutlass::epilogue::collective::CollectiveEpilogue can be thought of as converting to the correct output dtype (which is ElementOutput in this file).

It seems that when this PR is ready, it will rectify the usage of cutlass::epilogue::fusion::LinearCombination in this repo.

Thanks!
