
@sanchitintel sanchitintel commented Sep 12, 2025

Fixes #500

This PR adds support for a new configuration: the A and B matrices are BF16, the C matrices are FP32, and the D matrices are BF16. The output is converted from FP32 to BF16 in the epilogue.
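
For readers unfamiliar with the pattern, here is a minimal host-side sketch of what "converting in the epilogue" means: the linear combination is computed in FP32 and only rounded to BF16 when D is written. It is an illustration under assumptions, not the CUTLASS implementation. `Bf16`, `float_to_bf16`, `bf16_to_float` and `linear_combination_epilogue` are hypothetical names, and round-to-nearest-even is assumed as the rounding mode.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a BF16 storage type: the upper 16 bits of an
// IEEE-754 binary32 value. Not the cutlass::bfloat16_t type.
struct Bf16 {
  std::uint16_t bits;
};

// FP32 -> BF16 with round-to-nearest-even (assumed rounding mode).
inline Bf16 float_to_bf16(float x) {
  std::uint32_t u;
  std::memcpy(&u, &x, sizeof(u));
  if (std::isnan(x)) {
    // Keep the sign and force a quiet-NaN mantissa bit so the result stays NaN.
    return Bf16{static_cast<std::uint16_t>((u >> 16) | 0x0040u)};
  }
  // Add 0x7FFF plus the lowest kept bit so that ties round to even.
  const std::uint32_t rounding_bias = 0x7FFFu + ((u >> 16) & 1u);
  return Bf16{static_cast<std::uint16_t>((u + rounding_bias) >> 16)};
}

// BF16 -> FP32 is exact: the 16 stored bits become the upper half of the float.
inline float bf16_to_float(Bf16 h) {
  const std::uint32_t u = static_cast<std::uint32_t>(h.bits) << 16;
  float x;
  std::memcpy(&x, &u, sizeof(x));
  return x;
}

// D = alpha * acc + beta * C, computed in FP32 and converted to BF16 on store.
inline std::vector<Bf16> linear_combination_epilogue(
    const std::vector<float>& acc, const std::vector<float>& c,
    float alpha, float beta) {
  assert(acc.size() == c.size());
  std::vector<Bf16> d(acc.size());
  for (std::size_t i = 0; i < acc.size(); ++i) {
    d[i] = float_to_bf16(alpha * acc[i] + beta * c[i]);
  }
  return d;
}
```

Keeping the accumulator and C in FP32 until the final store means each output element is rounded only once.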

The only change to the CUTLASS headers is one line in include/cutlass/epilogue/collective/builders/xe_builder.inl, which enables dtype conversion in the epilogue for GroupedGEMM. However, the GroupedGEMM example from examples/04_bmg_grouped_gemm/04_bmg_grouped_gemm_bf16_output.cpp has been copy-pasted to create a new file with a few lines of code changed (please use a diff tool such as BeyondCompare to compare the two files). I mostly adapted those changes from https://github.com/intel/cutlass-sycl/blob/e83f147263dd8ca3589b34d76ce6fbec58bac048/test/unit/gemm/device/default_gemm_group_configuration.hpp.

Ideally, we should retain one example & test different output dtypes in it. I'm open to making such a change.

Thanks!

Comment on lines +82 to +88

      - name: Upload Sarif Artifact
        uses: actions/upload-artifact@ea165f8d65b6e75b540449e92b4886f43607fa02 # v4.6.2
        with:
          name: codeql-results-${{ matrix.language }}
          path: ./results/${{ matrix.language }}.sarif
          retention-days: 7

@sanchitintel sanchitintel Sep 12, 2025


I didn't modify this file. Maybe it was modified by a GitHub Action.


No, it wasn't. You might have made a mistake when rebasing your branch locally. Please revert this change.

@sanchitintel sanchitintel changed the title Support dtype conversion in epilogue for GroupedGEMM Support FP32 -> BF16 conversion in epilogue of GroupedGEMM Sep 12, 2025
Comment on lines +628 to +636
using EpilogueOp =
    cutlass::epilogue::fusion::LinearCombination<float_t, float_t>;

using CollectiveEpilogue =
    typename cutlass::epilogue::collective::CollectiveBuilder<
        cutlass::arch::IntelXe, cutlass::arch::OpClassTensorOp, TileShape,
        Shape<_1, _1, _1>, cutlass::epilogue::collective::EpilogueTileAuto,
        float, float, float, LayoutC, 1, ElementOutput, LayoutC, 1,
        EpilogueDispatchPolicy, EpilogueOp>::CollectiveOp;

@sanchitintel sanchitintel Sep 12, 2025


Apart from ElementOutput being bfloat16_t, this is the only difference between the vanilla example and this one.

@rolandschulz, #482 currently doesn't support this case (BF16 output with BF16 inputs & FP32 accum).
I explained this change here.

Also, please advise if I should combine the two examples (different output dtype) in one file.

Thanks!


Yes, I think we want to combine them into one example if they are almost identical.
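
One possible shape for a combined example, sketched without any CUTLASS dependency: a single runner templated on the output element type, instantiated once per dtype and selected at run time. Every name here (`Bf16Storage`, `ExampleConfig`, `run_example`, `dispatch_example`, and the `"fp32"`/`"bf16"` option strings) is hypothetical and only illustrates the structure; the body of `run_example` is a placeholder that reports the size of the D buffers instead of launching a GEMM.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical 2-byte stand-in for a BF16 type such as bfloat16_t.
struct Bf16Storage {
  std::uint16_t bits;
};

// Types that differ between the two examples are collected in one place.
template <class ElementOutput>
struct ExampleConfig {
  using ElementAccumulator = float;  // accumulation stays in FP32 in both cases
  using ElementD = ElementOutput;    // only the output type varies
  static constexpr std::size_t bytes_per_output = sizeof(ElementOutput);
};

// Placeholder for the shared example body. A real example would build the
// kernel from ExampleConfig<ElementOutput>, run it and verify the result.
// This sketch only returns the total size in bytes of the D buffers.
template <class ElementOutput>
std::size_t run_example(std::size_t m, std::size_t n, std::size_t groups) {
  assert(groups > 0);
  return groups * m * n * ExampleConfig<ElementOutput>::bytes_per_output;
}

// Run-time selection of the instantiation, e.g. from a command-line option.
inline std::size_t dispatch_example(const std::string& output_dtype,
                                    std::size_t m, std::size_t n,
                                    std::size_t groups) {
  if (output_dtype == "fp32") {
    return run_example<float>(m, n, groups);
  }
  if (output_dtype == "bf16") {
    return run_example<Bf16Storage>(m, n, groups);
  }
  throw std::invalid_argument("unsupported output dtype: " + output_dtype);
}
```

With this layout, the two output dtypes share one source file and the epilogue type definitions are written once.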

Development

Successfully merging this pull request may close these issues.

[BUG] Unable to convert output dtype from FP32 to BF16 in Group GEMM epilogue