Support of fp8_scaled_mm() on XPU #34

base: main

Conversation
* Added support for sgl_kernel.fp8_scaled_mm op
* Input in dtype fp8 e4m3 or e5m2
* Output in dtype fp32, bf16, fp8 e4m3 or fp8 e5m2

Signed-off-by: Aditya Chatterjee <[email protected]>
Force-pushed from 6ae8ca6 to f3a0a83
airMeng left a comment
Have you compared with oneDNN's FP8 scaled_mm? I think we could reuse PyTorch's effort there.
```cmake
set(FETCHCONTENT_MAKEAVAILABLE_SERIAL FALSE)
FetchContent_MakeAvailable(repo-cutlass-sycl)
file(COPY ${repo-cutlass-sycl_SOURCE_DIR}/cmake/onemkl.cmake
     DESTINATION ${CMAKE_CURRENT_SOURCE_DIR}/cmake)
set(FETCHCONTENT_MAKEAVAILABLE_SERIAL TRUE)
FetchContent_MakeAvailable(repo-cutlass-sycl)
```
MKL has been disabled in the latest cutlass-sycl; you can remove these lines.
remove this file
This PR would be quite slow on the current platforms of Intel GPUs (even for CRI, I believe it would require quite a lot of changes to be performant). Are you planning to provide functional support here? @adityachatter
@mingfeima, the target is mostly functional here. Yes, CRI will have the optimal solution for any fp8 support.
@kareemshaik80 OK, I see. Please put this on a development branch, maybe named after […]. Additionally, there are a few API mismatches with sglang:
Right, this is mainly for BMG; we will evaluate performance. By the way, per-block quantization/scaling is a different API and will have a different implementation.
```cpp
float beta = 0.0f;

// Create a dummy C tensor
cutlass::device_memory::allocation<ElementC> dummy_C(M * N);
```
Avoid direct memory allocation from the SYCL runtime; use a torch factory function instead.
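For illustration, a minimal sketch of that suggestion, assuming `mat_a`, `M`, `N`, and `ElementC` are in scope as in the diff above and that `ElementC` corresponds to the chosen output dtype (bfloat16 used here purely as an example):

```cpp
// Hypothetical replacement for cutlass::device_memory::allocation<ElementC>:
// allocate the dummy C buffer through the PyTorch factory so it is owned by
// the caching allocator and freed together with the tensor.
at::Tensor dummy_C = at::empty(
    {M, N},
    at::TensorOptions().dtype(at::kBFloat16).device(mat_a.device()));
auto* dummy_C_ptr = static_cast<ElementC*>(dummy_C.data_ptr());
```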
```cpp
{static_cast<ElementA*>(mat_a.data_ptr()),
 stride_A,
 static_cast<ElementB*>(mat_b.data_ptr()),
 stride_B,
 static_cast<ElementScale*>(scales_a.data_ptr()),
 stride_SA,
 static_cast<ElementScale*>(scales_b.data_ptr()),
 stride_SB,
 nullptr,
 stride_SA,  // No zero point for A
 nullptr,
 stride_SB,  // No zero point for B
 K},  // group_size = K for per-row/col scaling
```
lint
```cpp
size_t workspace_size = Gemm::get_workspace_size(arguments);
cutlass::device_memory::allocation<uint8_t> workspace(workspace_size);
```
As above: avoid the SYCL allocation and use a torch factory function.
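A sketch for the workspace under the same assumption (raw bytes allocated through the PyTorch allocator on the input tensor's device):

```cpp
// Hypothetical torch-allocated workspace; at::kByte gives a uint8 buffer of
// exactly workspace_size bytes on the same XPU device as mat_a.
size_t workspace_size = Gemm::get_workspace_size(arguments);
at::Tensor workspace = at::empty(
    {static_cast<int64_t>(workspace_size)},
    at::TensorOptions().dtype(at::kByte).device(mat_a.device()));
auto* workspace_ptr = workspace.data_ptr<uint8_t>();
```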
```cpp
static inline std::pair<float, float> get_fp8_range(at::ScalarType dtype) {
  if (dtype == at::ScalarType::Float8_e4m3fn) {
    // E4M3FN: max = 448, min = -448
    return {-448.0f, 448.0f};
  } else {
    // Float8_e5m2
    // E5M2: max = 57344, min = -57344
    return {-57344.0f, 57344.0f};
  }
}
```
This should already be covered in torch; ATen overloads std::numeric_limits for the FP8 types.
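A sketch of what that could look like, assuming the `std::numeric_limits` specializations that c10 ships for its FP8 types (`c10::Float8_e4m3fn`, `c10::Float8_e5m2`):

```cpp
#include <ATen/ATen.h>
#include <c10/util/Float8_e4m3fn.h>
#include <c10/util/Float8_e5m2.h>

#include <limits>
#include <utility>

// Derive the representable range from c10's numeric_limits specializations
// instead of hard-coding 448 / 57344.
static inline std::pair<float, float> get_fp8_range(at::ScalarType dtype) {
  if (dtype == at::ScalarType::Float8_e4m3fn) {
    const float max = static_cast<float>(std::numeric_limits<c10::Float8_e4m3fn>::max());
    return {-max, max};
  }
  const float max = static_cast<float>(std::numeric_limits<c10::Float8_e5m2>::max());
  return {-max, max};
}
```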
```cpp
if (out_dtype == at::ScalarType::BFloat16) {
  using Config = Fp8GemmConfig<ElementInputFp8, cutlass::bfloat16_t>;
  Fp8GemmRunner<typename Config::Gemm, cutlass::bfloat16_t> runner;
  status = runner.run(mat_a_contig, mat_b_contig, scales_a_half, scales_b_half, out, hw_info);
```
For sglang you only need to implement bfloat16; the out data type is bfloat16 or float16.
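A sketch of the narrowed dispatch, reusing the `Fp8GemmConfig` / `Fp8GemmRunner` names from the diff above and assuming a `cutlass::half_t` instantiation exists alongside the bfloat16 one:

```cpp
// Reject anything sglang will not pass; only bf16 and fp16 outputs remain.
TORCH_CHECK(
    out_dtype == at::ScalarType::BFloat16 || out_dtype == at::ScalarType::Half,
    "fp8_scaled_mm: out_dtype must be bfloat16 or float16");

if (out_dtype == at::ScalarType::BFloat16) {
  using Config = Fp8GemmConfig<ElementInputFp8, cutlass::bfloat16_t>;
  Fp8GemmRunner<typename Config::Gemm, cutlass::bfloat16_t> runner;
  status = runner.run(mat_a_contig, mat_b_contig, scales_a_half, scales_b_half, out, hw_info);
} else {
  using Config = Fp8GemmConfig<ElementInputFp8, cutlass::half_t>;
  Fp8GemmRunner<typename Config::Gemm, cutlass::half_t> runner;
  status = runner.run(mat_a_contig, mat_b_contig, scales_a_half, scales_b_half, out, hw_info);
}
```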
```cpp
at::ScalarType intermediate_dtype;
if (is_fp8_dtype(out_dtype)) {
  intermediate_dtype = at::ScalarType::Half;
} else {
  intermediate_dtype = out_dtype;
}
```
not needed.
```cpp
// Dispatch based on input FP8 type
if (input_dtype == at::ScalarType::Float8_e4m3fn) {
  fp8_scaled_mm_impl<cutlass::float_e4m3_t>(
      mat_a, mat_b, scales_a_half, scales_b_half, intermediate_dtype, out_intermediate, hw_info);
} else {
  fp8_scaled_mm_impl<cutlass::float_e5m2_t>(
      mat_a, mat_b, scales_a_half, scales_b_half, intermediate_dtype, out_intermediate, hw_info);
}
```
Make it PyTorch-like and use the AT_DISPATCH_xxx macros. If a suitable one is not available, write one of your own; you can also define other types in it, such as acc_scalar_t and so on.
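For illustration, a minimal sketch of a hand-rolled dispatch macro in the AT_DISPATCH spirit; the macro name and the `fp8_t` / `acc_scalar_t` aliases are hypothetical, not part of the PR:

```cpp
// Maps an at::ScalarType to the corresponding cutlass FP8 type and invokes
// the lambda; fp8_t (and an accumulator alias) are visible inside the lambda
// body because the macro expands it inside the matching case block.
#define DISPATCH_FP8_INPUT_TYPES(DTYPE, NAME, ...)                    \
  [&] {                                                               \
    switch (DTYPE) {                                                  \
      case at::ScalarType::Float8_e4m3fn: {                           \
        using fp8_t = cutlass::float_e4m3_t;                          \
        using acc_scalar_t = float;                                   \
        return __VA_ARGS__();                                         \
      }                                                               \
      case at::ScalarType::Float8_e5m2: {                             \
        using fp8_t = cutlass::float_e5m2_t;                          \
        using acc_scalar_t = float;                                   \
        return __VA_ARGS__();                                         \
      }                                                               \
      default:                                                        \
        TORCH_CHECK(false, NAME, ": unsupported FP8 dtype ", DTYPE);  \
    }                                                                 \
  }()

// Usage, replacing the if/else above:
DISPATCH_FP8_INPUT_TYPES(input_dtype, "fp8_scaled_mm", [&] {
  fp8_scaled_mm_impl<fp8_t>(
      mat_a, mat_b, scales_a_half, scales_b_half, intermediate_dtype, out_intermediate, hw_info);
});
```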
```cpp
TORCH_CHECK(bias_tensor.size(0) == N, "bias must have size N");
TORCH_CHECK(bias_tensor.is_contiguous(), "bias must be contiguous");

if (is_fp8_dtype(out_dtype)) {
```
don't need this.
```
@@ -0,0 +1,124 @@
/***************************************************************************************************
```
duplicated.
| """ | ||
| Test code for sgl_kernel.fp8_scaled_mm() | ||
| Run as: | ||
| python -m pytest -v -s test_fp8_scaled_mm_xpu.py | ||
| """ |
OK. Per-channel quantization is not favored for recently released LLMs. Anyway, please provide performance data on Battlemage.
airMeng left a comment
Make sure you update the CI and benchmarks
sgl-kernel-xpu/tests/run_suite.py, line 19 (commit 1bb6c78): `TestFile("test_flash_attention.py"),`
sgl-kernel-xpu/benchmark/bench_fp8_gemm.py, line 117 (commit 1bb6c78): `lambda: sgl_scaled_mm(`
`/bin/bash -c "cd /root/sglang/sgl-kernel-xpu/benchmark && python3 bench_flash_attn.py && python3 bench_moe_topk_softmax.py"`
Add support for op `sgl_kernel.fp8_scaled_mm()` on XPU.

Supports:

* Input in dtype `FP8 E4M3` or `FP8 E5M2`
* Output in dtype `BF16`, `FP32`, `FP8 E4M3`, `FP8 E5M2`

Run the fp8_scaled_mm test code as: `python -m pytest -v -s test_fp8_scaled_mm_xpu.py`

Tested on BMG B580: `2000 passed`

`fp8_scaled_mm` is designed for the FP8 DeepSeek inference requirement.