
Conversation

@AndyLi429 AndyLi429 commented Dec 4, 2025

Add a sparse flash attention operator for SGLang on NPU.

Unit Test

python tests/python/sgl_kernel_npu/test_sparse_flash_attention.py
test results:
Ran 2 tests in 0.970s

OK

Development test with SGLang on an A3 device

Eagle mode:

python test_gms8k.py
start
100/100 [03:36<00:00, 2.16s/it]
Accuracy: 0.980
Invalid: 0.000
Latency: 218.359 s
Output throughput: 40.681 token/s
metrics={'accuracy': 0.98, 'invalid': 0.0, 'latency': 218.35866242006887, 'output_throughput': 40.68077676218437}
metrics['accuracy']=0.98

Graph mode:

python test_gms8k.py
start
100/100 [02:17<00:00, 1.38s/it]
Accuracy: 0.980
Invalid: 0.000
Latency: 138.304 s
Output throughput: 62.175 token/s
metrics={'accuracy': 0.98, 'invalid': 0.0, 'latency': 138.303909559967, 'output_throughput': 62.174670458405025}
metrics['accuracy']=0.98

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@AndyLi429 AndyLi429 force-pushed the sfa_ops branch 6 times, most recently from 5ad470a to a1d8f40 on December 9, 2025 08:19
@AndyLi429 AndyLi429 force-pushed the sfa_ops branch 2 times, most recently from 63d8bee to 0614e03 on December 15, 2025 09:11
@AndyLi429
Author

/gemini check

@gemini-code-assist
Contributor

It looks like you tried to invoke the command /gemini check, but this is not a recognized command. Please use /gemini review to request a code review, /gemini summary for a pull request summary, or /gemini help for a list of available commands.

@AndyLi429
Author

/gemini review

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new sparse_flash_attention operator, which is a significant feature. The implementation is comprehensive, including host-side logic, tiling configurations, and the NPU kernel. The inclusion of documentation and tests is commendable. However, I have identified several critical issues that could lead to crashes or compilation failures, along with some high and medium severity issues regarding correctness, maintainability, and documentation accuracy. Please review the detailed comments for specific feedback on these points.

@AndyLi429
Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request introduces a new Sparse Flash Attention operator tailored for SGLang on NPU, aiming to enhance the efficiency of attention mechanisms for extended sequences by leveraging sparse computation patterns. The changes encompass the full lifecycle of operator integration, from build system updates and Python binding registration to the detailed C++ implementation of host-side logic, tiling, and kernel execution, all supported by new unit tests and comprehensive documentation.

Highlights

  • New Sparse Flash Attention Operator: Introduced a new sparse_flash_attention operator for SGLang on NPU, designed to optimize attention computations for long sequences with sparse patterns.
  • Build System Integration: Updated CMakeLists.txt to include the new Sparse Flash Attention source files and link against the nnopbase library, ensuring the operator is correctly built.
  • Operator Registration: The sparse_flash_attention operator has been registered in pytorch_extensions.cpp with its full signature, making it accessible from Python via torch.ops.npu.sparse_flash_attention (a usage sketch follows this list).
  • Comprehensive Implementation: Added a complete implementation for the Sparse Flash Attention operator, including its host-side logic, detailed tiling data structures, tiling strategy, and the core kernel implementation for both cube and vector processing units.
  • New Test Cases: Included new Python test cases (test_sfa_eager and test_tnd_pabsnd_sfa_eager) to validate the functionality and precision of the new Sparse Flash Attention operator against a CPU reference implementation.
  • Documentation: Provided extensive documentation in README.md for the new operator, detailing its function, prototype, parameters, return values, constraints, and a usage example.
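
Based on the registration described above, here is a minimal usage sketch. The shapes, dtypes, the layout of sparse_indices, and the keyword-argument form are assumptions for illustration; the README added under csrc/sparse_flash_attention/ carries the authoritative prototype.

```python
import torch
import torch_npu  # Ascend backend; provides the torch.ops.npu.* custom-op namespace

# Illustrative sizes only: batch 1, 1 query token, 4 heads, head_dim 128,
# 1024 KV tokens grouped into blocks of sparse_block_size.
b, sq, n, d = 1, 1, 4, 128
skv, sparse_block_size = 1024, 128
num_blocks = skv // sparse_block_size

query = torch.randn(b, sq, n, d, dtype=torch.bfloat16).npu()   # BSND
key = torch.randn(b, skv, n, d, dtype=torch.bfloat16).npu()    # BSND
value = torch.randn(b, skv, n, d, dtype=torch.bfloat16).npu()  # BSND
# Block ids each query token attends to; the exact index shape is an assumption.
sparse_indices = torch.arange(num_blocks, dtype=torch.int32).view(1, 1, -1).npu()

attention_out = torch.ops.npu.sparse_flash_attention(
    query, key, value, sparse_indices,
    scale_value=d ** -0.5,
    sparse_block_size=sparse_block_size,
    layout_query="BSND",
    layout_kv="BSND",
)
```

The optional inputs listed in the changelog below (block_table, actual_seq_lengths_query, actual_seq_lengths_kv, query_rope, key_rope) are omitted here; the PA_BSND path exercised by test_tnd_pabsnd_sfa_eager would additionally supply block_table and the actual sequence lengths.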
Changelog
  • csrc/CMakeLists.txt
    • Added sparse_flash_attention.cpp and sparse_flash_attention_tiling.cpp to OP_SRCS.
    • Added sparse_flash_attention_kernel.cpp to WORKSPACE_KERNEL_SRCS.
    • Linked the nnopbase library to the OP_PLUGIN_NAME target.
    • Added ${ASCEND_INCLUDE_DIR}/ to target include directories.
  • csrc/pytorch_extensions.cpp
    • Registered the sparse_flash_attention operator with its detailed signature, including query, key, value, sparse indices, scale value, sparse block size, and various optional parameters.
    • Implemented the sparse_flash_attention operator using TORCH_FN(sglang::npu_kernel::sparse_flash_attention).
  • csrc/sparse_flash_attention/README.md
    • Added a new README file documenting the torch.ops.npu.sparse_flash_attention operator.
    • Detailed function description, prototype, parameter descriptions, return value, constraints, and a Python usage example.
  • csrc/sparse_flash_attention/op_host/sparse_flash_attention.cpp
    • Implemented the host-side logic for the sparse_flash_attention operator.
    • Includes helper functions for constructing output tensors, setting attributes, handling optional tensors, registering tensors to the tiling context, creating workspace and tiling tensors, and performing tiling operations.
  • csrc/sparse_flash_attention/op_host/sparse_flash_attention_def.h
    • Defined the SparseFlashAttention operator class, inheriting from OpDef.
    • Specified required and optional inputs (query, key, value, sparse_indices, block_table, actual_seq_lengths_query, actual_seq_lengths_kv, query_rope, key_rope) and output (attention_out).
    • Defined required and optional attributes (scale_value, sparse_block_size, layout_query, layout_kv, sparse_mode) with their types and default values.
  • csrc/sparse_flash_attention/op_host/tiling/sparse_flash_attention_data.h
    • Defined data structures for Sparse Flash Attention tiling parameters.
    • Includes SparseFlashAttentionBaseParamsMla, SparseFlashAttentionSplitKVParamsMla, SparseFlashAttentionSingleCoreParamsMla, SparseFlashAttentionSingleCoreTensorSizeMla, and SparseFlashAttentionInnerSplitParams.
  • csrc/sparse_flash_attention/op_host/tiling/sparse_flash_attention_tiling.cpp
    • Implemented the tiling logic for the Sparse Flash Attention operator.
    • Includes functions for parameter initialization, core splitting, workspace size calculation, and tiling key generation.
    • Added comprehensive checks for data types, layouts, dimensions, and parameter consistency.
  • csrc/sparse_flash_attention/op_host/tiling/sparse_flash_attention_tiling.h
    • Declared enums for SFALayout, KvStorageMode, SFAPerfMode, and SFAAxis.
    • Defined structures for SFATilingShapeCompareParam, SFARequiredParaInfo, SFAOptionalParaInfo, SFAParaInfo, and SFATilingInfo.
    • Declared SFAMlaTiling and SFATilingCheck classes for tiling logic and parameter validation, respectively.
    • Declared SFAInfoParser class for parsing operator information.
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_common.h
    • Defined common structures and utilities for the Sparse Flash Attention kernel.
    • Includes SFA_LAYOUT enum, SFAType template struct, RunInfo struct for loop parameters, and ConstInfo struct for constant kernel information.
    • Provided utility functions for alignment (SFAAlign) and minimum value (Min).
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_kernel.cpp
    • Implemented the kernel entry point for sparse_flash_attention.
    • Dispatches to different SparseFlashAttentionMla kernel instances based on a tilingKey derived from data types and layouts.
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_kernel_mla.h
    • Declared the SparseFlashAttentionMla class template, which implements the core logic for the Sparse Flash Attention kernel.
    • Includes initialization of tiling data, buffers, and calculation parameters.
    • Defines the Process method for orchestrating the kernel execution, including a preload pipeline for efficient data handling.
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_service_cube_mla.h
    • Defined the SFAMatmulService class template for handling cube-related operations in the Sparse Flash Attention kernel.
    • Includes methods for initializing global tensors, buffers, and event IDs.
    • Provides functions for copying data from GM to L1 and performing matrix multiplications (MM1 and MM2).
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_service_vector_mla.h
    • Defined the SFAVectorService class template for handling vector-related operations in the Sparse Flash Attention kernel.
    • Includes methods for initializing buffers, event IDs, and softmax default buffers.
    • Provides functions for element-wise computations, softmax, and merging KV data.
  • include/sgl_kenel_npu_ops.h
    • Added the declaration for the sparse_flash_attention function within the sglang::npu_kernel namespace.
  • tests/python/sgl_kernel_npu/test_sparse_flash_attention.py
    • Added a new Python test file for the sparse_flash_attention operator.
    • Includes helper functions for CPU reference computation (gather_kv, mask, softmax, pa_to_bsnd, trans_tnd_to_bsnd, trans_bnsd_to_tnd, trans_tnd_actseq); a minimal reference sketch follows this changelog.
    • Implemented TestCustomSFA class with two test methods: test_sfa_eager (BSND layout) and test_tnd_pabsnd_sfa_eager (TND query, PA_BSND KV layout).
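
As a companion to the test helpers above, here is a minimal, self-contained sketch of the kind of CPU reference such a test can check against: gather the KV rows of the selected sparse blocks, then run dense scaled-dot-product attention over them. Function and variable names are illustrative, not the test's actual helpers.

```python
import torch

def sparse_attention_reference(query, key, value, sparse_indices,
                               scale_value, sparse_block_size):
    """Dense-math reference for sparse attention over selected KV blocks.

    query:          [b, sq, n, d]        (BSND)
    key, value:     [b, skv, n, d]       (BSND)
    sparse_indices: [b, sq, num_blocks]  block ids each query token attends to
    """
    b, sq, n, d = query.shape
    out = torch.empty_like(query, dtype=torch.float32)
    for bi in range(b):
        for si in range(sq):
            # Expand the selected block ids into KV row indices.
            blocks = sparse_indices[bi, si].tolist()
            rows = torch.cat([
                torch.arange(blk * sparse_block_size, (blk + 1) * sparse_block_size)
                for blk in blocks
            ])
            q = query[bi, si].float()    # [n, d]
            k = key[bi, rows].float()    # [s_sel, n, d]
            v = value[bi, rows].float()  # [s_sel, n, d]
            # Per-head scaled dot-product attention over the gathered rows.
            scores = torch.einsum("nd,snd->ns", q, k) * scale_value   # [n, s_sel]
            probs = torch.softmax(scores, dim=-1)
            out[bi, si] = torch.einsum("ns,snd->nd", probs, v)
    return out.to(query.dtype)
```

The NPU output produced from the same inputs can then be compared against this reference with torch.testing.assert_close, using a tolerance appropriate for float16/bfloat16.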
Activity
  • The bot initially indicated a daily quota limit.
  • The author attempted to use an unrecognized command /gemini check.
  • The bot responded by clarifying the available commands (/gemini review, /gemini summary, /gemini help).
  • The author then requested a code review using /gemini review.
  • The bot provided several critical and medium priority review comments, highlighting issues such as missing spaces in operator schema, unsafe optional parameter access, inconsistent default values, incorrect macro usage in assertions, and discrepancies in documentation and usage examples.
  • The author requested a summary using /gemini summary.

@AndyLi429 AndyLi429 changed the title from "add sfa ops" to "sparse_flash_attention operator add to NPU kernel" on Dec 24, 2025
RuixuanZhang06 previously approved these changes Dec 26, 2025