
Conversation

@AndyLi429 AndyLi429 commented Dec 4, 2025

Add a sparse flash attention operator for SGLang on NPU.

Unit Test

python tests/python/sgl_kernel_npu/test_sparse_flash_attention.py
test results:
Ran 2 tests in 0.970s

OK

Development test with SGLang on an A3 device

Eagle mode:

python test_gms8k.py
start
100/100 [03:36<00:00, 2.16s/it]
Accuracy: 0.980
Invalid: 0.000
Latency: 218.359 s
Output throughput: 40.681 token/s
metrics={'accuracy': 0.98, 'invalid': 0.0, 'latency': 218.35866242006887, 'output_throughput': 40.68077676218437}
metrics['accuracy']=0.98

Graph mode:

python test_gms8k.py
start
100/100 [02:17<00:00, 1.38s/it]
Accuracy: 0.980
Invalid: 0.000
Latency: 138.304 s
Output throughput: 62.175 token/s
metrics={'accuracy': 0.98, 'invalid': 0.0, 'latency': 138.303909559967, 'output_throughput': 62.174670458405025}
metrics['accuracy']=0.98

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@AndyLi429 AndyLi429 force-pushed the sfa_ops branch 6 times, most recently from 5ad470a to a1d8f40 on December 9, 2025 08:19
@AndyLi429 AndyLi429 force-pushed the sfa_ops branch 2 times, most recently from 63d8bee to 0614e03 on December 15, 2025 09:11
@AndyLi429
Author

/gemini check

@gemini-code-assist
Contributor

It looks like you tried to invoke the command /gemini check, but this is not a recognized command. Please use /gemini review to request a code review, /gemini summary for a pull request summary, or /gemini help for a list of available commands.

@AndyLi429
Author

/gemini review

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request introduces a new sparse_flash_attention operator, which is a significant feature. The implementation is comprehensive, including host-side logic, tiling configurations, and the NPU kernel. The inclusion of documentation and tests is commendable. However, I have identified several critical issues that could lead to crashes or compilation failures, along with some high and medium severity issues regarding correctness, maintainability, and documentation accuracy. Please review the detailed comments for specific feedback on these points.

@AndyLi429
Author

/gemini summary

@gemini-code-assist
Contributor

Summary of Changes

This pull request introduces a new Sparse Flash Attention operator tailored for SGLang on NPU, aiming to enhance the efficiency of attention mechanisms for extended sequences by leveraging sparse computation patterns. The changes encompass the full lifecycle of operator integration, from build system updates and Python binding registration to the detailed C++ implementation of host-side logic, tiling, and kernel execution, all supported by new unit tests and comprehensive documentation.

Highlights

  • New Sparse Flash Attention Operator: Introduced a new sparse_flash_attention operator for SGLang on NPU, designed to optimize attention computations for long sequences with sparse patterns.
  • Build System Integration: Updated CMakeLists.txt to include the new Sparse Flash Attention source files and link against the nnopbase library, ensuring the operator is correctly built.
  • Operator Registration: The sparse_flash_attention operator has been registered in pytorch_extensions.cpp with its full signature, making it accessible from Python via torch.ops.npu.sparse_flash_attention (a usage sketch follows this list).
  • Comprehensive Implementation: Added a complete implementation for the Sparse Flash Attention operator, including its host-side logic, detailed tiling data structures, tiling strategy, and the core kernel implementation for both cube and vector processing units.
  • New Test Cases: Included new Python test cases (test_sfa_eager and test_tnd_pabsnd_sfa_eager) to validate the functionality and precision of the new Sparse Flash Attention operator against a CPU reference implementation.
  • Documentation: Provided extensive documentation in README.md for the new operator, detailing its function, prototype, parameters, return values, constraints, and a usage example.
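
Based on the registration described above, here is a minimal usage sketch. The shapes, dtypes, the layout of sparse_indices, and the keyword-argument form are assumptions for illustration; the README added under csrc/sparse_flash_attention/ carries the authoritative prototype.

```python
import torch
import torch_npu  # Ascend backend; provides the torch.ops.npu.* custom-op namespace

# Illustrative sizes only: batch 1, 1 query token, 4 heads, head_dim 128,
# 1024 KV tokens grouped into blocks of sparse_block_size.
b, sq, n, d = 1, 1, 4, 128
skv, sparse_block_size = 1024, 128
num_blocks = skv // sparse_block_size

query = torch.randn(b, sq, n, d, dtype=torch.bfloat16).npu()   # BSND
key = torch.randn(b, skv, n, d, dtype=torch.bfloat16).npu()    # BSND
value = torch.randn(b, skv, n, d, dtype=torch.bfloat16).npu()  # BSND
# Block ids each query token attends to; the exact index shape is an assumption.
sparse_indices = torch.arange(num_blocks, dtype=torch.int32).view(1, 1, -1).npu()

attention_out = torch.ops.npu.sparse_flash_attention(
    query, key, value, sparse_indices,
    scale_value=d ** -0.5,
    sparse_block_size=sparse_block_size,
    layout_query="BSND",
    layout_kv="BSND",
)
```

The optional inputs listed in the changelog below (block_table, actual_seq_lengths_query, actual_seq_lengths_kv, query_rope, key_rope) are omitted here; the PA_BSND path exercised by test_tnd_pabsnd_sfa_eager would additionally supply block_table and the actual sequence lengths.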
Changelog
  • csrc/CMakeLists.txt
    • Added sparse_flash_attention.cpp and sparse_flash_attention_tiling.cpp to OP_SRCS.
    • Added sparse_flash_attention_kernel.cpp to WORKSPACE_KERNEL_SRCS.
    • Linked the nnopbase library to the OP_PLUGIN_NAME target.
    • Added ${ASCEND_INCLUDE_DIR}/ to target include directories.
  • csrc/pytorch_extensions.cpp
    • Registered the sparse_flash_attention operator with its detailed signature, including query, key, value, sparse indices, scale value, sparse block size, and various optional parameters.
    • Implemented the sparse_flash_attention operator using TORCH_FN(sglang::npu_kernel::sparse_flash_attention).
  • csrc/sparse_flash_attention/README.md
    • Added a new README file documenting the torch.ops.npu.sparse_flash_attention operator.
    • Detailed function description, prototype, parameter descriptions, return value, constraints, and a Python usage example.
  • csrc/sparse_flash_attention/op_host/sparse_flash_attention.cpp
    • Implemented the host-side logic for the sparse_flash_attention operator.
    • Includes helper functions for constructing output tensors, setting attributes, handling optional tensors, registering tensors to the tiling context, creating workspace and tiling tensors, and performing tiling operations.
  • csrc/sparse_flash_attention/op_host/sparse_flash_attention_def.h
    • Defined the SparseFlashAttention operator class, inheriting from OpDef.
    • Specified required and optional inputs (query, key, value, sparse_indices, block_table, actual_seq_lengths_query, actual_seq_lengths_kv, query_rope, key_rope) and output (attention_out).
    • Defined required and optional attributes (scale_value, sparse_block_size, layout_query, layout_kv, sparse_mode) with their types and default values.
  • csrc/sparse_flash_attention/op_host/tiling/sparse_flash_attention_data.h
    • Defined data structures for Sparse Flash Attention tiling parameters.
    • Includes SparseFlashAttentionBaseParamsMla, SparseFlashAttentionSplitKVParamsMla, SparseFlashAttentionSingleCoreParamsMla, SparseFlashAttentionSingleCoreTensorSizeMla, and SparseFlashAttentionInnerSplitParams.
  • csrc/sparse_flash_attention/op_host/tiling/sparse_flash_attention_tiling.cpp
    • Implemented the tiling logic for the Sparse Flash Attention operator.
    • Includes functions for parameter initialization, core splitting, workspace size calculation, and tiling key generation.
    • Added comprehensive checks for data types, layouts, dimensions, and parameter consistency.
  • csrc/sparse_flash_attention/op_host/tiling/sparse_flash_attention_tiling.h
    • Declared enums for SFALayout, KvStorageMode, SFAPerfMode, and SFAAxis.
    • Defined structures for SFATilingShapeCompareParam, SFARequiredParaInfo, SFAOptionalParaInfo, SFAParaInfo, and SFATilingInfo.
    • Declared SFAMlaTiling and SFATilingCheck classes for tiling logic and parameter validation, respectively.
    • Declared SFAInfoParser class for parsing operator information.
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_common.h
    • Defined common structures and utilities for the Sparse Flash Attention kernel.
    • Includes SFA_LAYOUT enum, SFAType template struct, RunInfo struct for loop parameters, and ConstInfo struct for constant kernel information.
    • Provided utility functions for alignment (SFAAlign) and minimum value (Min).
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_kernel.cpp
    • Implemented the kernel entry point for sparse_flash_attention.
    • Dispatches to different SparseFlashAttentionMla kernel instances based on a tilingKey derived from data types and layouts.
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_kernel_mla.h
    • Declared the SparseFlashAttentionMla class template, which implements the core logic for the Sparse Flash Attention kernel.
    • Includes initialization of tiling data, buffers, and calculation parameters.
    • Defines the Process method for orchestrating the kernel execution, including a preload pipeline for efficient data handling.
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_service_cube_mla.h
    • Defined the SFAMatmulService class template for handling cube-related operations in the Sparse Flash Attention kernel.
    • Includes methods for initializing global tensors, buffers, and event IDs.
    • Provides functions for copying data from GM to L1 and performing matrix multiplications (MM1 and MM2).
  • csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_service_vector_mla.h
    • Defined the SFAVectorService class template for handling vector-related operations in the Sparse Flash Attention kernel.
    • Includes methods for initializing buffers, event IDs, and softmax default buffers.
    • Provides functions for element-wise computations, softmax, and merging KV data.
  • include/sgl_kenel_npu_ops.h
    • Added the declaration for the sparse_flash_attention function within the sglang::npu_kernel namespace.
  • tests/python/sgl_kernel_npu/test_sparse_flash_attention.py
    • Added a new Python test file for the sparse_flash_attention operator.
    • Includes helper functions for CPU reference computation (gather_kv, mask, softmax, pa_to_bsnd, trans_tnd_to_bsnd, trans_bnsd_to_tnd, trans_tnd_actseq); a minimal reference sketch follows this changelog.
    • Implemented TestCustomSFA class with two test methods: test_sfa_eager (BSND layout) and test_tnd_pabsnd_sfa_eager (TND query, PA_BSND KV layout).
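
As a companion to the test helpers above, here is a minimal, self-contained sketch of the kind of CPU reference such a test can check against: gather the KV rows of the selected sparse blocks, then run dense scaled-dot-product attention over them. Function and variable names are illustrative, not the test's actual helpers.

```python
import torch

def sparse_attention_reference(query, key, value, sparse_indices,
                               scale_value, sparse_block_size):
    """Dense-math reference for sparse attention over selected KV blocks.

    query:          [b, sq, n, d]        (BSND)
    key, value:     [b, skv, n, d]       (BSND)
    sparse_indices: [b, sq, num_blocks]  block ids each query token attends to
    """
    b, sq, n, d = query.shape
    out = torch.empty_like(query, dtype=torch.float32)
    for bi in range(b):
        for si in range(sq):
            # Expand the selected block ids into KV row indices.
            blocks = sparse_indices[bi, si].tolist()
            rows = torch.cat([
                torch.arange(blk * sparse_block_size, (blk + 1) * sparse_block_size)
                for blk in blocks
            ])
            q = query[bi, si].float()    # [n, d]
            k = key[bi, rows].float()    # [s_sel, n, d]
            v = value[bi, rows].float()  # [s_sel, n, d]
            # Per-head scaled dot-product attention over the gathered rows.
            scores = torch.einsum("nd,snd->ns", q, k) * scale_value   # [n, s_sel]
            probs = torch.softmax(scores, dim=-1)
            out[bi, si] = torch.einsum("ns,snd->nd", probs, v)
    return out.to(query.dtype)
```

The NPU output produced from the same inputs can then be compared against this reference with torch.testing.assert_close, using a tolerance appropriate for float16/bfloat16.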
Activity
  • The bot initially indicated a daily quota limit.
  • The author attempted to use an unrecognized command /gemini check.
  • The bot responded by clarifying the available commands (/gemini review, /gemini summary, /gemini help).
  • The author then requested a code review using /gemini review.
  • The bot provided several critical and medium priority review comments, highlighting issues such as missing spaces in operator schema, unsafe optional parameter access, inconsistent default values, incorrect macro usage in assertions, and discrepancies in documentation and usage examples.
  • The author requested a summary using /gemini summary.

@AndyLi429 AndyLi429 changed the title from "add sfa ops" to "sparse_flash_attention operator add to NPU kernel" on Dec 24, 2025
RuixuanZhang06 previously approved these changes Dec 26, 2025