Add sparse_flash_attention operator to NPU kernel #236
base: main
Conversation
force-pushed from 5ad470a to a1d8f40
force-pushed from 63d8bee to 0614e03
/gemini check
It looks like you tried to invoke the command
/gemini review
Code Review
This pull request introduces a new sparse_flash_attention operator, which is a significant feature. The implementation is comprehensive, including host-side logic, tiling configurations, and the NPU kernel. The inclusion of documentation and tests is commendable. However, I have identified several critical issues that could lead to crashes or compilation failures, along with some high and medium severity issues regarding correctness, maintainability, and documentation accuracy. Please review the detailed comments for specific feedback on these points.
Files with review comments:
csrc/sparse_flash_attention/op_host/tiling/sparse_flash_attention_tiling.cpp (outdated, resolved)
csrc/sparse_flash_attention/op_host/sparse_flash_attention_def.h (outdated, resolved)
csrc/sparse_flash_attention/op_host/tiling/sparse_flash_attention_data.h (resolved)
csrc/sparse_flash_attention/op_kernel/sparse_flash_attention_kernel_mla.h (outdated, resolved)
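On the tiling configurations mentioned in the review above (handled by sparse_flash_attention_tiling.cpp in this list): the host side typically decides how many KV blocks a sequence splits into and how those blocks are spread across cores before the kernel launches. Below is a rough, hypothetical Python illustration of that bookkeeping; the block size, core count, and function name are assumptions, not values taken from this PR.

```python
import math

# Hypothetical sketch of host-side tiling bookkeeping; the real logic is in
# csrc/sparse_flash_attention/op_host/tiling/sparse_flash_attention_tiling.cpp.
def plan_tiling(kv_len: int, block_size: int = 128, num_cores: int = 24):
    """Toy tiling plan: how many KV blocks exist and how many each core processes."""
    num_blocks = math.ceil(kv_len / block_size)           # blocks covering the KV cache
    blocks_per_core = math.ceil(num_blocks / num_cores)   # work assigned per core
    return {"num_blocks": num_blocks, "blocks_per_core": blocks_per_core}

print(plan_tiling(kv_len=8192))  # {'num_blocks': 64, 'blocks_per_core': 3}
```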
/gemini summary
Summary of Changes
This pull request introduces a new Sparse Flash Attention operator tailored for SGLang on NPU, aiming to enhance the efficiency of attention mechanisms for extended sequences by leveraging sparse computation patterns. The changes encompass the full lifecycle of operator integration, from build system updates and Python binding registration to the detailed C++ implementation of host-side logic, tiling, and kernel execution, all supported by new unit tests and comprehensive documentation.
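For orientation, a minimal PyTorch sketch of the semantics such a sparse attention operator computes: each query attends only to an explicitly selected subset of KV positions instead of the full cache. The function name, tensor layout, and shapes below are illustrative assumptions and do not reflect the actual binding or kernel interface added by this PR.

```python
import torch

def sparse_attention_reference(q, k, v, kv_indices, scale):
    """Dense PyTorch reference for single-token sparse attention (illustrative only).

    q:          [num_heads, head_dim]          one query token
    k, v:       [kv_len, num_heads, head_dim]  full KV cache
    kv_indices: [num_selected]                 KV positions this query may attend to
    """
    k_sel = k[kv_indices].float()              # gather the selected keys
    v_sel = v[kv_indices].float()              # gather the selected values
    scores = torch.einsum("hd,shd->hs", q.float(), k_sel) * scale
    probs = torch.softmax(scores, dim=-1)      # softmax over selected positions only
    out = torch.einsum("hs,shd->hd", probs, v_sel)
    return out.to(q.dtype)

# Toy shapes for illustration
num_heads, head_dim, kv_len = 8, 128, 64
q = torch.randn(num_heads, head_dim, dtype=torch.float16)
k = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16)
v = torch.randn(kv_len, num_heads, head_dim, dtype=torch.float16)
kv_indices = torch.tensor([0, 3, 7, 21, 42])   # sparse selection of KV positions
out = sparse_attention_reference(q, k, v, kv_indices, scale=head_dim ** -0.5)
print(out.shape)                               # torch.Size([8, 128])
```

A dense reference of this kind is also the usual golden baseline a unit test compares the NPU kernel output against.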
Add a sparse flash attention operator for SGLang on NPU.
Unit Test
python tests/python/sgl_kernel_npu/test_sparse_flash_attention.py
test results:
Ran 2 tests in 0.970s
OK
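As context for what such a unit test typically asserts, here is a hypothetical, self-contained sketch (not the actual test file, which is tests/python/sgl_kernel_npu/test_sparse_flash_attention.py): it checks that attending over gathered KV entries equals full attention with unselected positions masked to -inf, the usual reference semantics for a sparse attention kernel. Names, shapes, and tolerances are assumptions.

```python
# Hypothetical sketch; the real assertions live in
# tests/python/sgl_kernel_npu/test_sparse_flash_attention.py.
import unittest
import torch

class TestSparseAttentionReference(unittest.TestCase):
    def test_gather_matches_masked_attention(self):
        torch.manual_seed(0)
        num_heads, head_dim, kv_len = 8, 128, 64
        scale = head_dim ** -0.5
        q = torch.randn(num_heads, head_dim)
        k = torch.randn(kv_len, num_heads, head_dim)
        v = torch.randn(kv_len, num_heads, head_dim)
        idx = torch.tensor([0, 3, 7, 21, 42])   # sparse KV selection

        # Path 1: gather the selected KV entries, then attend over them.
        scores_sel = torch.einsum("hd,shd->hs", q, k[idx]) * scale
        out_sel = torch.einsum("hs,shd->hd", torch.softmax(scores_sel, -1), v[idx])

        # Path 2: full attention with -inf on the unselected positions.
        scores_full = torch.einsum("hd,shd->hs", q, k) * scale
        mask = torch.full((kv_len,), float("-inf"))
        mask[idx] = 0.0
        out_full = torch.einsum("hs,shd->hd", torch.softmax(scores_full + mask, -1), v)

        torch.testing.assert_close(out_sel, out_full, rtol=1e-5, atol=1e-5)

if __name__ == "__main__":
    unittest.main()
```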
Development test on SGLang with an A3 device.
Eagle mode:
python test_gms8k.py
start
100/100 [03:36<00:00, 2.16s/it]
Accuracy: 0.980
Invalid: 0.000
Latency: 218.359 s
Output throughput: 40.681 token/s
metrics={'accuracy': 0.98, 'invalid': 0.0, 'latency': 218.35866242006887, 'output_throughput': 40.68077676218437}
metrics['accuracy']=0.98
Graph mode:
python test_gms8k.py
start
100/100 [02:17<00:00, 1.38s/it]
Accuracy: 0.980
Invalid: 0.000
Latency: 138.304 s
Output throughput: 62.175 token/s
metrics={'accuracy': 0.98, 'invalid': 0.0, 'latency': 138.303909559967, 'output_throughput': 62.174670458405025}
metrics['accuracy']=0.98