[Proposal] Kimi-K2 performance enhancement on H20 GPU
Summary
Our recent tests found that the performance of Kimi K2 under TP16 is very poor: in a 3500-input/1500-output scenario, under the SLOs of TTFT < 5 s and TPOT < 50 ms, per-card total throughput reaches only 36 token/s. This plan therefore aims to quickly improve Kimi K2 performance on H20 hardware, fix the bugs found along the way, and document best practices.
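For reference, the SLO criterion above can be written as a small check. This is an illustrative sketch, not SGLang code; the helper name and sample inputs are hypothetical, and only the 36 token/s TP16 figure comes from the measurements above.

```python
# Hypothetical helper: check whether a benchmark run meets the SLOs quoted
# above and report per-card throughput.
def meets_slo(ttft_s: float, tpot_ms: float, total_tokens: int,
              duration_s: float, num_gpus: int = 16) -> bool:
    """SLOs: TTFT < 5 s, TPOT < 50 ms; throughput is reported per card."""
    per_card_tput = total_tokens / duration_s / num_gpus
    print(f"per-card throughput: {per_card_tput:.1f} token/s")
    return ttft_s < 5.0 and tpot_ms < 50.0

# Example with made-up latencies and the 36 token/s per-card figure on TP16.
meets_slo(ttft_s=4.2, tpot_ms=48.0, total_tokens=36 * 16 * 60, duration_s=60.0)
```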
Roadmap
- Kimi K2 fused_moe TP16 Triton config on H20 @artetaout: H20 tune config for Kimi #8047 (config format sketched after this list)
- Kimi K2 fused_moe TP16 Triton config on H20-3e @GaoYusong: perf: add kimi k2 fused_moe tuning config for h30_3e #8021
- Kimi K2 W4A8 on EP mode on H20 or H20-3e @yangsijia-serena: feat: support DeepSeek-R1-W4AFP8 model with ep-moe mode #7762 (W4A8 idea sketched after this list)
- Kimi K2 W4A8 on TP mode on H20 or H20-3e @chenxijun1029: [feat] Support tp mode for DeepSeek-R1-W4AFP8 #8118
- Train the Kimi K2 Eagle3 model @zhangxiaolei123456
- Kimi K2 W4A8 on EP mode with DeepEP support @ayrnb
- Kimi K2 support PD Disaggregation on H20-3e @zhangxiaolei123456
- Bugfix: PD transfer timeout when input length > 8k on H20: [PD Disaggregation] Replace the sync batch transfer with async batch transfer in KVCache transfer #7695 (async pattern sketched after this list)
- Feature: performance enhancement for W4A8 on EP mode.
- Feature: support Flux GroupGEMM and AllReduce fusion.
- Kimi K2 W4A8 on TP mode support PD Disaggregation on H20 @Layssy
- Kimi K2 support PD Disaggregation and large-scale EP on H20 @HanHan009527
- Kimi K2 W4A8 support PD Disaggregation and large-scale EP on H20 (Prefill on TP, Decode on EP) @zhangxiaolei123456
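On the fused-MoE tuning items: the Triton configs are per-token-count kernel launch parameters stored as JSON, keyed by batch size M. A minimal sketch of emitting such a config follows, assuming the block-size keys used by the fused_moe Triton kernel; the values and the file name are placeholders, not the tuned H20 numbers from the linked PRs.

```python
import json

# Illustrative per-M (token count) Triton launch parameters for fused_moe.
# The numbers below are placeholders; the real H20 configs come from the
# tuning work in the PRs referenced above.
config = {
    "1":  {"BLOCK_SIZE_M": 16, "BLOCK_SIZE_N": 64,  "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 1, "num_warps": 4, "num_stages": 3},
    "64": {"BLOCK_SIZE_M": 32, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 128,
           "GROUP_SIZE_M": 8, "num_warps": 8, "num_stages": 4},
}

# File name pattern is illustrative of the convention (E = expert count,
# N = per-partition intermediate size, device name, dtype).
with open("E=384,N=1024,device_name=NVIDIA_H20,dtype=fp8_w8a8.json", "w") as f:
    json.dump(config, f, indent=2)
```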
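On the W4A8 items: the scheme stores weights in INT4 with group-wise scales while activations stay in FP8. A NumPy sketch of the group-wise dequantization idea follows; the packing layout (two nibbles per byte, rows interleaved) and the group size of 128 are assumptions for illustration, not the actual kernel layout.

```python
import numpy as np

def dequant_w4(packed: np.ndarray, scales: np.ndarray, group_size: int = 128):
    """Unpack two signed 4-bit weights per byte and apply per-group scales.

    packed: uint8 array of shape (K // 2, N), two int4 values per byte.
    scales: float array of shape (K // group_size, N).
    """
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend 4-bit values from [0, 15] to [-8, 7].
    lo = np.where(lo >= 8, lo - 16, lo)
    hi = np.where(hi >= 8, hi - 16, hi)
    w = np.empty((packed.shape[0] * 2, packed.shape[1]), dtype=np.float32)
    w[0::2] = lo
    w[1::2] = hi
    # Broadcast each group's scale over its group_size rows.
    return w * np.repeat(scales, group_size, axis=0)

packed = np.random.randint(0, 256, size=(128, 4), dtype=np.uint8)
scales = np.ones((2, 4), dtype=np.float32)  # K = 256, group_size = 128
print(dequant_w4(packed, scales).shape)     # (256, 4)
```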
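On the PD-transfer timeout bugfix: the fix replaces one blocking per-batch transfer with overlapped asynchronous transfers, so long inputs no longer hit the timeout. A minimal asyncio sketch of the pattern; the chunk size and function names are illustrative, not the actual KVCache-transfer code in #7695.

```python
import asyncio

CHUNK_BYTES = 8192  # Illustrative: long payloads are split into chunks.

async def send_chunk(chunk_id: int, data: bytes) -> None:
    # Stand-in for the real RDMA/socket send; sleep simulates transfer time.
    await asyncio.sleep(0.01)
    print(f"chunk {chunk_id}: {len(data)} bytes sent")

async def transfer_kv_cache(kv_bytes: bytes) -> None:
    # Launch all chunk transfers concurrently instead of sending the whole
    # batch in one blocking call, so no single transfer hits the timeout.
    chunks = [kv_bytes[i:i + CHUNK_BYTES]
              for i in range(0, len(kv_bytes), CHUNK_BYTES)]
    await asyncio.gather(*(send_chunk(i, c) for i, c in enumerate(chunks)))

asyncio.run(transfer_kv_cache(b"\x00" * 40000))
```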