
Commit 1241987

Update 2025-09-25-gb200-part-2.md (#210)
1 parent 06ce1c1

1 file changed: +1 −1 lines changed


blog/2025-09-25-gb200-part-2.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ date: "September 25, 2025"
 previewImg: /images/blog/gb200_part_2/primary.png
 ---
 
-The GB200 NVL72 is one of the most powerful hardware for deep learning. In this blog post, we share our progress to optimize the inference performance of DeepSeek V3/R1 with FP8 attention, NVFP4 MoE, large-scale expert parallelism, prefill-decode disaggregation, and various other optimizations. When using FP8 attention and NVFP4 MoE, SGLang achieved 26,156 input and 13,386 output tokens per second per GPU for prefill and decode, respectively, on DeepSeek V3/R1 for 2000-token input sequences, which is a 3.8x and 4.8x speedup compared to [H100 settings](https://lmsys.org/blog/2025-05-05-large-scale-ep/). Even with traditional BF16 attention and FP8 MoE, SGLang still achieves 18,471 input and 9,087 output tokens per second. Reproduction instructions can be found [here](https://github.com/sgl-project/sglang/issues/10903).
+The GB200 NVL72 is one of the most powerful hardware for deep learning. In this blog post, we share our progress after our [previous blog post](https://lmsys.org/blog/2025-06-16-gb200-part-1/) to optimize the inference performance of DeepSeek V3/R1 with FP8 attention, NVFP4 MoE, large-scale expert parallelism, prefill-decode disaggregation, and various other optimizations. When using FP8 attention and NVFP4 MoE, SGLang achieved 26,156 input and 13,386 output tokens per second per GPU for prefill and decode, respectively, on DeepSeek V3/R1 for 2000-token input sequences, which is a 3.8x and 4.8x speedup compared to [H100 settings](https://lmsys.org/blog/2025-05-05-large-scale-ep/). Even with traditional BF16 attention and FP8 MoE, SGLang still achieves 18,471 input and 9,087 output tokens per second. Reproduction instructions can be found [here](https://github.com/sgl-project/sglang/issues/10903).
 
 **Highlights**
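
As a quick sanity check on the throughput claims in the changed paragraph, the stated 3.8x and 4.8x speedups imply rough H100 baselines. The sketch below only rearranges the numbers quoted in the diff; treating the speedups as plain per-GPU throughput ratios against the linked H100 setup is an assumption, not something the commit states.

```python
# Back-of-envelope check of the figures in the changed line above.
# Assumption (not stated in the commit): the 3.8x / 4.8x speedups are
# simple per-GPU throughput ratios against the linked H100 setup.

gb200_prefill_tok_s = 26_156  # input tokens/s/GPU (prefill, FP8 attention + NVFP4 MoE)
gb200_decode_tok_s = 13_386   # output tokens/s/GPU (decode)

implied_h100_prefill = gb200_prefill_tok_s / 3.8  # ~6,900 tokens/s/GPU
implied_h100_decode = gb200_decode_tok_s / 4.8    # ~2,800 tokens/s/GPU

print(f"implied H100 prefill baseline: {implied_h100_prefill:,.0f} tok/s/GPU")
print(f"implied H100 decode baseline:  {implied_h100_decode:,.0f} tok/s/GPU")
```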
