Conversation
@crcrpar crcrpar commented Nov 11, 2025

What does this PR do?

As per the title, this PR enables NVFP4 in benchmark_inference.py using NVFuser's NVFP4 kernels.


On GB200, pjnl-20251113:

nvfp4

$ NVFUSER_ENABLE="id_model(all)" python thunder/benchmarks/benchmark_inference.py --output-length 2 --enable-nvfp4 --mode thunder
...
============================================================
BENCHMARK RESULTS - meta-llama/Llama-4-Maverick-17B-128E thunder
============================================================

Throughput Metrics:
  Overall Throughput: 114.00 tokens/sec
  Prefill Throughput: 211098.56 tokens/sec
  Decode Throughput: 128.33 tokens/sec
  Latency: 10.03 ms/token

Latency Breakdown:
  Time to First Token (TTFT): 12.23 ms
  Time Between Output Tokens (TBOT): 7.82 ms
  Prefill Time: 12.23 ms
  Decode Time: 7.82 ms
  Total Generation Time: 20.05 ms

Memory Usage:
  Current Memory: 14.23 GB
  Peak Memory: 15.19 GB

Variance Analysis:
  Throughput Std Dev: 26.42 ms
  TTFT Std Dev: 26.08 ms

bf16

$ python thunder/benchmarks/benchmark_inference.py --output-length 2 --mode thunder
...
============================================================
BENCHMARK RESULTS - meta-llama/Llama-4-Maverick-17B-128E thunder
============================================================

Throughput Metrics:
  Overall Throughput: 120.80 tokens/sec
  Prefill Throughput: 220544.66 tokens/sec
  Decode Throughput: 137.55 tokens/sec
  Latency: 8.28 ms/token

Latency Breakdown:
  Time to First Token (TTFT): 9.29 ms
  Time Between Output Tokens (TBOT): 7.27 ms
  Prefill Time: 9.29 ms
  Decode Time: 7.27 ms
  Total Generation Time: 16.56 ms

Memory Usage:
  Current Memory: 37.39 GB
  Peak Memory: 38.34 GB

Variance Analysis:
  Throughput Std Dev: 0.19 ms
  TTFT Std Dev: 0.12 ms

cc: @IvanYashchuk

@crcrpar crcrpar requested a review from jjsjann123 November 11, 2025 13:15
@crcrpar crcrpar changed the base branch from crpa/try-nvfuer5230 to main November 14, 2025 07:40
@crcrpar crcrpar force-pushed the still-nvfp4-run-failing branch from 2652cb1 to 0cff702 November 14, 2025 07:41
@crcrpar crcrpar changed the title from "[nvfp4 benchmark_inference] Let TorchDynamo work w/o errors" to "[benchmark_inference] Enable NVFP4 with NVFuser's NVFP4 kernels" Nov 14, 2025
Comment on lines +757 to +744
parser.add_argument(
    "--enable-nvfp4",
    action="store_true",
    help="Enable NVFP4 quantization for MoE GroupedSwiGLU layers (has nvfuser grouped_mm support)",
)
@crcrpar (author):

This seems to require NVFUSER_ENABLE="id_model(all)" at the moment. We might want to set the env var when this option is set.
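
A minimal sketch of how that could look, assuming the parsed namespace is available as args with an enable_nvfp4 attribute and that NVFuser still reads NVFUSER_ENABLE after this point in the program (both are assumptions, not something this PR does):

import os

def maybe_force_id_model(args) -> None:
    # Hypothetical helper: set id_model(all) automatically when --enable-nvfp4 is
    # passed so users don't have to remember the env var. Appending to an existing
    # NVFUSER_ENABLE value is an assumption about how nvfuser combines options.
    if not getattr(args, "enable_nvfp4", False):
        return
    existing = os.environ.get("NVFUSER_ENABLE", "")
    if "id_model(all)" not in existing:
        os.environ["NVFUSER_ENABLE"] = ",".join(filter(None, [existing, "id_model(all)"]))

This would only help if it runs before NVFuser picks the variable up, which is worth verifying before wiring it into the benchmark.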

Collaborator:

linking nvfuser issue: NVIDIA/Fuser#5200

@jjsjann123 jjsjann123 left a comment:

Since we are going to merge this with the nvfuser benchmark, let's merge it as-is and follow up with cleanup in the written-out model.

parser.add_argument(
    "--quantize-linear",
    action="store_true",
    help="[Experimental] Quantize nn.Linear to NVFP4. Note: nvfuser has not yet implemented nvfp4_matmul translator",
)
Collaborator:

I'm getting a hang with --quantize-linear

@crcrpar (author):

let's remove it

@crcrpar (author):

removed

dtype=activation.dtype,
)
for i in range(fp4_weight.size(0)):
    # NOTE: dequantize here doesn't look right, since we have (g, k, n)
Collaborator:

Note that this is not used since we have registered a translation rule for this op in nvfuser, so I don't think we have to bother fixing it for now.

@jjsjann123:

tagging @tbqh

crcrpar and others added 12 commits November 17, 2025 01:34
Signed-off-by: Masaki Kozuki <[email protected]>
…ns in inference benchmark. Enhance `_quantize_llama4` to conditionally quantize linear layers. Update command-line arguments for NVFP4 registration and quantization control. Adjust custom operations to ensure correct tensor shapes and handling.
… Update `_quantize_llama4` to simplify linear layer quantization handling. Modify command-line arguments for NVFP4 to clarify usage and remove deprecated options. Add warnings for experimental features and ensure proper registration of custom ops.
Signed-off-by: Masaki Kozuki <[email protected]>
Signed-off-by: Masaki Kozuki <[email protected]>
Signed-off-by: Masaki Kozuki <[email protected]>
Signed-off-by: Masaki Kozuki <[email protected]>
@crcrpar crcrpar force-pushed the still-nvfp4-run-failing branch from 26a9de1 to 58482e8 November 17, 2025 09:34
# This handles both 2D (tokens, hidden) and 3D (batch, seq_len, hidden) inputs
out_features = fp4_weight.size(2)
output_shape = activation.shape[:-1] + (out_features,)
return torch.empty(output_shape, device=activation.device, dtype=torch.bfloat16)
Collaborator:

Should this function also verify that weight, activation and other relevant tensors are on the same device?
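
A minimal sketch of such a check, written as a standalone helper; calling it from this function with activation and fp4_weight (plus any other relevant tensors) is an assumption about where it would fit:

import torch

def _check_same_device(**tensors: torch.Tensor) -> None:
    # Verify that all named tensors live on the same device and report the
    # offenders by name if they do not.
    devices = {name: t.device for name, t in tensors.items()}
    if len(set(devices.values())) > 1:
        raise ValueError(f"expected all inputs on the same device, got {devices}")

# Hypothetical call site inside the function above:
# _check_same_device(activation=activation, fp4_weight=fp4_weight)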


new_moe.routed_experts.gate_proj.weight.data.copy_(gate_proj_w.transpose(-1, -2))
new_moe.routed_experts.up_proj.weight.data.copy_(up_proj_w.transpose(-1, -2))
new_moe.routed_experts.gate_proj.weight.data.copy_(gate_proj_w)
Collaborator:

I think this would revert the changes from #2659, leading to a perf regression for the BF16 grouped_mm path.

Collaborator:

Thanks a ton for pointing that out!

It's probably a good idea to have a better separation between the bf16 and fp4 code paths, but I could at least put this inside a conditional guarded by dtype.
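
A minimal sketch of that guard; the author suggests keying it off the dtype, so the boolean use_nvfp4 below is only a stand-in, and the destination weights are assumed to be allocated with the matching layout in each branch (the copy targets come from the diff above):

if use_nvfp4:
    # NVFP4 path: keep the untransposed copy introduced in this PR.
    new_moe.routed_experts.gate_proj.weight.data.copy_(gate_proj_w)
    new_moe.routed_experts.up_proj.weight.data.copy_(up_proj_w)
else:
    # BF16 path: keep the transposed copy from #2659 to avoid the grouped_mm perf regression.
    new_moe.routed_experts.gate_proj.weight.data.copy_(gate_proj_w.transpose(-1, -2))
    new_moe.routed_experts.up_proj.weight.data.copy_(up_proj_w.transpose(-1, -2))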

scale_factors[i] = linear_to_swizzled_128_4(cur_scale_factors)

return fp4_weight, scale_factors, global_scales, ab_strides, c_strides
return fp4_weight.transpose(-1, -2), scale_factors, global_scales
Collaborator:

Is it OK to transpose just fp4_weight but not scale_factors, since the scale_factors were calculated before the transpose (or maybe the downstream code accounts for this)?

Collaborator:

Yes. The reason is that accesses through those pointers are computed by the kernel by hand, so the stride here is really just used for validation.

The requirement is that both the weight and the scale factor have the k dimension as the fastest-varying one, which is what the quantization function produces.
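
For illustration, the metadata-only nature of transpose can be checked directly; the shapes below are hypothetical and only show that the bytes in memory (and therefore the fastest-varying dimension) are untouched:

import torch

w = torch.arange(2 * 3 * 4, dtype=torch.uint8).reshape(2, 3, 4)  # last dim has stride 1
wt = w.transpose(-1, -2)  # logically (2, 4, 3); no data movement

print(w.stride(), wt.stride())        # (12, 4, 1) vs. (12, 1, 4)
print(w.data_ptr() == wt.data_ptr())  # True: same storage, only strides differ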
