[benchmark_inference] Enable NVFP4 with NVFuser's NVFP4 kernels #2725
base: main
Conversation
Force-pushed from 2652cb1 to 0cff702
parser.add_argument(
    "--enable-nvfp4",
    action="store_true",
    help="Enable NVFP4 quantization for MoE GroupedSwiGLU layers (has nvfuser grouped_mm support)",
)
This seems to require NVFUSER_ENABLE="id_model(all)" at the moment. We might want to set the env var when this option is set.
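For illustration, a minimal sketch of how the flag could imply the env var (the argparse wiring and placement here are assumptions, not code from this PR):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--enable-nvfp4", action="store_true")
args = parser.parse_args()

if args.enable_nvfp4:
    # Assumption: this needs to run before nvfuser is initialized for the
    # setting to take effect; setdefault keeps any user-provided value.
    os.environ.setdefault("NVFUSER_ENABLE", "id_model(all)")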
linking nvfuser issue: NVIDIA/Fuser#5200
jjsjann123 left a comment
Since we are going to merge this with the nvfuser benchmark, let's merge it as-is and follow up with cleanup in the written-out model.
parser.add_argument(
    "--quantize-linear",
    action="store_true",
    help="[Experimental] Quantize nn.Linear to NVFP4. Note: nvfuser has not yet implemented nvfp4_matmul translator",
I'm getting a hang with --quantize-linear
let's remove it
removed
    dtype=activation.dtype,
)
for i in range(fp4_weight.size(0)):
    # NOTE: dequantize here doesn't look right, since we have (g, k, n)
Note this is not used since we have a registered translation rule for this op in nvfuser, so I don't think we need to bother fixing it for now.
Tagging @tbqh
Signed-off-by: Masaki Kozuki <[email protected]>
…ns in inference benchmark. Enhance `_quantize_llama4` to conditionally quantize linear layers. Update command-line arguments for NVFP4 registration and quantization control. Adjust custom operations to ensure correct tensor shapes and handling.
… Update `_quantize_llama4` to simplify linear layer quantization handling. Modify command-line arguments for NVFP4 to clarify usage and remove deprecated options. Add warnings for experimental features and ensure proper registration of custom ops.
Force-pushed from 26a9de1 to 58482e8
# This handles both 2D (tokens, hidden) and 3D (batch, seq_len, hidden) inputs
out_features = fp4_weight.size(2)
output_shape = activation.shape[:-1] + (out_features,)
return torch.empty(output_shape, device=activation.device, dtype=torch.bfloat16)
Should this function also verify that weight, activation and other relevant tensors are on the same device?
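For reference, a hedged sketch of such a check (the helper name and signature are made up; the tensor names follow the snippet above):

def _check_same_device(activation, fp4_weight, *others):
    # Collect the devices of all provided tensors and require a single device
    # before computing the output shape.
    devices = {t.device for t in (activation, fp4_weight, *others) if t is not None}
    if len(devices) != 1:
        raise ValueError(f"Expected all tensors on the same device, got {sorted(map(str, devices))}")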
new_moe.routed_experts.gate_proj.weight.data.copy_(gate_proj_w.transpose(-1, -2))
new_moe.routed_experts.up_proj.weight.data.copy_(up_proj_w.transpose(-1, -2))
new_moe.routed_experts.gate_proj.weight.data.copy_(gate_proj_w)
I think this would revert the changes from #2659, leading to a perf regression for the BF16 grouped_mm path.
Thanks a ton for pointing that out!
It's probably a good idea to have a better separation between the bf16 and fp4 code paths, but I could at least put this inside a conditional guarded by dtype.
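A rough sketch of what that conditional could look like, assuming a hypothetical quantize_to_nvfp4 flag derived from --enable-nvfp4 (not code from this PR):

# Sketch: keep the pre-transposed (k-fastest) layout from #2659 for the BF16
# grouped_mm path, and skip the transpose only when quantizing to NVFP4.
if quantize_to_nvfp4:  # hypothetical flag
    new_moe.routed_experts.gate_proj.weight.data.copy_(gate_proj_w)
    new_moe.routed_experts.up_proj.weight.data.copy_(up_proj_w)
else:
    new_moe.routed_experts.gate_proj.weight.data.copy_(gate_proj_w.transpose(-1, -2))
    new_moe.routed_experts.up_proj.weight.data.copy_(up_proj_w.transpose(-1, -2))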
    scale_factors[i] = linear_to_swizzled_128_4(cur_scale_factors)

return fp4_weight, scale_factors, global_scales, ab_strides, c_strides
return fp4_weight.transpose(-1, -2), scale_factors, global_scales
Is it ok to transpose just fp4_weight but not scale_factors as the scale_factors were calculated before the transpose (or maybe the downstream code accounts for this)?
Yes. The accesses through those pointers are computed by hand in the kernel, so the stride here is really just used for validation.
The requirement is that both the weight and the scale factor have the k dimension as the fastest-varying one, which is what the quantization function produces.
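A small illustration of that layout point (the shapes and the uint8 stand-in for packed FP4 are assumptions):

import torch

g, n, k = 2, 8, 16
fp4_weight = torch.empty(g, n, k, dtype=torch.uint8)  # stand-in for packed FP4; k is contiguous

w_t = fp4_weight.transpose(-1, -2)  # logical shape (g, k, n)
assert w_t.shape == (g, k, n)
assert w_t.stride(-2) == 1  # k still has stride 1, i.e. it remains the fastest dimension

The transpose only changes the logical shape and strides; the underlying memory keeps k as the stride-1 dimension, matching what the quantization function produced for both the weight and the scale factors.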
What does this PR do?
As per title, this PR enables NVFP4 in benchmark_inference.py using NVFuser's nvfp4 kernels.
Benchmark results on GB200 (pjnl-20251113): nvfp4 and bf16.
cc: @IvanYashchuk