Dev i #4
base: main
Conversation
Signed-off-by: yiliu30 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Mxfp8 reland
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on enhancing the quantization capabilities and examples within the llmcompressor library. It introduces a new example for quantizing DeepSeek MoE models, refines the existing Llama3 quantization example for greater flexibility, and significantly improves the underlying logic for handling and inferring various quantization formats, particularly for FP4 and MXFP4 schemes. These changes aim to broaden the applicability and robustness of the quantization process.
Highlights
- Updated Llama3 Quantization Example: The `llama3_example.py` script has been updated to allow more flexible model ID and quantization scheme selection, including local model paths and MXFP4/NVFP4 schemes (a minimal usage sketch follows this list). The number of calibration samples was also reduced for quicker testing.
- New DeepSeek MoE Quantization Example: A new example script, `deepseek_moe_w4a4_nvfp4.py`, has been added to demonstrate 4-bit quantization (NVFP4) for DeepSeek Mixture of Experts (MoE) models, including dataset preparation and model saving.
- Enhanced Quantization Calibration Logic: Modifications in `calibration.py` improve the handling of global scales for MX quantization schemes and refine the conditions for calculating activation quantization parameters based on FP4 schemes.
- Refined Quantization Format Inference: The logic for inferring and setting per-module quantization formats in `quantization_format.py` has been significantly refactored to support various packed and unpacked quantization formats (e.g., NVFP4, MXFP4, Marlin 2:4, int4/8, float8) based on the specific quantization arguments.
- Improved Quantization Helper Utilities: Helper functions in `helpers.py` now include checks for FP4 quantization schemes when validating tensor group quantization, ensuring compatibility with new quantization formats.
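Both example scripts described above follow the standard llm-compressor oneshot flow. The sketch below is not taken from this PR; it is a minimal illustration under a few assumptions: the import paths match recent llm-compressor releases, the preset scheme name ("NVFP4" here, "MXFP4" once this PR lands) is registered, and the model ID and calibration dataset are placeholders.

```python
# Minimal sketch (not from this PR): apply a 4-bit FP scheme with llm-compressor.
# Assumptions: import paths match recent llm-compressor releases, and the preset
# scheme name ("NVFP4" here, "MXFP4" once this PR lands) is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # public Hub ID (assumption)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers except the output head.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",  # small calibration set used here for illustration
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=64,
)

# Save in compressed form next to the working directory.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```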
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request adds support for new quantization formats, including MXFP4, and provides corresponding examples. While the core logic changes are a good step forward, the example scripts contain hardcoded, user-specific paths and leftover debugging code that must be removed to make them usable. Additionally, there are several issues in the library code, such as messy imports, a potential IndexError, and a return type mismatch, which impact maintainability and correctness. These issues should be addressed before merging.
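On the IndexError and return-type points raised above, a minimal guard sketch is shown below. It is an illustration only, not this PR's implementation: the function name and the placeholder format string are assumptions, and the real code works with compressed-tensors' CompressionFormat enum rather than a literal string.

```python
from typing import List, Optional


def infer_weight_format(weight_args: list) -> Optional[List[str]]:
    # Illustration only (not this PR's code): guard the list before indexing it,
    # avoiding the IndexError risk noted in the review, and return List[str] so
    # the value matches an Optional[List[str]] signature.
    if not weight_args:
        return None
    first = weight_args[0]  # safe after the emptiness check
    if first.num_bits == 4 and first.type == "float":
        return ["nvfp4-pack-quantized"]  # placeholder format string (assumption)
    return None
```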
MODEL_ID = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/"
# MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
scheme_name = "NVFP4"
scheme_name = "MXFP4"
# scheme_name = "MXFP8"
# scheme_name = "FP8"

SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + f"-{scheme_name}"
SAVE_DIR = f"/data5/yliu7/HF_HOME/{SAVE_DIR}"
print(f"Saving to {SAVE_DIR}")
| MODEL_ID = "deepseek-ai/DeepSeek-V2.5" | ||
| MODEL_ID = "/data0/deepseek-ai/DeepSeek-V2-Lite" | ||
| MODEL_ID = "/data0/deepseek-ai/DeepSeek-R1" | ||
| MODEL_ID = "/data1/DeepSeek-R1-bf16" |
This example script hardcodes multiple MODEL_IDs, including local, user-specific paths. Please use a single, public model ID from the Hugging Face Hub to ensure the example is runnable by others.
| MODEL_ID = "deepseek-ai/DeepSeek-V2.5" | |
| MODEL_ID = "/data0/deepseek-ai/DeepSeek-V2-Lite" | |
| MODEL_ID = "/data0/deepseek-ai/DeepSeek-R1" | |
| MODEL_ID = "/data1/DeepSeek-R1-bf16" | |
| MODEL_ID = "deepseek-ai/DeepSeek-V2-Lite" |
from compressed_tensors.quantization.utils import (
    is_fp4,
    is_kv_cache_quant_scheme,
    is_mx,
    is_mxfp4,
)
from compressed_tensors.utils import align_module_device, update_parameter_data
from compressed_tensors.quantization.utils import is_kv_cache_quant_scheme
from compressed_tensors.utils import align_module_device, update_offload_parameter
There are duplicate imports in this block. is_kv_cache_quant_scheme and align_module_device are imported twice. Additionally, update_parameter_data is imported but does not appear to be used in this file. Please consolidate and clean up the imports to improve code clarity and maintainability.
Suggested change:
- from compressed_tensors.quantization.utils import (
-     is_fp4,
-     is_kv_cache_quant_scheme,
-     is_mx,
-     is_mxfp4,
- )
- from compressed_tensors.utils import align_module_device, update_parameter_data
- from compressed_tensors.quantization.utils import is_kv_cache_quant_scheme
- from compressed_tensors.utils import align_module_device, update_offload_parameter
+ from compressed_tensors.quantization.utils import (
+     is_fp4,
+     is_kv_cache_quant_scheme,
+     is_mx,
+     is_mxfp4,
+ )
+ from compressed_tensors.utils import align_module_device, update_offload_parameter
if (
    weight_args[0].num_bits == 4
    and weight_args[0].type == QuantizationType.FLOAT.value
):
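The condition in this hunk checks whether the first weight-quantization argument set describes a 4-bit float (FP4) scheme, which the `is_fp4` utility imported earlier in this diff appears intended to encapsulate. The standalone check below is purely illustrative; the dataclass and enum are hypothetical stand-ins for compressed-tensors' QuantizationArgs and QuantizationType, carrying only the two fields the hunk inspects.

```python
from dataclasses import dataclass
from enum import Enum


class QuantizationType(str, Enum):
    # Hypothetical stand-in for compressed_tensors' QuantizationType enum.
    INT = "int"
    FLOAT = "float"


@dataclass
class WeightArgs:
    # Hypothetical stand-in with only the two fields the hunk above inspects.
    num_bits: int
    type: str


def looks_like_fp4(args: WeightArgs) -> bool:
    # Same condition as the hunk above: 4 bits and a float quantization type.
    return args.num_bits == 4 and args.type == QuantizationType.FLOAT.value


print(looks_like_fp4(WeightArgs(num_bits=4, type="float")))  # True
print(looks_like_fp4(WeightArgs(num_bits=8, type="float")))  # False
```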
if weight_args[0].is_mx:
    return CompressionFormat.mxfp4_pack_quantized
else:
    return CompressionFormat.nvfp4_pack_quantized
The function's return type is Optional[List[str]], but this returns a CompressionFormat enum member, which is a type mismatch. All new return statements in this block have the same issue. The return value should be a list containing the string value of the enum.
if weight_args[0].is_mx:
    return [CompressionFormat.mxfp4_pack_quantized.value]
else:
    return [CompressionFormat.nvfp4_pack_quantized.value]

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
# # Run the model on vLLM
# try:
#     from vllm import LLM, SamplingParams
#
#     vllm_installed = True
# except ImportError:
#     vllm_installed = False
#
# if vllm_installed:
#     print("vLLM installed, running using vLLM")
#     sampling_params = SamplingParams(temperature=0.80, top_p=0.95)
#     llm = LLM(
#         model=SAVE_DIR,
#         tensor_parallel_size=2,
#         trust_remote_code=True,
#         max_model_len=1042,
#         dtype=torch.half,
#     )
#     prompts = [
#         "The capital of France is",
#         "The president of the US is",
#         "My name is",
#     ]
#
#     outputs = llm.generate(prompts, sampling_params)
#     print("================= vLLM GENERATION ======================")
#     for output in outputs:
#         assert output
#         prompt = output.prompt
#         generated_text = output.outputs[0].text
#         print("PROMPT", prompt)
#         print("GENERATED TEXT", generated_text)
Signed-off-by: yiliu30 <[email protected]>
SUMMARY:
"please provide a brief summary"
TEST PLAN:
"please outline how the changes were tested"