
Conversation

yiliu30 (Owner) commented Sep 3, 2025

SUMMARY:
"please provide a brief summary"

TEST PLAN:
"please outline how the changes were tested"

yiliu30 and others added 12 commits June 25, 2025 22:30
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: yiliu30 <[email protected]>
Signed-off-by: yiliu30 <[email protected]>

github-actions bot commented Sep 3, 2025

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please add the label only once the PR is code complete and local testing has been performed.


gemini-code-assist bot left a comment


Summary of Changes

Hello @yiliu30, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on enhancing the quantization capabilities and examples within the llmcompressor library. It introduces a new example for quantizing DeepSeek MoE models, refines the existing Llama3 quantization example for greater flexibility, and significantly improves the underlying logic for handling and inferring various quantization formats, particularly for FP4 and MXFP4 schemes. These changes aim to broaden the applicability and robustness of the quantization process.

Highlights

  • Updated Llama3 Quantization Example: The llama3_example.py script has been updated to allow for more flexible model ID and quantization scheme selection, including local model paths and MXFP4/NVFP4 schemes. The number of calibration samples was also reduced for quicker testing.
  • New DeepSeek MoE Quantization Example: A new example script deepseek_moe_w4a4_nvfp4.py has been added to demonstrate 4-bit quantization (NVFP4) for DeepSeek Mixture of Experts (MoE) models, including dataset preparation and model saving.
  • Enhanced Quantization Calibration Logic: Modifications in calibration.py improve the handling of global scales for MX quantization schemes and refine the conditions for calculating activation quantization parameters based on FP4 schemes.
  • Refined Quantization Format Inference: The logic for inferring and setting per-module quantization formats in quantization_format.py has been significantly refactored to support various packed and unpacked quantization formats (e.g., NVFP4, MXFP4, Marlin 2:4, int4/8, float8) based on the specific quantization arguments. An illustrative sketch follows this list.
  • Improved Quantization Helper Utilities: Helper functions in helpers.py now include checks for FP4 quantization schemes when validating tensor group quantization, ensuring compatibility with new quantization formats.
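
To make the format-inference refactor above concrete, here is a minimal, self-contained sketch. It is an illustration only, not the PR's code: infer_per_module_format, WeightArgs, and the plain-string format labels are hypothetical stand-ins for the real CompressionFormat members used by compressed_tensors.

from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WeightArgs:
    # Hypothetical stand-in for a module's weight quantization arguments.
    num_bits: int
    type: str            # "float" or "int"
    is_mx: bool = False  # True for MX block formats such as MXFP4


def infer_per_module_format(weight_args: Optional[WeightArgs]) -> List[str]:
    # Illustrative dispatch on the weight args; the real code returns
    # CompressionFormat values from compressed_tensors.
    if weight_args is None:
        return ["dense"]
    if weight_args.num_bits == 4 and weight_args.type == "float":
        # FP4 weights: choose between the MX and NV packed 4-bit formats.
        return ["mxfp4_pack_quantized" if weight_args.is_mx else "nvfp4_pack_quantized"]
    if weight_args.type == "int" and weight_args.num_bits in (4, 8):
        return ["pack_quantized"]
    if weight_args.type == "float" and weight_args.num_bits == 8:
        return ["float_quantized"]
    return ["dense"]


# Example: an MXFP4 weight scheme maps to the packed MXFP4 format.
print(infer_per_module_format(WeightArgs(num_bits=4, type="float", is_mx=True)))
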
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature              | Command             | Description
-------------------- | ------------------- | ------------------------------------------------------------------------------------------
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in sharing feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  [1] Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.


gemini-code-assist bot left a comment


Code Review

This pull request adds support for new quantization formats, including MXFP4, and provides corresponding examples. While the core logic changes are a good step forward, the example scripts contain hardcoded, user-specific paths and leftover debugging code that must be removed to make them usable. Additionally, there are several issues in the library code, such as messy imports, a potential IndexError, and a return type mismatch, which impact maintainability and correctness. These issues should be addressed before merging.

Comment on lines +9 to +18
MODEL_ID = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/"
# MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
scheme_name = "NVFP4"
scheme_name = "MXFP4"
# scheme_name = "MXFP8"
# scheme_name = "FP8"

SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + f"-{scheme_name}"
SAVE_DIR = f"/data5/yliu7/HF_HOME/{SAVE_DIR}"
print(f"Saving to {SAVE_DIR}")


critical

This block contains hardcoded, user-specific paths, multiple re-definitions of variables, and a debug print statement. This appears to be temporary development code that should be removed before merging, as it makes the example non-runnable for other users.
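
For reference, a cleaned-up header might look like the sketch below. This is a hedged illustration, not part of the PR: the QUANT_SCHEME environment variable is invented for the example, while MODEL_ID, scheme_name, and SAVE_DIR mirror the names already used in the script.

import os

# Public model ID so the example is runnable for any user.
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"

# Select the scheme via an environment variable (e.g. NVFP4 or MXFP4).
scheme_name = os.environ.get("QUANT_SCHEME", "NVFP4")

# Derive the output directory from the model name instead of a user-specific path.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + f"-{scheme_name}"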

Comment on lines +14 to +17
MODEL_ID = "deepseek-ai/DeepSeek-V2.5"
MODEL_ID = "/data0/deepseek-ai/DeepSeek-V2-Lite"
MODEL_ID = "/data0/deepseek-ai/DeepSeek-R1"
MODEL_ID = "/data1/DeepSeek-R1-bf16"


critical

This example script hardcodes multiple MODEL_IDs, including local, user-specific paths. Please use a single, public model ID from the Hugging Face Hub to ensure the example is runnable by others.

Suggested change
- MODEL_ID = "deepseek-ai/DeepSeek-V2.5"
- MODEL_ID = "/data0/deepseek-ai/DeepSeek-V2-Lite"
- MODEL_ID = "/data0/deepseek-ai/DeepSeek-R1"
- MODEL_ID = "/data1/DeepSeek-R1-bf16"
+ MODEL_ID = "deepseek-ai/DeepSeek-V2-Lite"

Comment on lines +12 to 20
from compressed_tensors.quantization.utils import (
    is_fp4,
    is_kv_cache_quant_scheme,
    is_mx,
    is_mxfp4,
)
from compressed_tensors.utils import align_module_device, update_parameter_data
from compressed_tensors.quantization.utils import is_kv_cache_quant_scheme
from compressed_tensors.utils import align_module_device, update_offload_parameter


high

There are duplicate imports in this block. is_kv_cache_quant_scheme and align_module_device are imported twice. Additionally, update_parameter_data is imported but does not appear to be used in this file. Please consolidate and clean up the imports to improve code clarity and maintainability.

Suggested change
- from compressed_tensors.quantization.utils import (
-     is_fp4,
-     is_kv_cache_quant_scheme,
-     is_mx,
-     is_mxfp4,
- )
- from compressed_tensors.utils import align_module_device, update_parameter_data
- from compressed_tensors.quantization.utils import is_kv_cache_quant_scheme
- from compressed_tensors.utils import align_module_device, update_offload_parameter
+ from compressed_tensors.quantization.utils import (
+     is_fp4,
+     is_kv_cache_quant_scheme,
+     is_mx,
+     is_mxfp4,
+ )
+ from compressed_tensors.utils import align_module_device, update_offload_parameter

Comment on lines 88 to 91
        if (
            weight_args[0].num_bits == 4
            and weight_args[0].type == QuantizationType.FLOAT.value
        ):


high

There's a potential IndexError here if weight_args is empty. Please add a check before accessing weight_args[0].

        if (
            weight_args
            and weight_args[0].num_bits == 4
            and weight_args[0].type == QuantizationType.FLOAT.value
        ):

Comment on lines 92 to 95
            if weight_args[0].is_mx:
                return CompressionFormat.mxfp4_pack_quantized
            else:
                return CompressionFormat.nvfp4_pack_quantized


high

The function's return type is Optional[List[str]], but this returns a CompressionFormat enum member, which is a type mismatch. All new return statements in this block have the same issue. The return value should be a list containing the string value of the enum.

            if weight_args[0].is_mx:
                return [CompressionFormat.mxfp4_pack_quantized.value]
            else:
                return [CompressionFormat.nvfp4_pack_quantized.value]

Comment on lines +66 to +67
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation


medium

These imports should be moved to the top of the file to follow PEP 8 guidelines.
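
For illustration, the two imports would simply join the other top-level imports of the script, as sketched below; the surrounding imports are assumed typical of llm-compressor examples rather than taken from this PR.

# Top of deepseek_moe_w4a4_nvfp4.py (illustrative placement only).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation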

Comment on lines +102 to +133
# # Run the model on vLLM
# try:
#     from vllm import LLM, SamplingParams

#     vllm_installed = True
# except ImportError:
#     vllm_installed = False

# if vllm_installed:
#     print("vLLM installed, running using vLLM")
#     sampling_params = SamplingParams(temperature=0.80, top_p=0.95)
#     llm = LLM(
#         model=SAVE_DIR,
#         tensor_parallel_size=2,
#         trust_remote_code=True,
#         max_model_len=1042,
#         dtype=torch.half,
#     )
#     prompts = [
#         "The capital of France is",
#         "The president of the US is",
#         "My name is",
#     ]

#     outputs = llm.generate(prompts, sampling_params)
#     print("================= vLLM GENERATION ======================")
#     for output in outputs:
#         assert output
#         prompt = output.prompt
#         generated_text = output.outputs[0].text
#         print("PROMPT", prompt)
#         print("GENERATED TEXT", generated_text)


medium

This large block of commented-out code appears to be leftover from development. It should be removed to keep the example clean and focused.
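
If a quick vLLM sanity check is still wanted, one alternative (a hedged sketch, not part of this PR) is to gate it behind an opt-in environment variable instead of leaving commented-out code. The vLLM calls mirror the commented-out block above; RUN_VLLM is an invented flag, and SAVE_DIR is the output directory produced earlier in the example.

import os

# Optional vLLM smoke test, run only when RUN_VLLM=1 is set.
if os.environ.get("RUN_VLLM") == "1":
    from vllm import LLM, SamplingParams

    # SAVE_DIR is the compressed-model directory produced earlier in the example.
    llm = LLM(model=SAVE_DIR, trust_remote_code=True, max_model_len=1042)
    sampling_params = SamplingParams(temperature=0.80, top_p=0.95)
    outputs = llm.generate(["The capital of France is"], sampling_params)
    for output in outputs:
        print("PROMPT", output.prompt)
        print("GENERATED TEXT", output.outputs[0].text)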

Signed-off-by: yiliu30 <[email protected]>