Dev i #4
base: main
Conversation
Signed-off-by: yiliu30 <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Mxfp8 reland
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Summary of Changes
Hello @yiliu30, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on enhancing the quantization capabilities and examples within the llmcompressor library. It introduces a new example for quantizing DeepSeek MoE models, refines the existing Llama3 quantization example for greater flexibility, and significantly improves the underlying logic for handling and inferring various quantization formats, particularly for FP4 and MXFP4 schemes. These changes aim to broaden the applicability and robustness of the quantization process.
Highlights
- Updated Llama3 Quantization Example: The `llama3_example.py` script has been updated to allow more flexible model ID and quantization scheme selection, including local model paths and MXFP4/NVFP4 schemes (a minimal usage sketch follows this list). The number of calibration samples was also reduced for quicker testing.
- New DeepSeek MoE Quantization Example: A new example script, `deepseek_moe_w4a4_nvfp4.py`, has been added to demonstrate 4-bit quantization (NVFP4) for DeepSeek Mixture of Experts (MoE) models, including dataset preparation and model saving.
- Enhanced Quantization Calibration Logic: Modifications in `calibration.py` improve the handling of global scales for MX quantization schemes and refine the conditions for calculating activation quantization parameters based on FP4 schemes.
- Refined Quantization Format Inference: The logic for inferring and setting per-module quantization formats in `quantization_format.py` has been significantly refactored to support various packed and unpacked quantization formats (e.g., NVFP4, MXFP4, Marlin 2:4, int4/8, float8) based on the specific quantization arguments.
- Improved Quantization Helper Utilities: Helper functions in `helpers.py` now include checks for FP4 quantization schemes when validating tensor group quantization, ensuring compatibility with new quantization formats.
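Both example scripts described above follow the standard llm-compressor oneshot flow. The sketch below is not taken from this PR; it is a minimal illustration under a few assumptions: the import paths match recent llm-compressor releases, the preset scheme name ("NVFP4" here, "MXFP4" once this PR lands) is registered, and the model ID and calibration dataset are placeholders.

```python
# Minimal sketch (not from this PR): apply a 4-bit FP scheme with llm-compressor.
# Assumptions: import paths match recent llm-compressor releases, and the preset
# scheme name ("NVFP4" here, "MXFP4" once this PR lands) is available.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"  # public Hub ID (assumption)

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers except the output head.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",  # small calibration set used here for illustration
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=64,
)

# Save in compressed form next to the working directory.
SAVE_DIR = MODEL_ID.split("/")[-1] + "-NVFP4"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```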
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request adds support for new quantization formats, including MXFP4, and provides corresponding examples. While the core logic changes are a good step forward, the example scripts contain hardcoded, user-specific paths and leftover debugging code that must be removed to make them usable. Additionally, there are several issues in the library code, such as messy imports, a potential IndexError, and a return type mismatch, which impact maintainability and correctness. These issues should be addressed before merging.
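On the IndexError and return-type points raised above, a minimal guard sketch is shown below. It is an illustration only, not this PR's implementation: the function name and the placeholder format string are assumptions, and the real code works with compressed-tensors' CompressionFormat enum rather than a literal string.

```python
from typing import List, Optional


def infer_weight_format(weight_args: list) -> Optional[List[str]]:
    # Illustration only (not this PR's code): guard the list before indexing it,
    # avoiding the IndexError risk noted in the review, and return List[str] so
    # the value matches an Optional[List[str]] signature.
    if not weight_args:
        return None
    first = weight_args[0]  # safe after the emptiness check
    if first.num_bits == 4 and first.type == "float":
        return ["nvfp4-pack-quantized"]  # placeholder format string (assumption)
    return None
```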
MODEL_ID = "/data5/yliu7/HF_HOME/meta-llama/Llama-3.2-1B-Instruct/"
# MODEL_ID = "meta-llama/Llama-3.3-70B-Instruct"
scheme_name = "NVFP4"
scheme_name = "MXFP4"
# scheme_name = "MXFP8"
# scheme_name = "FP8"

SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + f"-{scheme_name}"
SAVE_DIR = f"/data5/yliu7/HF_HOME/{SAVE_DIR}"
print(f"Saving to {SAVE_DIR}")
| MODEL_ID = "deepseek-ai/DeepSeek-V2.5" | ||
| MODEL_ID = "/data0/deepseek-ai/DeepSeek-V2-Lite" | ||
| MODEL_ID = "/data0/deepseek-ai/DeepSeek-R1" | ||
| MODEL_ID = "/data1/DeepSeek-R1-bf16" |
This example script hardcodes multiple MODEL_IDs, including local, user-specific paths. Please use a single, public model ID from the Hugging Face Hub to ensure the example is runnable by others.
| MODEL_ID = "deepseek-ai/DeepSeek-V2.5" | |
| MODEL_ID = "/data0/deepseek-ai/DeepSeek-V2-Lite" | |
| MODEL_ID = "/data0/deepseek-ai/DeepSeek-R1" | |
| MODEL_ID = "/data1/DeepSeek-R1-bf16" | |
| MODEL_ID = "deepseek-ai/DeepSeek-V2-Lite" |
from compressed_tensors.quantization.utils import (
    is_fp4,
    is_kv_cache_quant_scheme,
    is_mx,
    is_mxfp4,
)
from compressed_tensors.utils import align_module_device, update_parameter_data
from compressed_tensors.quantization.utils import is_kv_cache_quant_scheme
from compressed_tensors.utils import align_module_device, update_offload_parameter
There are duplicate imports in this block. is_kv_cache_quant_scheme and align_module_device are imported twice. Additionally, update_parameter_data is imported but does not appear to be used in this file. Please consolidate and clean up the imports to improve code clarity and maintainability.
Suggested change:
- from compressed_tensors.quantization.utils import (
-     is_fp4,
-     is_kv_cache_quant_scheme,
-     is_mx,
-     is_mxfp4,
- )
- from compressed_tensors.utils import align_module_device, update_parameter_data
- from compressed_tensors.quantization.utils import is_kv_cache_quant_scheme
- from compressed_tensors.utils import align_module_device, update_offload_parameter
+ from compressed_tensors.quantization.utils import (
+     is_fp4,
+     is_kv_cache_quant_scheme,
+     is_mx,
+     is_mxfp4,
+ )
+ from compressed_tensors.utils import align_module_device, update_offload_parameter
if (
    weight_args[0].num_bits == 4
    and weight_args[0].type == QuantizationType.FLOAT.value
):
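The condition in this hunk checks whether the first weight-quantization argument set describes a 4-bit float (FP4) scheme, which the `is_fp4` utility imported earlier in this diff appears intended to encapsulate. The standalone check below is purely illustrative; the dataclass and enum are hypothetical stand-ins for compressed-tensors' QuantizationArgs and QuantizationType, carrying only the two fields the hunk inspects.

```python
from dataclasses import dataclass
from enum import Enum


class QuantizationType(str, Enum):
    # Hypothetical stand-in for compressed_tensors' QuantizationType enum.
    INT = "int"
    FLOAT = "float"


@dataclass
class WeightArgs:
    # Hypothetical stand-in with only the two fields the hunk above inspects.
    num_bits: int
    type: str


def looks_like_fp4(args: WeightArgs) -> bool:
    # Same condition as the hunk above: 4 bits and a float quantization type.
    return args.num_bits == 4 and args.type == QuantizationType.FLOAT.value


print(looks_like_fp4(WeightArgs(num_bits=4, type="float")))  # True
print(looks_like_fp4(WeightArgs(num_bits=8, type="float")))  # False
```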
if weight_args[0].is_mx:
    return CompressionFormat.mxfp4_pack_quantized
else:
    return CompressionFormat.nvfp4_pack_quantized
The function's return type is Optional[List[str]], but this returns a CompressionFormat enum member, which is a type mismatch. All new return statements in this block have the same issue. The return value should be a list containing the string value of the enum.
if weight_args[0].is_mx:
    return [CompressionFormat.mxfp4_pack_quantized.value]
else:
    return [CompressionFormat.nvfp4_pack_quantized.value]

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
# # Run the model on vLLM
# try:
#     from vllm import LLM, SamplingParams
#
#     vllm_installed = True
# except ImportError:
#     vllm_installed = False
#
# if vllm_installed:
#     print("vLLM installed, running using vLLM")
#     sampling_params = SamplingParams(temperature=0.80, top_p=0.95)
#     llm = LLM(
#         model=SAVE_DIR,
#         tensor_parallel_size=2,
#         trust_remote_code=True,
#         max_model_len=1042,
#         dtype=torch.half,
#     )
#     prompts = [
#         "The capital of France is",
#         "The president of the US is",
#         "My name is",
#     ]
#
#     outputs = llm.generate(prompts, sampling_params)
#     print("================= vLLM GENERATION ======================")
#     for output in outputs:
#         assert output
#         prompt = output.prompt
#         generated_text = output.outputs[0].text
#         print("PROMPT", prompt)
#         print("GENERATED TEXT", generated_text)
Signed-off-by: yiliu30 <[email protected]>
SUMMARY:
"please provide a brief summary"
TEST PLAN:
"please outline how the changes were tested"