
[Gemma4] Add MoE Calibration Support#2565

Open
dsikka wants to merge 9 commits into main from gemma4-support

Conversation

@dsikka
Collaborator

@dsikka dsikka commented Apr 5, 2026

SUMMARY:

Testing:

  • Generated Checkpoints:
  1. https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic
  2. https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-NVFP4

Summary by CodeRabbit

  • New Features

    • Added support for per-expert calibration and execution for Gemma4 MoE models, enabling unpacked expert-level calibration and quantization workflows.
    • Exposed tooling to save quantized Gemma4 models and their processors after calibration.
  • Documentation

    • Added example scripts demonstrating NVFP4 (W4A4) and FP8 dynamic (W8A8) quantization workflows, calibration dataset preprocessing, and model export.

@github-actions

github-actions bot commented Apr 5, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Gemma 4 quantization, featuring a new example script and the SequentialGemma4TextExperts module to unpack 3D expert weights for calibration and vLLM compatibility. Feedback identifies a typo in the example script, recommends more descriptive error handling for batch sizes in the data collator, and suggests using torch.no_grad() with copy_() for safer parameter updates.

@dsikka dsikka added the ready When a PR is ready for review label Apr 6, 2026
@dsikka dsikka marked this pull request as ready for review April 6, 2026 17:01
@dsikka dsikka requested a review from kylesayrs as a code owner April 6, 2026 17:01
@dsikka dsikka requested a review from brian-dellabetta April 6, 2026 17:04
@dsikka dsikka enabled auto-merge (squash) April 6, 2026 18:14
@dsikka
Collaborator Author

dsikka commented Apr 6, 2026

@coderabbitai review this PR

@coderabbitai

coderabbitai bot commented Apr 6, 2026

@dsikka, I'll kick off a review of this PR right away!

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai bot commented Apr 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1b43a0a3-a22a-4354-b38e-bac71a87aaf4

📥 Commits

Reviewing files that changed from the base of the PR and between c4b4c44 and 2fab08a.

📒 Files selected for processing (1)
  • examples/quantization_w8a8_fp8/gemma4_example.py
✅ Files skipped from review due to trivial changes (1)
  • examples/quantization_w8a8_fp8/gemma4_example.py

📝 Walkthrough

Walkthrough

Adds MoE calibration support for Gemma4 by introducing modules that unpack packed expert weights and perform routed expert execution during calibration, plus two new example scripts demonstrating PTQ quantization workflows (NVFP4 and FP8-Dynamic) for the Gemma4 model.

Changes

Cohort / File(s): Summary

  • Example Quantization Scripts (examples/quantization_w4a4_fp4/gemma4_example.py, examples/quantization_w8a8_fp8/gemma4_example.py):
    Two new example scripts showing one-shot PTQ quantization for google/gemma-4-26B-A4B-it. They load model+processor, define quantization recipes (NVFP4 and FP8_DYNAMIC), prepare a calibration dataset and collator, run oneshot(...), and save the quantized model and processor.
  • MoE Calibration Module (src/llmcompressor/modeling/gemma4.py):
    Adds SequentialGemma4TextExperts (registered as a MoECalibrationModule) and Gemma4TextExpertsList. These unpack packed Gemma4 expert weights into per-expert MLPs and implement per-expert routed execution, handling selection of routed tokens, weighted outputs, and accumulation into final hidden states. Review tensor indexing, weight-copying, and routing logic.
  • Module Exports (src/llmcompressor/modeling/__init__.py):
    Re-exports SequentialGemma4TextExperts from gemma4.py into the modeling package public namespace.
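The FP8-Dynamic one-shot flow the example scripts follow can be sketched as below. This is a hedged sketch rather than the PR's actual script: the model ID is taken from the PR description, the ignore list is an assumption, and `default_save_dir` is a hypothetical helper.

```python
def default_save_dir(model_id: str, suffix: str = "FP8-Dynamic") -> str:
    """Derive a local save directory name from a Hugging Face model ID."""
    return model_id.rstrip("/").split("/")[-1] + "-" + suffix


def quantize_fp8_dynamic(model_id: str = "google/gemma-4-26B-A4B-it") -> str:
    # Imports are local so default_save_dir stays importable without llmcompressor.
    from transformers import AutoModelForCausalLM, AutoProcessor

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    # FP8_DYNAMIC quantizes weights statically and activations dynamically at
    # runtime, so no calibration dataset is required for this recipe.
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head"],  # assumed ignore list, not taken from the PR
    )
    oneshot(model=model, recipe=recipe)

    save_dir = default_save_dir(model_id)
    model.save_pretrained(save_dir)
    processor.save_pretrained(save_dir)
    return save_dir
```

Calling `quantize_fp8_dynamic()` would download and quantize the full model, so only the directory-name helper is exercised cheaply.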

Sequence Diagram(s)

sequenceDiagram
    participant Calibration as Calibration Process
    participant SeqExperts as SequentialGemma4TextExperts
    participant ExpertsList as Gemma4TextExpertsList
    participant Expert as Gemma4TextMLP_Expert

    Calibration->>SeqExperts: forward(hidden_states, top_k_index, top_k_weights)
    activate SeqExperts
    SeqExperts->>SeqExperts: build one-hot expert mask from top_k_index
    loop For each expert (i)
        SeqExperts->>SeqExperts: select token indices for expert i
        alt calibrate_all_experts
            SeqExperts->>Expert: forward(selected or all tokens) through expert i
        else
            SeqExperts->>Expert: forward only routed tokens through expert i
        end
        Expert-->>SeqExperts: expert output for tokens
        SeqExperts->>SeqExperts: weight outputs by top_k_weights and scatter into final_hidden_states
    end
    SeqExperts-->>Calibration: final_hidden_states
    deactivate SeqExperts
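The per-expert loop in the diagram can be sketched in plain PyTorch. This is a simplified stand-in for illustration, not the PR's SequentialGemma4TextExperts: here `experts` is any sequence of callables mapping [n, hidden] to [n, hidden], and the calibrate_all_experts branch is omitted.

```python
import torch


def routed_forward(experts, hidden_states, top_k_index, top_k_weights):
    """hidden_states: [num_tokens, hidden]; top_k_index/top_k_weights: [num_tokens, top_k]."""
    num_experts = len(experts)
    final_hidden_states = torch.zeros_like(hidden_states)
    # expert_mask[e, k, t] is True when token t routes to expert e in slot k
    expert_mask = torch.nn.functional.one_hot(
        top_k_index, num_classes=num_experts
    ).permute(2, 1, 0).bool()
    for e in range(num_experts):
        slot_idx, token_idx = torch.where(expert_mask[e])
        if token_idx.numel() == 0:
            continue  # no tokens routed to this expert
        expert_out = experts[e](hidden_states[token_idx])
        # weight each token's expert output by its routing weight
        weighted = expert_out * top_k_weights[token_idx, slot_idx].unsqueeze(-1)
        # scatter-accumulate into the final hidden states
        final_hidden_states.index_add_(
            0, token_idx, weighted.to(final_hidden_states.dtype)
        )
    return final_hidden_states
```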

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰 I unpack weights with nimble paws and glee,
Each expert hops to teach a token-tree,
Calibration crumbs I nibble through the night,
Quantized dreams in Gemma's glow take flight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[Gemma4] Add MoE Calibration Support' directly and clearly describes the main change: adding Mixture-of-Experts calibration support for the Gemma4 model, which aligns with the PR's primary objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/quantization_w8a8_fp8/gemma4_example.py`:
- Around line 35-37: The hard-coded SAVE_DIR uses a personal absolute path which
breaks portability; change the SAVE_DIR computation to derive a model-local path
(e.g., base on MODEL_ID name) or make it configurable via an environment
variable/CLI flag, and use os.path.join and safe string handling when building
the path; update the code locations that call model.save_pretrained and
processor.save_pretrained to use the new SAVE_DIR variable (referenced symbols:
MODEL_ID, SAVE_DIR, model.save_pretrained, processor.save_pretrained).

In `@src/llmcompressor/modeling/gemma4.py`:
- Around line 29-45: The constructor currently assumes config has a text_config
attribute; update __init__ to normalize the incoming config so it accepts either
a Gemma4Config (with .text_config) or a Gemma4TextConfig directly: detect if
config has a text_config attribute (or is instance of Gemma4Config) and set a
local text_config = config.text_config otherwise set text_config = config, then
use text_config when constructing Gemma4TextExpertsList (replace the direct use
of config.text_config with the normalized text_config); keep existing parameter
names (original, config, calibrate_all_experts) and other fields (num_experts,
hidden_dim, intermediate_dim, calibrate_all_experts) unchanged.
- Around line 90-116: The unpacked experts' Linear layer metadata must be
patched to reflect the expert dimensions (moe_intermediate_size) instead of the
full MLP sizes: after assigning weights in Gemma4TextExperts.__init__ (inside
the loop over self[i] created from Gemma4TextMLP), set the per-expert attributes
so the quantization pipeline sees the correct shapes — e.g. set
self[i].intermediate_size = intermediate_size; set self[i].gate_proj.in_features
= config.hidden_size and self[i].gate_proj.out_features = intermediate_size; set
self[i].up_proj.in_features = config.hidden_size and
self[i].up_proj.out_features = intermediate_size; set
self[i].down_proj.in_features = intermediate_size and
self[i].down_proj.out_features = config.hidden_size (and any equivalent
.weight.shape-derived metadata) so gate_proj/up_proj/down_proj metadata matches
the copied weights.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2b4b0849-2cea-4558-9b6a-d606b17ac04b

📥 Commits

Reviewing files that changed from the base of the PR and between a65b5dd and c4b4c44.

📒 Files selected for processing (4)
  • examples/quantization_w4a4_fp4/gemma4_example.py
  • examples/quantization_w8a8_fp8/gemma4_example.py
  • src/llmcompressor/modeling/__init__.py
  • src/llmcompressor/modeling/gemma4.py

Comment on lines +29 to +45
def __init__(
    self,
    original: Gemma4TextExperts,
    config: Gemma4Config,
    calibrate_all_experts: bool = True,
):
    super().__init__()
    self.num_experts = original.num_experts
    self.hidden_dim = original.hidden_dim
    self.intermediate_dim = original.intermediate_dim
    self.calibrate_all_experts = calibrate_all_experts

    # Unpack the 3D expert weights into individual MLP modules
    # Register experts directly as numbered children to avoid double nesting
    # (HF has layers[i].experts, so we want layers[i].experts.0,
    # not layers[i].experts.experts.0)
    expert_list = Gemma4TextExpertsList(config.text_config, original)

@coderabbitai coderabbitai bot Apr 6, 2026


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

In the latest Hugging Face Transformers Gemma4 documentation/source, what config classes do Gemma4TextModel, Gemma4ForCausalLM, and Gemma4ForConditionalGeneration take?

💡 Result:

  • Gemma4TextModel takes Gemma4TextConfig. [1]
  • Gemma4ForCausalLM takes Gemma4TextConfig. [2]
  • Gemma4ForConditionalGeneration takes Gemma4Config. [3]

Sources:
[1] Hugging Face Transformers Gemma4 docs (Gemma4TextModel signature) (huggingface.co)
[2] Hugging Face Transformers Gemma4 docs (Gemma4ForCausalLM signature) (huggingface.co)
[3] Hugging Face Transformers Gemma4 docs (Gemma4ForConditionalGeneration signature) (huggingface.co)


🏁 Script executed:

# First, let's check the file at the specified lines
cat -n src/llmcompressor/modeling/gemma4.py | sed -n '20,60p'

Repository: vllm-project/llm-compressor

Length of output: 1820


🏁 Script executed:

# Also check what Gemma4TextExpertsList expects
rg -A 10 "class Gemma4TextExpertsList" src/

Repository: vllm-project/llm-compressor

Length of output: 936


🏁 Script executed:

# Check how SequentialGemma4TextExperts is instantiated
rg -B 5 -A 2 "SequentialGemma4TextExperts" src/

Repository: vllm-project/llm-compressor

Length of output: 1332


🏁 Script executed:

# Check the imports to understand config types
head -20 src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 617


🏁 Script executed:

# Search for where SequentialGemma4TextExperts is instantiated or called
rg "SequentialGemma4TextExperts\|Gemma4TextExperts" src/ --type py -B 2 -A 2 | head -100

Repository: vllm-project/llm-compressor

Length of output: 53


🏁 Script executed:

# Check if there's a test or calibration flow that shows how this is called
find . -name "*.py" -type f -exec grep -l "SequentialGemma4TextExperts\|Gemma4TextExperts" {} \; | head -10

Repository: vllm-project/llm-compressor

Length of output: 252


🏁 Script executed:

# Check the example files to see how configs are passed through the calibration flow
cat src/examples/quantization_w4a4_fp4/gemma4_example.py 2>/dev/null || cat ./examples/quantization_w4a4_fp4/gemma4_example.py

Repository: vllm-project/llm-compressor

Length of output: 2554


🏁 Script executed:

# Also check the other example
cat src/examples/quantization_w8a8_fp8/gemma4_example.py 2>/dev/null || cat ./examples/quantization_w8a8_fp8/gemma4_example.py

Repository: vllm-project/llm-compressor

Length of output: 1325


🏁 Script executed:

# Search for the MoECalibrationModule to understand how it instantiates these modules
rg "class MoECalibrationModule" -A 30 src/

Repository: vllm-project/llm-compressor

Length of output: 2474


Normalize config to handle both Gemma4Config and Gemma4TextConfig at input.

Gemma4ForConditionalGeneration takes Gemma4Config (with nested text_config), while Gemma4TextModel and Gemma4ForCausalLM take Gemma4TextConfig directly. The current code unconditionally accesses config.text_config, which fails if this calibration wrapper is used in a text-only model path. Normalizing the config input makes the module reusable across both entry points.

Suggested fix
-        config: Gemma4Config,
+        config: Gemma4Config | Gemma4TextConfig,
@@
-        expert_list = Gemma4TextExpertsList(config.text_config, original)
+        text_config = getattr(config, "text_config", config)
+        expert_list = Gemma4TextExpertsList(text_config, original)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modeling/gemma4.py` around lines 29 - 45, The constructor
currently assumes config has a text_config attribute; update __init__ to
normalize the incoming config so it accepts either a Gemma4Config (with
.text_config) or a Gemma4TextConfig directly: detect if config has a text_config
attribute (or is instance of Gemma4Config) and set a local text_config =
config.text_config otherwise set text_config = config, then use text_config when
constructing Gemma4TextExpertsList (replace the direct use of config.text_config
with the normalized text_config); keep existing parameter names (original,
config, calibrate_all_experts) and other fields (num_experts, hidden_dim,
intermediate_dim, calibrate_all_experts) unchanged.
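The getattr-based normalization recommended above can be exercised without transformers installed; the config classes below are hypothetical stand-ins for Gemma4Config and Gemma4TextConfig.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTextConfig:
    """Stand-in for Gemma4TextConfig (text-only, no .text_config attribute)."""
    num_experts: int = 8


@dataclass
class FakeMultimodalConfig:
    """Stand-in for Gemma4Config (multimodal, nests a text config)."""
    text_config: FakeTextConfig = field(default_factory=FakeTextConfig)


def normalize_text_config(config):
    # Works for both the multimodal config (has .text_config) and the
    # text-only config (is itself the text config).
    return getattr(config, "text_config", config)
```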


@kylesayrs, thanks for confirming! The HuggingFace Transformers source indeed shows Gemma4ForCausalLM.__init__ takes Gemma4TextConfig directly — so config.text_config would raise an AttributeError on the text-only model path.

Here is the concrete fix for src/llmcompressor/modeling/gemma4.py:

 if TYPE_CHECKING:
     from transformers.models.gemma4.configuration_gemma4 import Gemma4TextConfig
     from transformers.models.gemma4.modeling_gemma4 import (
         Gemma4Config,
         Gemma4TextExperts,
     )

 ...

     def __init__(
         self,
         original: Gemma4TextExperts,
-        config: Gemma4Config,
+        config: "Gemma4Config | Gemma4TextConfig",
         calibrate_all_experts: bool = True,
     ):
         super().__init__()
         self.num_experts = original.num_experts
         self.hidden_dim = original.hidden_dim
         self.intermediate_dim = original.intermediate_dim
         self.calibrate_all_experts = calibrate_all_experts

-        expert_list = Gemma4TextExpertsList(config.text_config, original)
+        text_config = getattr(config, "text_config", config)
+        expert_list = Gemma4TextExpertsList(text_config, original)

getattr(config, "text_config", config) safely handles both:

  • Gemma4Config (multimodal) → extracts .text_config
  • Gemma4TextConfig (text-only) → falls back to config itself

✏️ Learnings added
Learnt from: kylesayrs
Repo: vllm-project/llm-compressor PR: 2565
File: src/llmcompressor/modeling/gemma4.py:29-45
Timestamp: 2026-04-06T21:57:15.076Z
Learning: In `src/llmcompressor/modeling/gemma4.py`, `SequentialGemma4TextExperts.__init__` receives a `config` argument that can be either `Gemma4Config` (multimodal, with a `.text_config` attribute) or `Gemma4TextConfig` (text-only, no `.text_config`). The correct normalization is `text_config = getattr(config, "text_config", config)` before passing it to `Gemma4TextExpertsList`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +90 to +116
def __init__(self, config: Gemma4TextConfig, original: Gemma4TextExperts):
    from transformers.models.gemma4.modeling_gemma4 import Gemma4TextMLP

    self.num_experts = config.num_experts
    intermediate_size = config.moe_intermediate_size

    with skip_weights_initialize():
        super().__init__(
            [Gemma4TextMLP(config, layer_idx=0) for _ in range(self.num_experts)]
        )

    gate_up_data = original.gate_up_proj.data  # [num_experts, 2*inter, hidden]
    down_data = original.down_proj.data  # [num_experts, hidden, inter]

    for i in range(self.num_experts):
        gate_up = gate_up_data[i]  # [2*intermediate, hidden]
        down = down_data[i]  # [hidden, intermediate]

        # gate_up_proj stores [gate; up] stacked along dim 0
        # nn.Linear weight is [out_features, in_features]
        self[i].gate_proj.weight.data = (
            gate_up[:intermediate_size, :].clone().contiguous()
        )
        self[i].up_proj.weight.data = (
            gate_up[intermediate_size:, :].clone().contiguous()
        )
        self[i].down_proj.weight.data = down.clone().contiguous()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C2 '\.(in_features|out_features)\b' src/llmcompressor

Repository: vllm-project/llm-compressor

Length of output: 1674


🏁 Script executed:

cat -n src/llmcompressor/modeling/gemma4.py | head -150 | tail -80

Repository: vllm-project/llm-compressor

Length of output: 3955


🏁 Script executed:

# Get more context around the file
wc -l src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 111


🏁 Script executed:

# Check if Gemma4TextExperts class exists and understand its full context
rg -A 40 "class Gemma4TextExperts" src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 1537


🏁 Script executed:

# Find where this class is used to understand impact
rg -n "Gemma4TextExperts" src/llmcompressor/

Repository: vllm-project/llm-compressor

Length of output: 937


🏁 Script executed:

# Check if there are more lines after 116 that patch the metadata
tail -n +110 src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 394


🏁 Script executed:

# Check where Gemma4TextMLP comes from and understand what it expects
rg -n "class Gemma4TextMLP|intermediate_size" src/llmcompressor/ | grep -i gemma

Repository: vllm-project/llm-compressor

Length of output: 385


🏁 Script executed:

# Also check if there's documentation or tests showing the mismatch
rg -n "moe_intermediate_size|intermediate_size" src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 274


🏁 Script executed:

# Verify the actual config values to confirm the mismatch
# Search for how config is structured
rg -n "moe_intermediate_size|class Gemma4TextConfig" src/llmcompressor/ -A 3

Repository: vllm-project/llm-compressor

Length of output: 1034


🏁 Script executed:

# Check if there's any recent commit or indication that this was fixed
git log --oneline -20 -- src/llmcompressor/modeling/gemma4.py 2>/dev/null || echo "No git history available"

Repository: vllm-project/llm-compressor

Length of output: 119


🏁 Script executed:

# Check if Gemma4TextMLP accepts intermediate_size parameter like GlmMoeDsaMLP does
rg -n "def __init__.*Gemma4TextMLP|intermediate_size" --type py | head -20

Repository: vllm-project/llm-compressor

Length of output: 2178


🏁 Script executed:

# Look at how GlmMoeDsaMLP differs - it passes intermediate_size param
rg -B 5 -A 15 "class GlmMoeDsaMLP" src/llmcompressor/modeling/glm_moe_dsa.py | head -30

Repository: vllm-project/llm-compressor

Length of output: 53


🏁 Script executed:

# Let's verify the actual parameter signature expected by Gemma4TextMLP
# Check the HF transformers v5.5.0 implementation referenced in the review
rg -n "class Gemma4TextMLP|def __init__" src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 173


🏁 Script executed:

# Check if there are any imports or usage of intermediate_size in relation to Gemma4TextMLP
rg -B 3 -A 10 "Gemma4TextMLP" src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 1152


🏁 Script executed:

# Check if there's any config modification or alternative approach in the full gemma4.py file
cat -n src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 5225


Patch the unpacked experts' Linear layer metadata to match expert shapes.

When copying Gemma4TextExpertsList weights from the packed format, the weight tensors are sized by config.moe_intermediate_size (e.g., 704 on Gemma-4-26B-A4B-it), but Gemma4TextMLP is initialized with the full config, which has config.intermediate_size (e.g., 2112). This leaves in_features, out_features, and intermediate_size describing the dense MLP, not the expert dimensions. The quantization pipeline (e.g., imatrix observer, AWQ balancing) reads these attributes to validate shapes; the mismatch will cause incorrect quantization behavior.

Required metadata patch
             for i in range(self.num_experts):
                 gate_up = gate_up_data[i]  # [2*intermediate, hidden]
                 down = down_data[i]  # [hidden, intermediate]
@@ -109,9 +109,15 @@ class Gemma4TextExpertsList(torch.nn.ModuleList):
                 self[i].gate_proj.weight.data = (
                     gate_up[:intermediate_size, :].clone().contiguous()
                 )
+                self[i].gate_proj.in_features = config.hidden_size
+                self[i].gate_proj.out_features = intermediate_size
                 self[i].up_proj.weight.data = (
                     gate_up[intermediate_size:, :].clone().contiguous()
                 )
+                self[i].up_proj.in_features = config.hidden_size
+                self[i].up_proj.out_features = intermediate_size
                 self[i].down_proj.weight.data = down.clone().contiguous()
+                self[i].down_proj.in_features = intermediate_size
+                self[i].down_proj.out_features = config.hidden_size
+                self[i].intermediate_size = intermediate_size
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def __init__(self, config: Gemma4TextConfig, original: Gemma4TextExperts):
    from transformers.models.gemma4.modeling_gemma4 import Gemma4TextMLP

    self.num_experts = config.num_experts
    intermediate_size = config.moe_intermediate_size

    with skip_weights_initialize():
        super().__init__(
            [Gemma4TextMLP(config, layer_idx=0) for _ in range(self.num_experts)]
        )

    gate_up_data = original.gate_up_proj.data  # [num_experts, 2*inter, hidden]
    down_data = original.down_proj.data  # [num_experts, hidden, inter]

    for i in range(self.num_experts):
        gate_up = gate_up_data[i]  # [2*intermediate, hidden]
        down = down_data[i]  # [hidden, intermediate]

        # gate_up_proj stores [gate; up] stacked along dim 0
        # nn.Linear weight is [out_features, in_features]
        self[i].gate_proj.weight.data = (
            gate_up[:intermediate_size, :].clone().contiguous()
        )
        self[i].gate_proj.in_features = config.hidden_size
        self[i].gate_proj.out_features = intermediate_size
        self[i].up_proj.weight.data = (
            gate_up[intermediate_size:, :].clone().contiguous()
        )
        self[i].up_proj.in_features = config.hidden_size
        self[i].up_proj.out_features = intermediate_size
        self[i].down_proj.weight.data = down.clone().contiguous()
        self[i].down_proj.in_features = intermediate_size
        self[i].down_proj.out_features = config.hidden_size
        self[i].intermediate_size = intermediate_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modeling/gemma4.py` around lines 90 - 116, The unpacked
experts' Linear layer metadata must be patched to reflect the expert dimensions
(moe_intermediate_size) instead of the full MLP sizes: after assigning weights
in Gemma4TextExperts.__init__ (inside the loop over self[i] created from
Gemma4TextMLP), set the per-expert attributes so the quantization pipeline sees
the correct shapes — e.g. set self[i].intermediate_size = intermediate_size; set
self[i].gate_proj.in_features = config.hidden_size and
self[i].gate_proj.out_features = intermediate_size; set
self[i].up_proj.in_features = config.hidden_size and
self[i].up_proj.out_features = intermediate_size; set
self[i].down_proj.in_features = intermediate_size and
self[i].down_proj.out_features = config.hidden_size (and any equivalent
.weight.shape-derived metadata) so gate_proj/up_proj/down_proj metadata matches
the copied weights.

oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format.
SAVE_DIR = "/raid/engine/dsikka/" + MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
Collaborator


Change path
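One portable alternative along the lines of the review feedback, sketched with an invented LLMC_SAVE_ROOT environment variable (not an existing project convention):

```python
import os


def resolve_save_dir(model_id: str, suffix: str) -> str:
    """Build a save path from the model ID, rooted at an optional env var
    instead of a hard-coded personal directory."""
    root = os.environ.get("LLMC_SAVE_ROOT", ".")
    name = model_id.rstrip("/").split("/")[-1] + "-" + suffix
    return os.path.join(root, name)
```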


# gate_up_proj stores [gate; up] stacked along dim 0
# nn.Linear weight is [out_features, in_features]
self[i].gate_proj.weight.data = (
    gate_up[:intermediate_size, :].clone().contiguous()
Collaborator


This is an out-of-scope problem, but creating the new Gemma4TextExpertsList while the original module is still alive causes a memory spike, which may cause OOM for large models (for example, kimi-k2 is 34Gb; doubling that to 68Gb is a large cost).
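One way to avoid holding two full copies, sketched here as an illustration rather than the PR's approach, is to build the per-expert weights as views that alias the packed storage instead of clones. Whether downstream quantization code tolerates aliased, non-owning Parameters is an open question.

```python
import torch
from torch import nn


def unpack_as_views(packed: torch.Tensor, intermediate: int):
    """Split a [num_experts, 2*intermediate, hidden] tensor into per-expert
    (gate, up) Parameters that alias the packed storage, so unpacking does
    not allocate a second full copy of the expert weights."""
    experts = []
    for i in range(packed.shape[0]):
        # Slices of a contiguous tensor are views; wrapping them in
        # nn.Parameter keeps the underlying storage shared.
        gate = nn.Parameter(packed[i, :intermediate], requires_grad=False)
        up = nn.Parameter(packed[i, intermediate:], requires_grad=False)
        experts.append((gate, up))
    return experts
```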

# gate_up_proj stores [gate; up] stacked along dim 0
# nn.Linear weight is [out_features, in_features]
self[i].gate_proj.weight.data = (
    gate_up[:intermediate_size, :].clone().contiguous()
Collaborator


Clone and contiguous are redundant
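A quick check of this point, assuming the packed per-expert tensor is contiguous: row slices of a contiguous tensor are already contiguous, and clone() preserves layout, so the trailing .contiguous() is a no-op.

```python
import torch

gate_up = torch.randn(8, 4)       # stand-in for one expert's packed weights
half = gate_up[:4, :].clone()     # row slice of a contiguous tensor, then clone
assert half.is_contiguous()
# .contiguous() returns self when the tensor is already contiguous
assert half.contiguous() is half
bottom = gate_up[4:, :].clone()   # offset row slice behaves the same way
assert bottom.is_contiguous()
```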

key: (
    torch.tensor(value)
    if key != "pixel_values"
    else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
Collaborator


Suggested change
-    else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+    else torch.tensor(value, dtype=model.dtype).squeeze(0)
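A minimal collator sketch following the suggestion above, with `model_dtype` standing in for `model.dtype` from the example script; `make_collator` is a hypothetical helper, not the PR's code.

```python
import torch


def make_collator(model_dtype: torch.dtype):
    """Build a batch-size-1 data collator that casts pixel_values to the
    model's own dtype instead of hard-coding bfloat16."""
    def data_collator(batch):
        assert len(batch) == 1, "calibration runs with batch size 1"
        sample = batch[0]
        return {
            key: (
                torch.tensor(value)
                if key != "pixel_values"
                else torch.tensor(value, dtype=model_dtype).squeeze(0)
            )
            for key, value in sample.items()
        }
    return data_collator
```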

Code changes were requested by @dsikka.

*
#2565 (comment)

The following files were modified:

* `examples/quantization_w8a8_fp8/gemma4_example.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@mergify
Contributor

mergify bot commented Apr 6, 2026

The quality checks have failed. Please run make style and make quality under the root directory to address the lint failures. You will need to install the dev optional install to get the required linting packages: https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify mergify bot removed the quality-failed label Apr 6, 2026
@dsikka
Collaborator Author

dsikka commented Apr 6, 2026

@coderabbitai review and add labels

@mergify
Contributor

mergify bot commented Apr 6, 2026

The quality checks have failed. Please run make style and make quality under the root directory to address the lint failures. You will need to install the dev optional install to get the required linting packages: https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify mergify bot removed the quality-failed label Apr 7, 2026
@mergify
Contributor

mergify bot commented Apr 7, 2026

The quality checks have failed. Please run make style and make quality under the root directory to address the lint failures. You will need to install the dev optional install to get the required linting packages: https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@dsikka
Collaborator Author

dsikka commented Apr 7, 2026

@coderabbitai Review and label

@dsikka
Collaborator Author

dsikka commented Apr 7, 2026

@coderabbitai review

@dsikka
Collaborator Author

dsikka commented Apr 7, 2026

@coderabbitai full review


Labels

quality-failed, ready (When a PR is ready for review)

3 participants