
[Gemma4] Add MoE Calibration Support#2565

Open
dsikka wants to merge 9 commits into main from gemma4-support

Conversation

@dsikka
Collaborator

@dsikka dsikka commented Apr 5, 2026

SUMMARY:

Testing:

  • Generated Checkpoints:
  1. https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-FP8-Dynamic
  2. https://huggingface.co/RedHatAI/gemma-4-26B-A4B-it-NVFP4

Summary by CodeRabbit

  • New Features

    • Added support for per-expert calibration and execution for Gemma4 MoE models, enabling unpacked expert-level calibration and quantization workflows.
    • Exposed tooling to save quantized Gemma4 models and their processors after calibration.
  • Documentation

    • Added example scripts demonstrating NVFP4 (W4A4) and FP8 dynamic (W8A8) quantization workflows, calibration dataset preprocessing, and model export.

@github-actions

github-actions bot commented Apr 5, 2026

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for Gemma 4 quantization, featuring a new example script and the SequentialGemma4TextExperts module to unpack 3D expert weights for calibration and vLLM compatibility. Feedback identifies a typo in the example script, recommends more descriptive error handling for batch sizes in the data collator, and suggests using torch.no_grad() with copy_() for safer parameter updates.

@dsikka dsikka added the ready When a PR is ready for review label Apr 6, 2026
@dsikka dsikka marked this pull request as ready for review April 6, 2026 17:01
@dsikka dsikka requested a review from kylesayrs as a code owner April 6, 2026 17:01
@dsikka dsikka requested a review from brian-dellabetta April 6, 2026 17:04
@dsikka dsikka enabled auto-merge (squash) April 6, 2026 18:14
@dsikka
Collaborator Author

dsikka commented Apr 6, 2026

@coderabbitai review this PR

@coderabbitai

coderabbitai bot commented Apr 6, 2026

@dsikka, I'll kick off a review of this PR right away!

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai bot commented Apr 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1b43a0a3-a22a-4354-b38e-bac71a87aaf4

📥 Commits

Reviewing files that changed from the base of the PR and between c4b4c44 and 2fab08a.

📒 Files selected for processing (1)
  • examples/quantization_w8a8_fp8/gemma4_example.py
✅ Files skipped from review due to trivial changes (1)
  • examples/quantization_w8a8_fp8/gemma4_example.py

📝 Walkthrough

Walkthrough

Adds MoE calibration support for Gemma4 by introducing modules that unpack packed expert weights and perform routed expert execution during calibration, plus two new example scripts demonstrating PTQ quantization workflows (NVFP4 and FP8-Dynamic) for the Gemma4 model.

Changes

Cohort / File(s): Summary

  • Example Quantization Scripts (examples/quantization_w4a4_fp4/gemma4_example.py, examples/quantization_w8a8_fp8/gemma4_example.py):
    Two new example scripts showing one-shot PTQ quantization for google/gemma-4-26B-A4B-it. They load model+processor, define quantization recipes (NVFP4 and FP8_DYNAMIC), prepare a calibration dataset and collator, run oneshot(...), and save the quantized model and processor.
  • MoE Calibration Module (src/llmcompressor/modeling/gemma4.py):
    Adds SequentialGemma4TextExperts (registered as a MoECalibrationModule) and Gemma4TextExpertsList. These unpack packed Gemma4 expert weights into per-expert MLPs and implement per-expert routed execution, handling selection of routed tokens, weighted outputs, and accumulation into final hidden states. Review tensor indexing, weight-copying, and routing logic.
  • Module Exports (src/llmcompressor/modeling/__init__.py):
    Re-exports SequentialGemma4TextExperts from gemma4.py into the modeling package public namespace.
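The FP8-Dynamic one-shot flow the example scripts follow can be sketched as below. This is a hedged sketch rather than the PR's actual script: the model ID is taken from the PR description, the ignore list is an assumption, and `default_save_dir` is a hypothetical helper.

```python
def default_save_dir(model_id: str, suffix: str = "FP8-Dynamic") -> str:
    """Derive a local save directory name from a Hugging Face model ID."""
    return model_id.rstrip("/").split("/")[-1] + "-" + suffix


def quantize_fp8_dynamic(model_id: str = "google/gemma-4-26B-A4B-it") -> str:
    # Imports are local so default_save_dir stays importable without llmcompressor.
    from transformers import AutoModelForCausalLM, AutoProcessor

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    processor = AutoProcessor.from_pretrained(model_id)

    # FP8_DYNAMIC quantizes weights statically and activations dynamically at
    # runtime, so no calibration dataset is required for this recipe.
    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head"],  # assumed ignore list, not taken from the PR
    )
    oneshot(model=model, recipe=recipe)

    save_dir = default_save_dir(model_id)
    model.save_pretrained(save_dir)
    processor.save_pretrained(save_dir)
    return save_dir
```

Calling `quantize_fp8_dynamic()` would download and quantize the full model, so only the directory-name helper is exercised cheaply.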

Sequence Diagram(s)

sequenceDiagram
    participant Calibration as Calibration Process
    participant SeqExperts as SequentialGemma4TextExperts
    participant ExpertsList as Gemma4TextExpertsList
    participant Expert as Gemma4TextMLP_Expert

    Calibration->>SeqExperts: forward(hidden_states, top_k_index, top_k_weights)
    activate SeqExperts
    SeqExperts->>SeqExperts: build one-hot expert mask from top_k_index
    loop For each expert (i)
        SeqExperts->>SeqExperts: select token indices for expert i
        alt calibrate_all_experts
            SeqExperts->>Expert: forward(selected or all tokens) through expert i
        else
            SeqExperts->>Expert: forward only routed tokens through expert i
        end
        Expert-->>SeqExperts: expert output for tokens
        SeqExperts->>SeqExperts: weight outputs by top_k_weights and scatter into final_hidden_states
    end
    SeqExperts-->>Calibration: final_hidden_states
    deactivate SeqExperts
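The per-expert loop in the diagram can be sketched in plain PyTorch. This is a simplified stand-in for illustration, not the PR's SequentialGemma4TextExperts: here `experts` is any sequence of callables mapping [n, hidden] to [n, hidden], and the calibrate_all_experts branch is omitted.

```python
import torch


def routed_forward(experts, hidden_states, top_k_index, top_k_weights):
    """hidden_states: [num_tokens, hidden]; top_k_index/top_k_weights: [num_tokens, top_k]."""
    num_experts = len(experts)
    final_hidden_states = torch.zeros_like(hidden_states)
    # expert_mask[e, k, t] is True when token t routes to expert e in slot k
    expert_mask = torch.nn.functional.one_hot(
        top_k_index, num_classes=num_experts
    ).permute(2, 1, 0).bool()
    for e in range(num_experts):
        slot_idx, token_idx = torch.where(expert_mask[e])
        if token_idx.numel() == 0:
            continue  # no tokens routed to this expert
        expert_out = experts[e](hidden_states[token_idx])
        # weight each token's expert output by its routing weight
        weighted = expert_out * top_k_weights[token_idx, slot_idx].unsqueeze(-1)
        # scatter-accumulate into the final hidden states
        final_hidden_states.index_add_(
            0, token_idx, weighted.to(final_hidden_states.dtype)
        )
    return final_hidden_states
```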

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🐰 I unpack weights with nimble paws and glee,
Each expert hops to teach a token-tree,
Calibration crumbs I nibble through the night,
Quantized dreams in Gemma's glow take flight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title '[Gemma4] Add MoE Calibration Support' directly and clearly describes the main change: adding Mixture-of-Experts calibration support for the Gemma4 model, which aligns with the PR's primary objectives.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/quantization_w8a8_fp8/gemma4_example.py`:
- Around line 35-37: The hard-coded SAVE_DIR uses a personal absolute path which
breaks portability; change the SAVE_DIR computation to derive a model-local path
(e.g., base on MODEL_ID name) or make it configurable via an environment
variable/CLI flag, and use os.path.join and safe string handling when building
the path; update the code locations that call model.save_pretrained and
processor.save_pretrained to use the new SAVE_DIR variable (referenced symbols:
MODEL_ID, SAVE_DIR, model.save_pretrained, processor.save_pretrained).

In `@src/llmcompressor/modeling/gemma4.py`:
- Around line 29-45: The constructor currently assumes config has a text_config
attribute; update __init__ to normalize the incoming config so it accepts either
a Gemma4Config (with .text_config) or a Gemma4TextConfig directly: detect if
config has a text_config attribute (or is instance of Gemma4Config) and set a
local text_config = config.text_config otherwise set text_config = config, then
use text_config when constructing Gemma4TextExpertsList (replace the direct use
of config.text_config with the normalized text_config); keep existing parameter
names (original, config, calibrate_all_experts) and other fields (num_experts,
hidden_dim, intermediate_dim, calibrate_all_experts) unchanged.
- Around line 90-116: The unpacked experts' Linear layer metadata must be
patched to reflect the expert dimensions (moe_intermediate_size) instead of the
full MLP sizes: after assigning weights in Gemma4TextExperts.__init__ (inside
the loop over self[i] created from Gemma4TextMLP), set the per-expert attributes
so the quantization pipeline sees the correct shapes — e.g. set
self[i].intermediate_size = intermediate_size; set self[i].gate_proj.in_features
= config.hidden_size and self[i].gate_proj.out_features = intermediate_size; set
self[i].up_proj.in_features = config.hidden_size and
self[i].up_proj.out_features = intermediate_size; set
self[i].down_proj.in_features = intermediate_size and
self[i].down_proj.out_features = config.hidden_size (and any equivalent
.weight.shape-derived metadata) so gate_proj/up_proj/down_proj metadata matches
the copied weights.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2b4b0849-2cea-4558-9b6a-d606b17ac04b

📥 Commits

Reviewing files that changed from the base of the PR and between a65b5dd and c4b4c44.

📒 Files selected for processing (4)
  • examples/quantization_w4a4_fp4/gemma4_example.py
  • examples/quantization_w8a8_fp8/gemma4_example.py
  • src/llmcompressor/modeling/__init__.py
  • src/llmcompressor/modeling/gemma4.py

Comment on lines +29 to +45
def __init__(
    self,
    original: Gemma4TextExperts,
    config: Gemma4Config,
    calibrate_all_experts: bool = True,
):
    super().__init__()
    self.num_experts = original.num_experts
    self.hidden_dim = original.hidden_dim
    self.intermediate_dim = original.intermediate_dim
    self.calibrate_all_experts = calibrate_all_experts

    # Unpack the 3D expert weights into individual MLP modules
    # Register experts directly as numbered children to avoid double nesting
    # (HF has layers[i].experts, so we want layers[i].experts.0,
    # not layers[i].experts.experts.0)
    expert_list = Gemma4TextExpertsList(config.text_config, original)

@coderabbitai coderabbitai bot Apr 6, 2026


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🌐 Web query:

In the latest Hugging Face Transformers Gemma4 documentation/source, what config classes do Gemma4TextModel, Gemma4ForCausalLM, and Gemma4ForConditionalGeneration take?

💡 Result:

  • Gemma4TextModel takes Gemma4TextConfig. [1]
  • Gemma4ForCausalLM takes Gemma4TextConfig. [2]
  • Gemma4ForConditionalGeneration takes Gemma4Config. [3]

Sources:
[1] Hugging Face Transformers Gemma4 docs (Gemma4TextModel signature) (huggingface.co)
[2] Hugging Face Transformers Gemma4 docs (Gemma4ForCausalLM signature) (huggingface.co)
[3] Hugging Face Transformers Gemma4 docs (Gemma4ForConditionalGeneration signature) (huggingface.co)


🏁 Script executed:

# First, let's check the file at the specified lines
cat -n src/llmcompressor/modeling/gemma4.py | sed -n '20,60p'

Repository: vllm-project/llm-compressor

Length of output: 1820


🏁 Script executed:

# Also check what Gemma4TextExpertsList expects
rg -A 10 "class Gemma4TextExpertsList" src/

Repository: vllm-project/llm-compressor

Length of output: 936


🏁 Script executed:

# Check how SequentialGemma4TextExperts is instantiated
rg -B 5 -A 2 "SequentialGemma4TextExperts" src/

Repository: vllm-project/llm-compressor

Length of output: 1332


🏁 Script executed:

# Check the imports to understand config types
head -20 src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 617


🏁 Script executed:

# Search for where SequentialGemma4TextExperts is instantiated or called
rg "SequentialGemma4TextExperts\|Gemma4TextExperts" src/ --type py -B 2 -A 2 | head -100

Repository: vllm-project/llm-compressor

Length of output: 53


🏁 Script executed:

# Check if there's a test or calibration flow that shows how this is called
find . -name "*.py" -type f -exec grep -l "SequentialGemma4TextExperts\|Gemma4TextExperts" {} \; | head -10

Repository: vllm-project/llm-compressor

Length of output: 252


🏁 Script executed:

# Check the example files to see how configs are passed through the calibration flow
cat src/examples/quantization_w4a4_fp4/gemma4_example.py 2>/dev/null || cat ./examples/quantization_w4a4_fp4/gemma4_example.py

Repository: vllm-project/llm-compressor

Length of output: 2554


🏁 Script executed:

# Also check the other example
cat src/examples/quantization_w8a8_fp8/gemma4_example.py 2>/dev/null || cat ./examples/quantization_w8a8_fp8/gemma4_example.py

Repository: vllm-project/llm-compressor

Length of output: 1325


🏁 Script executed:

# Search for the MoECalibrationModule to understand how it instantiates these modules
rg "class MoECalibrationModule" -A 30 src/

Repository: vllm-project/llm-compressor

Length of output: 2474


Normalize config to handle both Gemma4Config and Gemma4TextConfig at input.

Gemma4ForConditionalGeneration takes Gemma4Config (with nested text_config), while Gemma4TextModel and Gemma4ForCausalLM take Gemma4TextConfig directly. The current code unconditionally accesses config.text_config, which fails if this calibration wrapper is used in a text-only model path. Normalizing the config input makes the module reusable across both entry points.

Suggested fix
-        config: Gemma4Config,
+        config: Gemma4Config | Gemma4TextConfig,
@@
-        expert_list = Gemma4TextExpertsList(config.text_config, original)
+        text_config = getattr(config, "text_config", config)
+        expert_list = Gemma4TextExpertsList(text_config, original)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modeling/gemma4.py` around lines 29 - 45, The constructor
currently assumes config has a text_config attribute; update __init__ to
normalize the incoming config so it accepts either a Gemma4Config (with
.text_config) or a Gemma4TextConfig directly: detect if config has a text_config
attribute (or is instance of Gemma4Config) and set a local text_config =
config.text_config otherwise set text_config = config, then use text_config when
constructing Gemma4TextExpertsList (replace the direct use of config.text_config
with the normalized text_config); keep existing parameter names (original,
config, calibrate_all_experts) and other fields (num_experts, hidden_dim,
intermediate_dim, calibrate_all_experts) unchanged.
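The getattr-based normalization recommended above can be exercised without transformers installed; the config classes below are hypothetical stand-ins for Gemma4Config and Gemma4TextConfig.

```python
from dataclasses import dataclass, field


@dataclass
class FakeTextConfig:
    """Stand-in for Gemma4TextConfig (text-only, no .text_config attribute)."""
    num_experts: int = 8


@dataclass
class FakeMultimodalConfig:
    """Stand-in for Gemma4Config (multimodal, nests a text config)."""
    text_config: FakeTextConfig = field(default_factory=FakeTextConfig)


def normalize_text_config(config):
    # Works for both the multimodal config (has .text_config) and the
    # text-only config (is itself the text config).
    return getattr(config, "text_config", config)
```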


@kylesayrs, thanks for confirming! The HuggingFace Transformers source indeed shows Gemma4ForCausalLM.__init__ takes Gemma4TextConfig directly — so config.text_config would raise an AttributeError on the text-only model path.

Here is the concrete fix for src/llmcompressor/modeling/gemma4.py:

 if TYPE_CHECKING:
     from transformers.models.gemma4.configuration_gemma4 import Gemma4TextConfig
     from transformers.models.gemma4.modeling_gemma4 import (
         Gemma4Config,
         Gemma4TextExperts,
     )

 ...

     def __init__(
         self,
         original: Gemma4TextExperts,
-        config: Gemma4Config,
+        config: "Gemma4Config | Gemma4TextConfig",
         calibrate_all_experts: bool = True,
     ):
         super().__init__()
         self.num_experts = original.num_experts
         self.hidden_dim = original.hidden_dim
         self.intermediate_dim = original.intermediate_dim
         self.calibrate_all_experts = calibrate_all_experts

-        expert_list = Gemma4TextExpertsList(config.text_config, original)
+        text_config = getattr(config, "text_config", config)
+        expert_list = Gemma4TextExpertsList(text_config, original)

getattr(config, "text_config", config) safely handles both:

  • Gemma4Config (multimodal) → extracts .text_config
  • Gemma4TextConfig (text-only) → falls back to config itself

✏️ Learnings added
Learnt from: kylesayrs
Repo: vllm-project/llm-compressor PR: 2565
File: src/llmcompressor/modeling/gemma4.py:29-45
Timestamp: 2026-04-06T21:57:15.076Z
Learning: In `src/llmcompressor/modeling/gemma4.py`, `SequentialGemma4TextExperts.__init__` receives a `config` argument that can be either `Gemma4Config` (multimodal, with a `.text_config` attribute) or `Gemma4TextConfig` (text-only, no `.text_config`). The correct normalization is `text_config = getattr(config, "text_config", config)` before passing it to `Gemma4TextExpertsList`.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +90 to +116
def __init__(self, config: Gemma4TextConfig, original: Gemma4TextExperts):
    from transformers.models.gemma4.modeling_gemma4 import Gemma4TextMLP

    self.num_experts = config.num_experts
    intermediate_size = config.moe_intermediate_size

    with skip_weights_initialize():
        super().__init__(
            [Gemma4TextMLP(config, layer_idx=0) for _ in range(self.num_experts)]
        )

    gate_up_data = original.gate_up_proj.data  # [num_experts, 2*inter, hidden]
    down_data = original.down_proj.data  # [num_experts, hidden, inter]

    for i in range(self.num_experts):
        gate_up = gate_up_data[i]  # [2*intermediate, hidden]
        down = down_data[i]  # [hidden, intermediate]

        # gate_up_proj stores [gate; up] stacked along dim 0
        # nn.Linear weight is [out_features, in_features]
        self[i].gate_proj.weight.data = (
            gate_up[:intermediate_size, :].clone().contiguous()
        )
        self[i].up_proj.weight.data = (
            gate_up[intermediate_size:, :].clone().contiguous()
        )
        self[i].down_proj.weight.data = down.clone().contiguous()


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
rg -n -C2 '\.(in_features|out_features)\b' src/llmcompressor

Repository: vllm-project/llm-compressor

Length of output: 1674


🏁 Script executed:

cat -n src/llmcompressor/modeling/gemma4.py | head -150 | tail -80

Repository: vllm-project/llm-compressor

Length of output: 3955


🏁 Script executed:

# Get more context around the file
wc -l src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 111


🏁 Script executed:

# Check if Gemma4TextExperts class exists and understand its full context
rg -A 40 "class Gemma4TextExperts" src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 1537


🏁 Script executed:

# Find where this class is used to understand impact
rg -n "Gemma4TextExperts" src/llmcompressor/

Repository: vllm-project/llm-compressor

Length of output: 937


🏁 Script executed:

# Check if there are more lines after 116 that patch the metadata
tail -n +110 src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 394


🏁 Script executed:

# Check where Gemma4TextMLP comes from and understand what it expects
rg -n "class Gemma4TextMLP|intermediate_size" src/llmcompressor/ | grep -i gemma

Repository: vllm-project/llm-compressor

Length of output: 385


🏁 Script executed:

# Also check if there's documentation or tests showing the mismatch
rg -n "moe_intermediate_size|intermediate_size" src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 274


🏁 Script executed:

# Verify the actual config values to confirm the mismatch
# Search for how config is structured
rg -n "moe_intermediate_size|class Gemma4TextConfig" src/llmcompressor/ -A 3

Repository: vllm-project/llm-compressor

Length of output: 1034


🏁 Script executed:

# Check if there's any recent commit or indication that this was fixed
git log --oneline -20 -- src/llmcompressor/modeling/gemma4.py 2>/dev/null || echo "No git history available"

Repository: vllm-project/llm-compressor

Length of output: 119


🏁 Script executed:

# Check if Gemma4TextMLP accepts intermediate_size parameter like GlmMoeDsaMLP does
rg -n "def __init__.*Gemma4TextMLP|intermediate_size" --type py | head -20

Repository: vllm-project/llm-compressor

Length of output: 2178


🏁 Script executed:

# Look at how GlmMoeDsaMLP differs - it passes intermediate_size param
rg -B 5 -A 15 "class GlmMoeDsaMLP" src/llmcompressor/modeling/glm_moe_dsa.py | head -30

Repository: vllm-project/llm-compressor

Length of output: 53


🏁 Script executed:

# Let's verify the actual parameter signature expected by Gemma4TextMLP
# Check the HF transformers v5.5.0 implementation referenced in the review
rg -n "class Gemma4TextMLP|def __init__" src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 173


🏁 Script executed:

# Check if there are any imports or usage of intermediate_size in relation to Gemma4TextMLP
rg -B 3 -A 10 "Gemma4TextMLP" src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 1152


🏁 Script executed:

# Check if there's any config modification or alternative approach in the full gemma4.py file
cat -n src/llmcompressor/modeling/gemma4.py

Repository: vllm-project/llm-compressor

Length of output: 5225


Patch the unpacked experts' Linear layer metadata to match expert shapes.

When copying Gemma4TextExpertsList weights from the packed format, the weight tensors are sized by config.moe_intermediate_size (e.g., 704 on Gemma-4-26B-A4B-it), but Gemma4TextMLP is initialized with the full config, which has config.intermediate_size (e.g., 2112). This leaves in_features, out_features, and intermediate_size describing the dense MLP, not the expert dimensions. The quantization pipeline (e.g., imatrix observer, AWQ balancing) reads these attributes to validate shapes; the mismatch will cause incorrect quantization behavior.

Required metadata patch
             for i in range(self.num_experts):
                 gate_up = gate_up_data[i]  # [2*intermediate, hidden]
                 down = down_data[i]  # [hidden, intermediate]
@@ -109,9 +109,15 @@ class Gemma4TextExpertsList(torch.nn.ModuleList):
                 self[i].gate_proj.weight.data = (
                     gate_up[:intermediate_size, :].clone().contiguous()
                 )
+                self[i].gate_proj.in_features = config.hidden_size
+                self[i].gate_proj.out_features = intermediate_size
                 self[i].up_proj.weight.data = (
                     gate_up[intermediate_size:, :].clone().contiguous()
                 )
+                self[i].up_proj.in_features = config.hidden_size
+                self[i].up_proj.out_features = intermediate_size
                 self[i].down_proj.weight.data = down.clone().contiguous()
+                self[i].down_proj.in_features = intermediate_size
+                self[i].down_proj.out_features = config.hidden_size
+                self[i].intermediate_size = intermediate_size
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def __init__(self, config: Gemma4TextConfig, original: Gemma4TextExperts):
    from transformers.models.gemma4.modeling_gemma4 import Gemma4TextMLP

    self.num_experts = config.num_experts
    intermediate_size = config.moe_intermediate_size

    with skip_weights_initialize():
        super().__init__(
            [Gemma4TextMLP(config, layer_idx=0) for _ in range(self.num_experts)]
        )

    gate_up_data = original.gate_up_proj.data  # [num_experts, 2*inter, hidden]
    down_data = original.down_proj.data  # [num_experts, hidden, inter]

    for i in range(self.num_experts):
        gate_up = gate_up_data[i]  # [2*intermediate, hidden]
        down = down_data[i]  # [hidden, intermediate]

        # gate_up_proj stores [gate; up] stacked along dim 0
        # nn.Linear weight is [out_features, in_features]
        self[i].gate_proj.weight.data = (
            gate_up[:intermediate_size, :].clone().contiguous()
        )
        self[i].gate_proj.in_features = config.hidden_size
        self[i].gate_proj.out_features = intermediate_size
        self[i].up_proj.weight.data = (
            gate_up[intermediate_size:, :].clone().contiguous()
        )
        self[i].up_proj.in_features = config.hidden_size
        self[i].up_proj.out_features = intermediate_size
        self[i].down_proj.weight.data = down.clone().contiguous()
        self[i].down_proj.in_features = intermediate_size
        self[i].down_proj.out_features = config.hidden_size
        self[i].intermediate_size = intermediate_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/llmcompressor/modeling/gemma4.py` around lines 90 - 116, The unpacked
experts' Linear layer metadata must be patched to reflect the expert dimensions
(moe_intermediate_size) instead of the full MLP sizes: after assigning weights
in Gemma4TextExperts.__init__ (inside the loop over self[i] created from
Gemma4TextMLP), set the per-expert attributes so the quantization pipeline sees
the correct shapes — e.g. set self[i].intermediate_size = intermediate_size; set
self[i].gate_proj.in_features = config.hidden_size and
self[i].gate_proj.out_features = intermediate_size; set
self[i].up_proj.in_features = config.hidden_size and
self[i].up_proj.out_features = intermediate_size; set
self[i].down_proj.in_features = intermediate_size and
self[i].down_proj.out_features = config.hidden_size (and any equivalent
.weight.shape-derived metadata) so gate_proj/up_proj/down_proj metadata matches
the copied weights.

oneshot(model=model, recipe=recipe)

# Save to disk in compressed-tensors format.
SAVE_DIR = "/raid/engine/dsikka/" + MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
Collaborator


Change path
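One portable alternative along the lines of the review feedback, sketched with an invented LLMC_SAVE_ROOT environment variable (not an existing project convention):

```python
import os


def resolve_save_dir(model_id: str, suffix: str) -> str:
    """Build a save path from the model ID, rooted at an optional env var
    instead of a hard-coded personal directory."""
    root = os.environ.get("LLMC_SAVE_ROOT", ".")
    name = model_id.rstrip("/").split("/")[-1] + "-" + suffix
    return os.path.join(root, name)
```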


# gate_up_proj stores [gate; up] stacked along dim 0
# nn.Linear weight is [out_features, in_features]
self[i].gate_proj.weight.data = (
    gate_up[:intermediate_size, :].clone().contiguous()
Collaborator


This is an out-of-scope problem, but creating the new Gemma4TextExpertsList while the original module is still alive causes a memory spike, which may cause OOM for large models (for example, kimi-k2 is 34Gb; doubling that to 68Gb is a large cost).
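One way to avoid holding two full copies, sketched here as an illustration rather than the PR's approach, is to build the per-expert weights as views that alias the packed storage instead of clones. Whether downstream quantization code tolerates aliased, non-owning Parameters is an open question.

```python
import torch
from torch import nn


def unpack_as_views(packed: torch.Tensor, intermediate: int):
    """Split a [num_experts, 2*intermediate, hidden] tensor into per-expert
    (gate, up) Parameters that alias the packed storage, so unpacking does
    not allocate a second full copy of the expert weights."""
    experts = []
    for i in range(packed.shape[0]):
        # Slices of a contiguous tensor are views; wrapping them in
        # nn.Parameter keeps the underlying storage shared.
        gate = nn.Parameter(packed[i, :intermediate], requires_grad=False)
        up = nn.Parameter(packed[i, intermediate:], requires_grad=False)
        experts.append((gate, up))
    return experts
```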

# gate_up_proj stores [gate; up] stacked along dim 0
# nn.Linear weight is [out_features, in_features]
self[i].gate_proj.weight.data = (
    gate_up[:intermediate_size, :].clone().contiguous()
Collaborator


Clone and contiguous are redundant
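A quick check of this point, assuming the packed per-expert tensor is contiguous: row slices of a contiguous tensor are already contiguous, and clone() preserves layout, so the trailing .contiguous() is a no-op.

```python
import torch

gate_up = torch.randn(8, 4)       # stand-in for one expert's packed weights
half = gate_up[:4, :].clone()     # row slice of a contiguous tensor, then clone
assert half.is_contiguous()
# .contiguous() returns self when the tensor is already contiguous
assert half.contiguous() is half
bottom = gate_up[4:, :].clone()   # offset row slice behaves the same way
assert bottom.is_contiguous()
```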

key: (
    torch.tensor(value)
    if key != "pixel_values"
    else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
Collaborator


Suggested change
-    else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+    else torch.tensor(value, dtype=model.dtype).squeeze(0)
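A minimal collator sketch following the suggestion above, with `model_dtype` standing in for `model.dtype` from the example script; `make_collator` is a hypothetical helper, not the PR's code.

```python
import torch


def make_collator(model_dtype: torch.dtype):
    """Build a batch-size-1 data collator that casts pixel_values to the
    model's own dtype instead of hard-coding bfloat16."""
    def data_collator(batch):
        assert len(batch) == 1, "calibration runs with batch size 1"
        sample = batch[0]
        return {
            key: (
                torch.tensor(value)
                if key != "pixel_values"
                else torch.tensor(value, dtype=model_dtype).squeeze(0)
            )
            for key, value in sample.items()
        }
    return data_collator
```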

Code changes were requested by @dsikka.

*
#2565 (comment)

The following files were modified:

* `examples/quantization_w8a8_fp8/gemma4_example.py`

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@mergify
Contributor

mergify bot commented Apr 6, 2026

The quality checks have failed. Please run make style and make quality under the root directory to address the lint failures. You will need to install the dev optional install to get the required linting packages: https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify mergify bot removed the quality-failed label Apr 6, 2026
@dsikka
Collaborator Author

dsikka commented Apr 6, 2026

@coderabbitai review and add labels

@mergify
Contributor

mergify bot commented Apr 6, 2026

The quality checks have failed. Please run make style and make quality under the root directory to address the lint failures. You will need to install the dev optional install to get the required linting packages: https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@mergify mergify bot removed the quality-failed label Apr 7, 2026
@mergify
Contributor

mergify bot commented Apr 7, 2026

The quality checks have failed. Please run make style and make quality under the root directory to address the lint failures. You will need to install the dev optional install to get the required linting packages: https://github.com/vllm-project/llm-compressor/blob/main/CONTRIBUTING.md

@dsikka
Collaborator Author

dsikka commented Apr 7, 2026

@coderabbitai Review and label

@dsikka
Collaborator Author

dsikka commented Apr 7, 2026

@coderabbitai review

@dsikka
Collaborator Author

dsikka commented Apr 7, 2026

@coderabbitai full review


Labels

quality-failed, ready (When a PR is ready for review)

3 participants