Conversation
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.
Code Review
This pull request adds support for Gemma 4 quantization, featuring a new example script and the SequentialGemma4TextExperts module to unpack 3D expert weights for calibration and vLLM compatibility. Feedback identifies a typo in the example script, recommends more descriptive error handling for batch sizes in the data collator, and suggests using torch.no_grad() with copy_() for safer parameter updates.
@coderabbitai review this PR
✅ Actions performed: review triggered.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough
Adds MoE calibration support for Gemma4 by introducing modules that unpack packed expert weights and perform routed expert execution during calibration, plus two new example scripts demonstrating PTQ quantization workflows (NVFP4 and FP8-Dynamic) for the Gemma4 model.
Changes
Sequence Diagram(s)

    sequenceDiagram
        participant Calibration as Calibration Process
        participant SeqExperts as SequentialGemma4TextExperts
        participant ExpertsList as Gemma4TextExpertsList
        participant Expert as Gemma4TextMLP_Expert
        Calibration->>SeqExperts: forward(hidden_states, top_k_index, top_k_weights)
        activate SeqExperts
        SeqExperts->>SeqExperts: build one-hot expert mask from top_k_index
        loop For each expert (i)
            SeqExperts->>SeqExperts: select token indices for expert i
            alt calibrate_all_experts
                SeqExperts->>Expert: forward(selected or all tokens) through expert i
            else
                SeqExperts->>Expert: forward only routed tokens through expert i
            end
            Expert-->>SeqExperts: expert output for tokens
            SeqExperts->>SeqExperts: weight outputs by top_k_weights and scatter into final_hidden_states
        end
        SeqExperts-->>Calibration: final_hidden_states
        deactivate SeqExperts
Estimated Code Review Effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/quantization_w8a8_fp8/gemma4_example.py`:
- Around line 35-37: The hard-coded SAVE_DIR uses a personal absolute path which
breaks portability; change the SAVE_DIR computation to derive a model-local path
(e.g., base on MODEL_ID name) or make it configurable via an environment
variable/CLI flag, and use os.path.join and safe string handling when building
the path; update the code locations that call model.save_pretrained and
processor.save_pretrained to use the new SAVE_DIR variable (referenced symbols:
MODEL_ID, SAVE_DIR, model.save_pretrained, processor.save_pretrained).
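As a sketch of this suggestion, a portable save path could be derived as follows. The `LLMC_SAVE_DIR` environment variable name and the `MODEL_ID` value are illustrative assumptions, not names from the example script:

```python
# Hypothetical sketch of a portable SAVE_DIR; the env var name and
# model id below are illustrative assumptions only.
import os

MODEL_ID = "google/gemma-4-26b-a4b-it"  # illustrative model id

base_dir = os.environ.get("LLMC_SAVE_DIR", os.getcwd())
model_name = MODEL_ID.rstrip("/").split("/")[-1]  # safe even with a trailing slash
SAVE_DIR = os.path.join(base_dir, model_name + "-FP8-Dynamic")
```

The later `model.save_pretrained(SAVE_DIR)` and `processor.save_pretrained(SAVE_DIR)` calls would then use this path unchanged.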
In `@src/llmcompressor/modeling/gemma4.py`:
- Around line 29-45: The constructor currently assumes config has a text_config
attribute; update __init__ to normalize the incoming config so it accepts either
a Gemma4Config (with .text_config) or a Gemma4TextConfig directly: detect if
config has a text_config attribute (or is instance of Gemma4Config) and set a
local text_config = config.text_config otherwise set text_config = config, then
use text_config when constructing Gemma4TextExpertsList (replace the direct use
of config.text_config with the normalized text_config); keep existing parameter
names (original, config, calibrate_all_experts) and other fields (num_experts,
hidden_dim, intermediate_dim, calibrate_all_experts) unchanged.
- Around line 90-116: The unpacked experts' Linear layer metadata must be
patched to reflect the expert dimensions (moe_intermediate_size) instead of the
full MLP sizes: after assigning weights in Gemma4TextExperts.__init__ (inside
the loop over self[i] created from Gemma4TextMLP), set the per-expert attributes
so the quantization pipeline sees the correct shapes — e.g. set
self[i].intermediate_size = intermediate_size; set self[i].gate_proj.in_features
= config.hidden_size and self[i].gate_proj.out_features = intermediate_size; set
self[i].up_proj.in_features = config.hidden_size and
self[i].up_proj.out_features = intermediate_size; set
self[i].down_proj.in_features = intermediate_size and
self[i].down_proj.out_features = config.hidden_size (and any equivalent
.weight.shape-derived metadata) so gate_proj/up_proj/down_proj metadata matches
the copied weights.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 2b4b0849-2cea-4558-9b6a-d606b17ac04b
📒 Files selected for processing (4)
examples/quantization_w4a4_fp4/gemma4_example.py
examples/quantization_w8a8_fp8/gemma4_example.py
src/llmcompressor/modeling/__init__.py
src/llmcompressor/modeling/gemma4.py
    def __init__(
        self,
        original: Gemma4TextExperts,
        config: Gemma4Config,
        calibrate_all_experts: bool = True,
    ):
        super().__init__()
        self.num_experts = original.num_experts
        self.hidden_dim = original.hidden_dim
        self.intermediate_dim = original.intermediate_dim
        self.calibrate_all_experts = calibrate_all_experts

        # Unpack the 3D expert weights into individual MLP modules
        # Register experts directly as numbered children to avoid double nesting
        # (HF has layers[i].experts, so we want layers[i].experts.0,
        # not layers[i].experts.experts.0)
        expert_list = Gemma4TextExpertsList(config.text_config, original)
🧩 Analysis chain
🌐 Web query:
In the latest Hugging Face Transformers Gemma4 documentation/source, what config classes do Gemma4TextModel, Gemma4ForCausalLM, and Gemma4ForConditionalGeneration take?
💡 Result:
Gemma4TextModel takes Gemma4TextConfig. [1]
Gemma4ForCausalLM takes Gemma4TextConfig. [2]
Gemma4ForConditionalGeneration takes Gemma4Config. [3]
Sources:
[1] Hugging Face Transformers Gemma4 docs (Gemma4TextModel signature) (huggingface.co)
[2] Hugging Face Transformers Gemma4 docs (Gemma4ForCausalLM signature) (huggingface.co)
[3] Hugging Face Transformers Gemma4 docs (Gemma4ForConditionalGeneration signature) (huggingface.co)
Citations:
- 1: https://huggingface.co/docs/transformers/model_doc/gemma4
- 2: https://huggingface.co/docs/transformers/model_doc/gemma4
- 3: https://huggingface.co/docs/transformers/model_doc/gemma4
🏁 Script executed:
# First, let's check the file at the specified lines
cat -n src/llmcompressor/modeling/gemma4.py | sed -n '20,60p'
Repository: vllm-project/llm-compressor
Length of output: 1820
🏁 Script executed:
# Also check what Gemma4TextExpertsList expects
rg -A 10 "class Gemma4TextExpertsList" src/
Repository: vllm-project/llm-compressor
Length of output: 936
🏁 Script executed:
# Check how SequentialGemma4TextExperts is instantiated
rg -B 5 -A 2 "SequentialGemma4TextExperts" src/
Repository: vllm-project/llm-compressor
Length of output: 1332
🏁 Script executed:
# Check the imports to understand config types
head -20 src/llmcompressor/modeling/gemma4.py
Repository: vllm-project/llm-compressor
Length of output: 617
🏁 Script executed:
# Search for where SequentialGemma4TextExperts is instantiated or called
rg "SequentialGemma4TextExperts\|Gemma4TextExperts" src/ --type py -B 2 -A 2 | head -100
Repository: vllm-project/llm-compressor
Length of output: 53
🏁 Script executed:
# Check if there's a test or calibration flow that shows how this is called
find . -name "*.py" -type f -exec grep -l "SequentialGemma4TextExperts\|Gemma4TextExperts" {} \; | head -10
Repository: vllm-project/llm-compressor
Length of output: 252
🏁 Script executed:
# Check the example files to see how configs are passed through the calibration flow
cat src/examples/quantization_w4a4_fp4/gemma4_example.py 2>/dev/null || cat ./examples/quantization_w4a4_fp4/gemma4_example.py
Repository: vllm-project/llm-compressor
Length of output: 2554
🏁 Script executed:
# Also check the other example
cat src/examples/quantization_w8a8_fp8/gemma4_example.py 2>/dev/null || cat ./examples/quantization_w8a8_fp8/gemma4_example.py
Repository: vllm-project/llm-compressor
Length of output: 1325
🏁 Script executed:
# Search for the MoECalibrationModule to understand how it instantiates these modules
rg "class MoECalibrationModule" -A 30 src/
Repository: vllm-project/llm-compressor
Length of output: 2474
Normalize config to handle both Gemma4Config and Gemma4TextConfig at input.
Gemma4ForConditionalGeneration takes Gemma4Config (with nested text_config), while Gemma4TextModel and Gemma4ForCausalLM take Gemma4TextConfig directly. The current code unconditionally accesses config.text_config, which fails if this calibration wrapper is used in a text-only model path. Normalizing the config input makes the module reusable across both entry points.
Suggested fix
- config: Gemma4Config,
+ config: Gemma4Config | Gemma4TextConfig,
@@
- expert_list = Gemma4TextExpertsList(config.text_config, original)
+ text_config = getattr(config, "text_config", config)
+ expert_list = Gemma4TextExpertsList(text_config, original)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llmcompressor/modeling/gemma4.py` around lines 29 - 45, The constructor
currently assumes config has a text_config attribute; update __init__ to
normalize the incoming config so it accepts either a Gemma4Config (with
.text_config) or a Gemma4TextConfig directly: detect if config has a text_config
attribute (or is instance of Gemma4Config) and set a local text_config =
config.text_config otherwise set text_config = config, then use text_config when
constructing Gemma4TextExpertsList (replace the direct use of config.text_config
with the normalized text_config); keep existing parameter names (original,
config, calibrate_all_experts) and other fields (num_experts, hidden_dim,
intermediate_dim, calibrate_all_experts) unchanged.
There was a problem hiding this comment.
This is true, Gemma4ForCausalLM uses a text config
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma4/modeling_gemma4.py#L1686
There was a problem hiding this comment.
@kylesayrs, thanks for confirming! The HuggingFace Transformers source indeed shows Gemma4ForCausalLM.__init__ takes Gemma4TextConfig directly — so config.text_config would raise an AttributeError on the text-only model path.
Here is the concrete fix for src/llmcompressor/modeling/gemma4.py:
if TYPE_CHECKING:
from transformers.models.gemma4.configuration_gemma4 import Gemma4TextConfig
from transformers.models.gemma4.modeling_gemma4 import (
Gemma4Config,
Gemma4TextExperts,
)
...
def __init__(
self,
original: Gemma4TextExperts,
- config: Gemma4Config,
+ config: "Gemma4Config | Gemma4TextConfig",
calibrate_all_experts: bool = True,
):
super().__init__()
self.num_experts = original.num_experts
self.hidden_dim = original.hidden_dim
self.intermediate_dim = original.intermediate_dim
self.calibrate_all_experts = calibrate_all_experts
- expert_list = Gemma4TextExpertsList(config.text_config, original)
+ text_config = getattr(config, "text_config", config)
+ expert_list = Gemma4TextExpertsList(text_config, original)
getattr(config, "text_config", config) safely handles both:
- Gemma4Config (multimodal) → extracts .text_config
- Gemma4TextConfig (text-only) → falls back to config itself
✏️ Learnings added
Learnt from: kylesayrs
Repo: vllm-project/llm-compressor PR: 2565
File: src/llmcompressor/modeling/gemma4.py:29-45
Timestamp: 2026-04-06T21:57:15.076Z
Learning: In `src/llmcompressor/modeling/gemma4.py`, `SequentialGemma4TextExperts.__init__` receives a `config` argument that can be either `Gemma4Config` (multimodal, with a `.text_config` attribute) or `Gemma4TextConfig` (text-only, no `.text_config`). The correct normalization is `text_config = getattr(config, "text_config", config)` before passing it to `Gemma4TextExpertsList`.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
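The normalization recorded above can be sketched with stand-in objects. SimpleNamespace here is a toy substitute for the real Gemma4Config / Gemma4TextConfig classes; the attribute values are illustrative:

```python
# Toy demonstration of getattr-based config normalization; the
# SimpleNamespace objects stand in for Gemma4Config / Gemma4TextConfig.
from types import SimpleNamespace

text_cfg = SimpleNamespace(num_experts=8, moe_intermediate_size=704)
multimodal_cfg = SimpleNamespace(text_config=text_cfg)

def normalize(config):
    # Multimodal configs expose .text_config; text-only configs
    # *are* the text config, so fall back to config itself.
    return getattr(config, "text_config", config)

assert normalize(multimodal_cfg) is text_cfg  # multimodal path
assert normalize(text_cfg) is text_cfg        # text-only path
```

The same one-liner works unchanged for any config pair that follows this nested/flat pattern.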
    def __init__(self, config: Gemma4TextConfig, original: Gemma4TextExperts):
        from transformers.models.gemma4.modeling_gemma4 import Gemma4TextMLP

        self.num_experts = config.num_experts
        intermediate_size = config.moe_intermediate_size

        with skip_weights_initialize():
            super().__init__(
                [Gemma4TextMLP(config, layer_idx=0) for _ in range(self.num_experts)]
            )

        gate_up_data = original.gate_up_proj.data  # [num_experts, 2*inter, hidden]
        down_data = original.down_proj.data  # [num_experts, hidden, inter]

        for i in range(self.num_experts):
            gate_up = gate_up_data[i]  # [2*intermediate, hidden]
            down = down_data[i]  # [hidden, intermediate]

            # gate_up_proj stores [gate; up] stacked along dim 0
            # nn.Linear weight is [out_features, in_features]
            self[i].gate_proj.weight.data = (
                gate_up[:intermediate_size, :].clone().contiguous()
            )
            self[i].up_proj.weight.data = (
                gate_up[intermediate_size:, :].clone().contiguous()
            )
            self[i].down_proj.weight.data = down.clone().contiguous()
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
rg -n -C2 '\.(in_features|out_features)\b' src/llmcompressor
Repository: vllm-project/llm-compressor
Length of output: 1674
🏁 Script executed:
cat -n src/llmcompressor/modeling/gemma4.py | head -150 | tail -80
Repository: vllm-project/llm-compressor
Length of output: 3955
🏁 Script executed:
# Get more context around the file
wc -l src/llmcompressor/modeling/gemma4.py
Repository: vllm-project/llm-compressor
Length of output: 111
🏁 Script executed:
# Check if Gemma4TextExperts class exists and understand its full context
rg -A 40 "class Gemma4TextExperts" src/llmcompressor/modeling/gemma4.py
Repository: vllm-project/llm-compressor
Length of output: 1537
🏁 Script executed:
# Find where this class is used to understand impact
rg -n "Gemma4TextExperts" src/llmcompressor/
Repository: vllm-project/llm-compressor
Length of output: 937
🏁 Script executed:
# Check if there are more lines after 116 that patch the metadata
tail -n +110 src/llmcompressor/modeling/gemma4.py
Repository: vllm-project/llm-compressor
Length of output: 394
🏁 Script executed:
# Check where Gemma4TextMLP comes from and understand what it expects
rg -n "class Gemma4TextMLP|intermediate_size" src/llmcompressor/ | grep -i gemma
Repository: vllm-project/llm-compressor
Length of output: 385
🏁 Script executed:
# Also check if there's documentation or tests showing the mismatch
rg -n "moe_intermediate_size|intermediate_size" src/llmcompressor/modeling/gemma4.py
Repository: vllm-project/llm-compressor
Length of output: 274
🏁 Script executed:
# Verify the actual config values to confirm the mismatch
# Search for how config is structured
rg -n "moe_intermediate_size|class Gemma4TextConfig" src/llmcompressor/ -A 3
Repository: vllm-project/llm-compressor
Length of output: 1034
🏁 Script executed:
# Check if there's any recent commit or indication that this was fixed
git log --oneline -20 -- src/llmcompressor/modeling/gemma4.py 2>/dev/null || echo "No git history available"
Repository: vllm-project/llm-compressor
Length of output: 119
🏁 Script executed:
# Check if Gemma4TextMLP accepts intermediate_size parameter like GlmMoeDsaMLP does
rg -n "def __init__.*Gemma4TextMLP|intermediate_size" --type py | head -20
Repository: vllm-project/llm-compressor
Length of output: 2178
🏁 Script executed:
# Look at how GlmMoeDsaMLP differs - it passes intermediate_size param
rg -B 5 -A 15 "class GlmMoeDsaMLP" src/llmcompressor/modeling/glm_moe_dsa.py | head -30
Repository: vllm-project/llm-compressor
Length of output: 53
🏁 Script executed:
# Let's verify the actual parameter signature expected by Gemma4TextMLP
# Check the HF transformers v5.5.0 implementation referenced in the review
rg -n "class Gemma4TextMLP|def __init__" src/llmcompressor/modeling/gemma4.py
Repository: vllm-project/llm-compressor
Length of output: 173
🏁 Script executed:
# Check if there are any imports or usage of intermediate_size in relation to Gemma4TextMLP
rg -B 3 -A 10 "Gemma4TextMLP" src/llmcompressor/modeling/gemma4.py
Repository: vllm-project/llm-compressor
Length of output: 1152
🏁 Script executed:
# Check if there's any config modification or alternative approach in the full gemma4.py file
cat -n src/llmcompressor/modeling/gemma4.py
Repository: vllm-project/llm-compressor
Length of output: 5225
Patch the unpacked experts' Linear layer metadata to match expert shapes.
When copying Gemma4TextExpertsList weights from the packed format, the weight tensors are sized by config.moe_intermediate_size (e.g., 704 on Gemma-4-26B-A4B-it), but Gemma4TextMLP is initialized with the full config, which has config.intermediate_size (e.g., 2112). This leaves in_features, out_features, and intermediate_size describing the dense MLP, not the expert dimensions. The quantization pipeline (e.g., imatrix observer, AWQ balancing) reads these attributes to validate shapes; the mismatch will cause incorrect quantization behavior.
Required metadata patch
for i in range(self.num_experts):
gate_up = gate_up_data[i] # [2*intermediate, hidden]
down = down_data[i] # [hidden, intermediate]
@@ -109,9 +109,15 @@ class Gemma4TextExpertsList(torch.nn.ModuleList):
self[i].gate_proj.weight.data = (
gate_up[:intermediate_size, :].clone().contiguous()
)
+ self[i].gate_proj.in_features = config.hidden_size
+ self[i].gate_proj.out_features = intermediate_size
self[i].up_proj.weight.data = (
gate_up[intermediate_size:, :].clone().contiguous()
)
+ self[i].up_proj.in_features = config.hidden_size
+ self[i].up_proj.out_features = intermediate_size
self[i].down_proj.weight.data = down.clone().contiguous()
+ self[i].down_proj.in_features = intermediate_size
+ self[i].down_proj.out_features = config.hidden_size
+ self[i].intermediate_size = intermediate_size
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    def __init__(self, config: Gemma4TextConfig, original: Gemma4TextExperts):
        from transformers.models.gemma4.modeling_gemma4 import Gemma4TextMLP

        self.num_experts = config.num_experts
        intermediate_size = config.moe_intermediate_size

        with skip_weights_initialize():
            super().__init__(
                [Gemma4TextMLP(config, layer_idx=0) for _ in range(self.num_experts)]
            )

        gate_up_data = original.gate_up_proj.data  # [num_experts, 2*inter, hidden]
        down_data = original.down_proj.data  # [num_experts, hidden, inter]

        for i in range(self.num_experts):
            gate_up = gate_up_data[i]  # [2*intermediate, hidden]
            down = down_data[i]  # [hidden, intermediate]

            # gate_up_proj stores [gate; up] stacked along dim 0
            # nn.Linear weight is [out_features, in_features]
            self[i].gate_proj.weight.data = (
                gate_up[:intermediate_size, :].clone().contiguous()
            )
            self[i].gate_proj.in_features = config.hidden_size
            self[i].gate_proj.out_features = intermediate_size
            self[i].up_proj.weight.data = (
                gate_up[intermediate_size:, :].clone().contiguous()
            )
            self[i].up_proj.in_features = config.hidden_size
            self[i].up_proj.out_features = intermediate_size
            self[i].down_proj.weight.data = down.clone().contiguous()
            self[i].down_proj.in_features = intermediate_size
            self[i].down_proj.out_features = config.hidden_size
            self[i].intermediate_size = intermediate_size
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/llmcompressor/modeling/gemma4.py` around lines 90 - 116, The unpacked
experts' Linear layer metadata must be patched to reflect the expert dimensions
(moe_intermediate_size) instead of the full MLP sizes: after assigning weights
in Gemma4TextExperts.__init__ (inside the loop over self[i] created from
Gemma4TextMLP), set the per-expert attributes so the quantization pipeline sees
the correct shapes — e.g. set self[i].intermediate_size = intermediate_size; set
self[i].gate_proj.in_features = config.hidden_size and
self[i].gate_proj.out_features = intermediate_size; set
self[i].up_proj.in_features = config.hidden_size and
self[i].up_proj.out_features = intermediate_size; set
self[i].down_proj.in_features = intermediate_size and
self[i].down_proj.out_features = config.hidden_size (and any equivalent
.weight.shape-derived metadata) so gate_proj/up_proj/down_proj metadata matches
the copied weights.
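The shape mismatch this prompt describes can be illustrated with a lightweight stand-in for nn.Linear. LinearStub and all sizes below are hypothetical, chosen only to mirror the dense-vs-expert dimensions discussed above:

```python
# LinearStub is a hypothetical stand-in for torch.nn.Linear, used only to
# show why in_features/out_features must be patched after copying
# expert-sized weights into a layer built for the dense MLP.
class LinearStub:
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features
        # nn.Linear stores weight as [out_features, in_features]
        self.weight_shape = (out_features, in_features)

hidden_size = 2880          # illustrative
intermediate_size = 704     # expert size (moe_intermediate_size), illustrative
dense_intermediate = 2112   # dense MLP size the layer was built with

gate_proj = LinearStub(hidden_size, dense_intermediate)

# Copy an expert-sized weight, then patch metadata to match it
gate_proj.weight_shape = (intermediate_size, hidden_size)
gate_proj.out_features = intermediate_size

# Metadata now agrees with the copied weight shape
assert gate_proj.weight_shape == (gate_proj.out_features, gate_proj.in_features)
```

Without the two patched attributes, any consumer that trusts `out_features` (rather than `weight.shape`) would still see the dense size 2112.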
    oneshot(model=model, recipe=recipe)

    # Save to disk in compressed-tensors format.
    SAVE_DIR = "/raid/engine/dsikka/" + MODEL_ID.rstrip("/").split("/")[-1] + "-FP8-Dynamic"
    def __init__(
        self,
        original: Gemma4TextExperts,
        config: Gemma4Config,
        calibrate_all_experts: bool = True,
    ):
        super().__init__()
        self.num_experts = original.num_experts
        self.hidden_dim = original.hidden_dim
        self.intermediate_dim = original.intermediate_dim
        self.calibrate_all_experts = calibrate_all_experts

        # Unpack the 3D expert weights into individual MLP modules
        # Register experts directly as numbered children to avoid double nesting
        # (HF has layers[i].experts, so we want layers[i].experts.0,
        # not layers[i].experts.experts.0)
        expert_list = Gemma4TextExpertsList(config.text_config, original)
This is true, Gemma4ForCausalLM uses a text config
https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma4/modeling_gemma4.py#L1686
    # gate_up_proj stores [gate; up] stacked along dim 0
    # nn.Linear weight is [out_features, in_features]
    self[i].gate_proj.weight.data = (
        gate_up[:intermediate_size, :].clone().contiguous()
This is an out-of-scope problem, but creating the new Gemma4TextExpertsList instance while the original module is still alive causes a memory spike, which may cause OOM for large models (for example, kimi-k2 is 34 GB; doubling that to 68 GB is a large cost).
    # gate_up_proj stores [gate; up] stacked along dim 0
    # nn.Linear weight is [out_features, in_features]
    self[i].gate_proj.weight.data = (
        gate_up[:intermediate_size, :].clone().contiguous()
Clone and contiguous are redundant
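A small sketch of this point, assuming PyTorch's standard slicing and clone semantics (shapes are illustrative): a first-dim slice of a contiguous tensor is itself contiguous, and clone() preserves that layout, so the trailing .contiguous() is a no-op here.

```python
# A first-dim slice of a contiguous tensor keeps its strides, and clone()
# produces a contiguous copy, so chaining .contiguous() after .clone()
# does nothing extra for these slices.
import torch

intermediate_size = 4
gate_up = torch.randn(2 * intermediate_size, 6)  # packed [gate; up] weight

gate = gate_up[:intermediate_size, :]            # view, still contiguous
assert gate.is_contiguous()

cloned = gate.clone()                            # already a contiguous copy
assert cloned.is_contiguous()
assert cloned.contiguous() is cloned             # returns self when contiguous
```

Note the column slice `gate_up[:, :k]` would *not* be contiguous, so the redundancy holds specifically for the dim-0 slices used here.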
    key: (
        torch.tensor(value)
        if key != "pixel_values"
        else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
- else torch.tensor(value, dtype=torch.bfloat16).squeeze(0)
+ else torch.tensor(value, dtype=model.dtype).squeeze(0)
Code changes were requested by @dsikka.
* #2565 (comment)
The following files were modified:
* `examples/quantization_w8a8_fp8/gemma4_example.py`
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
The quality checks have failed. Please run

The quality checks have failed. Please run

@coderabbitai review and add labels

The quality checks have failed. Please run

The quality checks have failed. Please run

@coderabbitai Review and label

@coderabbitai review

@coderabbitai full review
SUMMARY:
Testing:
Summary by CodeRabbit
New Features
Documentation