
gemma4 vision memory optimizations #590

Merged
HenryNdubuaku merged 1 commit into main from gemma4-memory-optimizations on Apr 16, 2026

Conversation

@jakmro (Collaborator) commented Apr 16, 2026

No description provided.

@jakmro jakmro marked this pull request as ready for review April 16, 2026 10:41
Copilot AI review requested due to automatic review settings April 16, 2026 10:41
@jakmro jakmro changed the title optimize weight management → gemma4 vision memory optimizations Apr 16, 2026
Copilot AI (Contributor) left a comment


Pull request overview

Updates the Gemma4 vision forward path to reduce memory pressure by avoiding unnecessary padding on the CPU path and by releasing mmapped weight pages during CPU-layer execution.

Changes:

  • Avoids padding on the CPU vision path by setting max_patches = num_real when not using the NPU encoder.
  • Refactors CPU vision encoding to execute layer-by-layer, soft-reset the graph between layers, and release selected weight pages after each layer.
  • Releases patch embedding weight pages (patch_input_proj, position_table) at the end of the CPU path.
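The layer-by-layer pattern described in the change list can be sketched as follows. All types here are illustrative mocks, not the real Gemma4 API: `MockGraphBuilder` stands in for the graph builder in the diff (only `execute`, `soft_reset_keep_pool`, and `release_weight_pages` are taken from the diff), and the q/k/v projection fields in `LayerWeights` are assumed names.

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>
#include <set>
#include <vector>

// Mock of the graph-builder interface implied by the diff.
struct MockGraphBuilder {
  int executions = 0;
  int soft_resets = 0;
  std::set<uint32_t> released;

  void execute() { ++executions; }                       // run the current graph
  void soft_reset_keep_pool() { ++soft_resets; }         // drop nodes, keep the arena
  void release_weight_pages(uint32_t id) { released.insert(id); }
};

// Hypothetical per-layer weight handles; attn_output_weight and the mlp_*
// names mirror the diff, the q/k/v projections are assumptions.
struct LayerWeights {
  uint32_t attn_q_proj, attn_k_proj, attn_v_proj, attn_output_weight;
  uint32_t mlp_gate_proj, mlp_up_proj, mlp_down_proj;
};

// Build one transformer block, run it, release its large projection pages,
// then soft-reset so the next layer starts from an empty node list while
// reusing the same memory pool.
int run_cpu_vision_layers(MockGraphBuilder* gb,
                          const std::vector<LayerWeights>& layers) {
  for (const auto& lw : layers) {
    // ... build_vision_transformer_block(...) would be emitted here ...
    gb->execute();
    for (uint32_t wn : {lw.attn_q_proj, lw.attn_k_proj, lw.attn_v_proj,
                        lw.attn_output_weight, lw.mlp_gate_proj,
                        lw.mlp_up_proj, lw.mlp_down_proj})
      gb->release_weight_pages(wn);
    gb->soft_reset_keep_pool();
  }
  return gb->executions;
}
```

The point of the soft reset between layers is that only one layer's node graph is ever alive at a time, so peak graph memory stays roughly constant in the number of layers.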


Comment on lines +527 to +530
for (uint32_t i = 0; i < config_.vision_num_layers; i++) {
auto [cn, sn] = build_2d_rope_nodes(gb, img, num_real);
hidden = build_vision_transformer_block(gb, hidden, i, cn, sn, 0, backend);
gb->execute();

Copilot AI Apr 16, 2026


In the CPU path, build_2d_rope_nodes(gb, img, num_real) is called inside the per-layer loop. This recomputes the full RoPE cos/sin tables (pow/sin/cos over all patches) for every transformer layer, which is a significant and avoidable CPU cost, and it also makes the earlier cos_node/sin_node created before the if (can_use_npu_path) branch effectively unused on this path. Consider computing the RoPE tables once outside the loop and reusing them across soft_reset_keep_pool() (e.g., via set_external_input on INPUT nodes or gb->persistent(...)) so resets don't force table regeneration.
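A minimal model of the cost being flagged, under stated assumptions: `build_rope_tables` is a hypothetical stand-in for `build_2d_rope_nodes` (the real function builds graph nodes, not vectors), and the counter just makes the per-layer rebuild measurable.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Counts how many times the (expensive) RoPE tables are regenerated.
static int rope_builds = 0;

// Illustrative stand-in for build_2d_rope_nodes: computes cos/sin tables
// over all patches with the usual pow/sin/cos pattern.
std::pair<std::vector<float>, std::vector<float>>
build_rope_tables(int num_patches, int head_dim) {
  ++rope_builds;
  std::vector<float> cos_t(num_patches * head_dim), sin_t(num_patches * head_dim);
  for (int p = 0; p < num_patches; ++p)
    for (int d = 0; d < head_dim; ++d) {
      float theta = p * std::pow(10000.0f, -2.0f * d / head_dim);
      cos_t[p * head_dim + d] = std::cos(theta);
      sin_t[p * head_dim + d] = std::sin(theta);
    }
  return {cos_t, sin_t};
}

// Pattern the review flags: the tables are rebuilt inside the per-layer loop,
// so the cost scales with the number of layers.
int run_rebuilding(int num_layers, int num_patches, int head_dim) {
  rope_builds = 0;
  for (int i = 0; i < num_layers; ++i) {
    auto [cn, sn] = build_rope_tables(num_patches, head_dim);
    (void)cn; (void)sn;  // ... run transformer block i with cn/sn ...
  }
  return rope_builds;
}

// Pattern the review suggests: build once, reuse across soft resets
// (e.g. by re-binding the tables as external inputs after each reset).
int run_hoisted(int num_layers, int num_patches, int head_dim) {
  rope_builds = 0;
  auto [cn, sn] = build_rope_tables(num_patches, head_dim);
  for (int i = 0; i < num_layers; ++i) {
    (void)cn; (void)sn;  // ... rebind cn/sn after soft_reset_keep_pool() ...
  }
  return rope_builds;
}
```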

Comment on lines +540 to +541
lw.attn_output_weight, lw.mlp_gate_proj, lw.mlp_up_proj, lw.mlp_down_proj})
gb->release_weight_pages(wn);

Copilot AI Apr 16, 2026


release_weight_pages(...) only releases the large projection matrices, but the layer also uses other mmapped weights (attn_q_norm, attn_k_norm, the various layernorm weights, and optional layer_scalar). If the intent is to reduce RSS/working set between layers, those weights should be included as well; otherwise a substantial portion of the per-layer weight pages may remain resident.

Suggested change:

-                  lw.attn_output_weight, lw.mlp_gate_proj, lw.mlp_up_proj, lw.mlp_down_proj})
-    gb->release_weight_pages(wn);
+                  lw.attn_output_weight, lw.mlp_gate_proj, lw.mlp_up_proj, lw.mlp_down_proj,
+                  lw.attn_q_norm, lw.attn_k_norm,
+                  lw.pre_attn_norm, lw.post_attn_norm,
+                  lw.pre_ffw_norm, lw.post_ffw_norm}) {
+    gb->release_weight_pages(wn);
+  }
+  if (lw.layer_scalar != 0)
+    gb->release_weight_pages(lw.layer_scalar);

Comment on lines +547 to +548
gb->release_weight_pages(vision_weights_.patch_input_proj);
gb->release_weight_pages(vision_weights_.position_table);

Copilot AI Apr 16, 2026


patch_input_proj and position_table are only needed to build the initial patch embedding; by the time the first layer has executed and you call soft_reset_keep_pool(), they are no longer referenced by the new per-layer graphs. To maximize memory savings during the remaining layers, consider releasing these pages immediately after the first gb->execute() (or before entering later iterations) rather than only after the full loop completes.

@HenryNdubuaku HenryNdubuaku merged commit 1d51143 into main Apr 16, 2026
10 of 14 checks passed

3 participants