Conversation
Pull request overview
Updates the Gemma4 vision forward path to reduce memory pressure by avoiding unnecessary padding on the CPU path and by releasing mmapped weight pages during CPU-layer execution.
Changes:
- Avoids padding on the CPU vision path by setting `max_patches = num_real` when not using the NPU encoder.
- Refactors CPU vision encoding to execute layer-by-layer, soft-reset the graph between layers, and release selected weight pages after each layer.
- Releases patch embedding weight pages (`patch_input_proj`, `position_table`) at the end of the CPU path.
```cpp
for (uint32_t i = 0; i < config_.vision_num_layers; i++) {
  auto [cn, sn] = build_2d_rope_nodes(gb, img, num_real);
  hidden = build_vision_transformer_block(gb, hidden, i, cn, sn, 0, backend);
  gb->execute();
```
In the CPU path, `build_2d_rope_nodes(gb, img, num_real)` is called inside the per-layer loop. This recomputes the full RoPE cos/sin tables (pow/sin/cos over all patches) for every transformer layer, a significant and avoidable CPU cost, and it also leaves the `cos_node`/`sin_node` created before the `if (can_use_npu_path)` branch unused on this path. Consider computing the RoPE tables once outside the loop and reusing them across `soft_reset_keep_pool()` (e.g., via `set_external_input` on INPUT nodes or `gb->persistent(...)`) so resets do not force table regeneration.
```cpp
  lw.attn_output_weight, lw.mlp_gate_proj, lw.mlp_up_proj, lw.mlp_down_proj})
    gb->release_weight_pages(wn);
```
`release_weight_pages(...)` only releases the large projection matrices, but the layer also uses other mmapped weights (`attn_q_norm`, `attn_k_norm`, the various layernorm weights, and the optional `layer_scalar`). If the intent is to reduce RSS/working set between layers, those weights should be released as well; otherwise a substantial portion of the per-layer weight pages may remain resident.
Suggested change:

```diff
-  lw.attn_output_weight, lw.mlp_gate_proj, lw.mlp_up_proj, lw.mlp_down_proj})
-    gb->release_weight_pages(wn);
+  lw.attn_output_weight, lw.mlp_gate_proj, lw.mlp_up_proj, lw.mlp_down_proj,
+  lw.attn_q_norm, lw.attn_k_norm,
+  lw.pre_attn_norm, lw.post_attn_norm,
+  lw.pre_ffw_norm, lw.post_ffw_norm}) {
+    gb->release_weight_pages(wn);
+  }
+  if (lw.layer_scalar != 0)
+    gb->release_weight_pages(lw.layer_scalar);
```
```cpp
gb->release_weight_pages(vision_weights_.patch_input_proj);
gb->release_weight_pages(vision_weights_.position_table);
```
`patch_input_proj` and `position_table` are only needed to build the initial patch embedding; by the time the first layer has executed and you call `soft_reset_keep_pool()`, they are no longer referenced by the new per-layer graphs. To maximize memory savings during the remaining layers, consider releasing these pages immediately after the first `gb->execute()` (or before entering later iterations) rather than only after the full loop completes.