Conversation
Pull request overview
Updates the Gemma4 vision forward path to reduce memory pressure by avoiding unnecessary padding on the CPU path and by releasing mmapped weight pages during CPU-layer execution.
Changes:
- Avoids padding on the CPU vision path by setting `max_patches = num_real` when not using the NPU encoder.
- Refactors CPU vision encoding to execute layer-by-layer, soft-reset the graph between layers, and release selected weight pages after each layer.
- Releases patch embedding weight pages (`patch_input_proj`, `position_table`) at the end of the CPU path.
```cpp
for (uint32_t i = 0; i < config_.vision_num_layers; i++) {
  auto [cn, sn] = build_2d_rope_nodes(gb, img, num_real);
  hidden = build_vision_transformer_block(gb, hidden, i, cn, sn, 0, backend);
  gb->execute();
```
In the CPU path, `build_2d_rope_nodes(gb, img, num_real)` is called inside the per-layer loop. This recomputes the full RoPE cos/sin tables (pow/sin/cos over all patches) for every transformer layer, a significant and avoidable CPU cost, and it also leaves the `cos_node`/`sin_node` created before the `if (can_use_npu_path)` branch unused on this path. Consider computing the RoPE tables once outside the loop and reusing them across `soft_reset_keep_pool()` (e.g., via `set_external_input` on INPUT nodes or `gb->persistent(...)`) so resets do not force table regeneration.
```cpp
  lw.attn_output_weight, lw.mlp_gate_proj, lw.mlp_up_proj, lw.mlp_down_proj})
    gb->release_weight_pages(wn);
```
`release_weight_pages(...)` only releases the large projection matrices, but the layer also uses other mmapped weights (`attn_q_norm`, `attn_k_norm`, the various layernorm weights, and the optional `layer_scalar`). If the intent is to reduce RSS/working set between layers, those weights should be released as well; otherwise a substantial portion of the per-layer weight pages may remain resident.
Suggested change:

```diff
-  lw.attn_output_weight, lw.mlp_gate_proj, lw.mlp_up_proj, lw.mlp_down_proj})
-    gb->release_weight_pages(wn);
+  lw.attn_output_weight, lw.mlp_gate_proj, lw.mlp_up_proj, lw.mlp_down_proj,
+  lw.attn_q_norm, lw.attn_k_norm,
+  lw.pre_attn_norm, lw.post_attn_norm,
+  lw.pre_ffw_norm, lw.post_ffw_norm}) {
+    gb->release_weight_pages(wn);
+  }
+  if (lw.layer_scalar != 0)
+    gb->release_weight_pages(lw.layer_scalar);
```
```cpp
gb->release_weight_pages(vision_weights_.patch_input_proj);
gb->release_weight_pages(vision_weights_.position_table);
```
`patch_input_proj` and `position_table` are only needed to build the initial patch embedding; by the time the first layer has executed and you call `soft_reset_keep_pool()`, they are no longer referenced by the new per-layer graphs. To maximize memory savings during the remaining layers, consider releasing these pages immediately after the first `gb->execute()` (or before entering later iterations) rather than only after the full loop completes.