Gemma4 multimodal multi-turn fixes#598

Merged
HenryNdubuaku merged 1 commit into main from multi-modal-turn
Apr 21, 2026

Conversation

@ncylich
Collaborator

@ncylich ncylich commented Apr 18, 2026

Summary

Fixes multi-turn multimodal chat for Gemma 4 so an audio question asked after an image turn actually reaches the model (previously the model kept describing the image instead of answering the audio).

What changed

  • Audio placeholder restoration. Cache each user turn's audio soft-token count on the handle; the template re-emits the same placeholder block on subsequent turns without reloading or re-encoding prior audio.
  • decode_multimodal cleanup. Dropped the prefill_completed_ / last_token_count_ bookkeeping (which the text/image decode path was silently invalidating). Now uses kv_cache_.current_seq_len as the single source of truth and dispatches to forward_multimodal only when the delta actually contains image or audio placeholder tokens.
  • BPE whitespace-only merges. merges.txt uses \n as both line terminator and merge content, so merges like \n + \n = \n\n can't be encoded in that format and were silently dropped. Recover them by scanning the vocabulary after load for pure-whitespace tokens and synthesizing the corresponding merge rules. This restores encode(decode(tokens)) == tokens for the common markdown/paragraph cases.
  • chat.cpp image behaviour. The persistent current_image was being re-attached to every subsequent user turn, creating a duplicate image placeholder block in the cache and biasing the model toward describing the image. Now it's only attached on the first turn after /image.
  • Cache divergence recovery. For any residual tokenizer roundtrip gap (e.g. code blocks with mixed indent+newline tokens), trim the cache to the longest common prefix with the new prompt and re-prefill the divergent suffix — logs a WARN so the remaining roundtrip work remains visible.
  • Test. tests/test_gemma4_audio_image_audio.cpp reproduces the bug end-to-end: feeds who_are_you.mp3 (audio-only), then banner.jpg + "describe this", then 2+2.mp3 with the image still attached, and asserts turn 3's response differs from turn 2's.
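The cache divergence recovery above can be sketched as follows. This is a minimal illustration, not the actual cactus code: the function names and the stand-in for the KV-cache trim are hypothetical, and the real implementation operates on the handle's processed_tokens and the model's KV cache.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative sketch: trim cached state to the longest common prefix of the
// cached tokens and the freshly tokenized prompt, then re-prefill only the
// divergent suffix. Names here are hypothetical, not the cactus API.
size_t longest_common_prefix(const std::vector<int>& cached,
                             const std::vector<int>& fresh) {
    size_t n = std::min(cached.size(), fresh.size());
    size_t i = 0;
    while (i < n && cached[i] == fresh[i]) ++i;
    return i;
}

void recover_from_divergence(std::vector<int>& cached_tokens,
                             const std::vector<int>& new_prompt) {
    size_t common = longest_common_prefix(cached_tokens, new_prompt);
    if (common < cached_tokens.size()) {
        std::fprintf(stderr,
                     "WARN: tokenizer roundtrip divergence, trimming cache "
                     "from %zu to %zu tokens\n",
                     cached_tokens.size(), common);
        cached_tokens.resize(common);  // stands in for the KV-cache trim
    }
    // Re-prefill only the suffix new_prompt[common..] (not shown here).
}
```

The WARN keeps residual tokenizer roundtrip gaps visible instead of papering over them silently.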

Test plan

  • tests/build/test_gemma4_audio_image_audio passes (turn 3 answers "2 plus 2 equals 4.")
  • chat manually: the primary audio → image+describe → audio sequence answers "4" in 3/3 runs
  • chat manually: audio → image+describe → text follow-up still describes correctly
  • chat manually: audio-only multi-turn unchanged ("Four.")
  • chat manually: /clear between image and audio recovers cleanly
  • Stress: 4 turns alternating audio / image+text / audio / audio, no regressions
  • Code-example prompt (previously tripped a hard error) now recovers via partial cache trim + WARN log

Run locally:

CACTUS_TEST_GEMMA4_MODEL=/path/to/gemma-4-e2b-it \
CACTUS_TEST_REPO_ROOT=/path/to/repo \
./tests/build/test_gemma4_audio_image_audio

Copilot AI review requested due to automatic review settings April 18, 2026 22:49
@kar-m
Collaborator

kar-m commented Apr 18, 2026

lgtm @HenryNdubuaku

Contributor

Copilot AI left a comment


Pull request overview

This PR fixes Gemma 4 multi-turn multimodal chat behavior, specifically ensuring that audio requested after an image turn is correctly routed to the model (instead of repeatedly biasing toward image description), and improves tokenizer round-trip stability for whitespace-heavy text.

Changes:

  • Prevents chat.cpp from re-attaching the same image on every subsequent user turn by committing the current image after first use.
  • Simplifies Gemma4 multimodal decode cache bookkeeping by using KV cache sequence length as the source of truth and only invoking multimodal forward when the delta contains media placeholders.
  • Restores missing BPE whitespace-only merges by synthesizing merge rules from vocabulary tokens, improving encode/decode round-trip behavior.
  • Adds per-user-turn audio soft-token count caching on the model handle and introduces KV-cache divergence recovery by trimming to the longest common prefix.
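The dispatch rule in the second point can be sketched like this. The placeholder token IDs below are made up for illustration; the real IDs come from the Gemma 4 tokenizer config.

```cpp
#include <vector>

// Sketch of the decode dispatch: run the multimodal forward path only when
// the newly-added token delta actually contains image or audio placeholder
// tokens; otherwise the plain text decode path suffices.
// These IDs are hypothetical, chosen only for illustration.
constexpr int kImagePlaceholderId = 262145;
constexpr int kAudioPlaceholderId = 262146;

bool delta_needs_multimodal(const std::vector<int>& delta) {
    for (int id : delta) {
        if (id == kImagePlaceholderId || id == kAudioPlaceholderId)
            return true;
    }
    return false;
}
```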

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Reviewed files:

  • tests/chat.cpp: Avoids re-sending the same image placeholder on every turn by tracking whether the image has been “committed”.
  • cactus/models/gemma4/model_gemma4_mm.cpp: Removes prefill_completed_/last_token_count_ and derives incremental decode behavior from kv_cache_.current_seq_len.
  • cactus/models/gemma4/model_gemma4.h: Drops multimodal decode bookkeeping fields from the model class.
  • cactus/ffi/cactus_utils.h: Extends the model handle with user_audio_counts to persist per-user audio placeholder lengths across turns.
  • cactus/ffi/cactus_complete.cpp: Clears user_audio_counts on cache reset, restores historical audio placeholder counts into prompts, and trims the KV cache on prompt/token divergence for audio/mixed-media.
  • cactus/engine/engine_bpe.cpp: Synthesizes whitespace-only merge rules from vocabulary tokens to recover merges that can't be represented in merges.txt.
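The whitespace-merge recovery in engine_bpe.cpp can be sketched as below. This is an illustrative reconstruction under assumptions: the vocab type, the function name, and the leftmost-split heuristic are mine, not necessarily what the PR implements.

```cpp
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// merges.txt uses '\n' as its line terminator, so rules whose parts are
// themselves newlines (e.g. "\n" + "\n" -> "\n\n") cannot be written in that
// file and get silently dropped. After loading, scan the vocabulary for
// pure-whitespace tokens and synthesize a merge rule for any split whose
// halves are also vocabulary tokens. Vocab type and names are illustrative.
static bool is_pure_whitespace(const std::string& s) {
    if (s.empty()) return false;
    for (char c : s)
        if (c != ' ' && c != '\t' && c != '\n' && c != '\r') return false;
    return true;
}

std::vector<std::pair<std::string, std::string>>
synthesize_whitespace_merges(const std::unordered_map<std::string, int>& vocab) {
    std::vector<std::pair<std::string, std::string>> merges;
    for (const auto& entry : vocab) {
        const std::string& token = entry.first;
        if (token.size() < 2 || !is_pure_whitespace(token)) continue;
        // Take the leftmost split whose halves both exist in the vocab.
        for (size_t i = 1; i < token.size(); ++i) {
            std::string left = token.substr(0, i);
            std::string right = token.substr(i);
            if (vocab.count(left) && vocab.count(right)) {
                merges.emplace_back(std::move(left), std::move(right));
                break;
            }
        }
    }
    return merges;
}
```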


Review comment threads:
  • tests/chat.cpp (outdated)
  • cactus/ffi/cactus_complete.cpp (outdated)
  • cactus/ffi/cactus_complete.cpp
@ncylich
Collaborator Author

ncylich commented Apr 18, 2026

Thanks for the review — all three critiques were valid and I pushed fixes (force-pushed, still DCO-signed).

  1. /image <same_path> not reattaching. Dropped the path-equality guard — image_committed is now unconditionally reset on any /image command, so retyping /image banner.jpg attaches again on the next turn.

  2. KV cache trim off-by-one. You're right that handle->processed_tokens is usually one ahead of kv_cache_.current_seq_len (the last sampled token isn't forwarded yet), so remove_token_range(common, processed_tokens.size() - common) was hitting the start+count > current_seq_len early-exit in engine_cache.cpp:479 and silently no-op'ing. Added a small public Model::get_cache_size() accessor and clamp the removal count to kv_len - common. Verified: on the code-block stress case the WARN now fires and the actual trim lands, so the model reaches the audio question correctly without relying on decode_multimodal's secondary reset.

  3. cactus_reset() missing user_audio_counts clear. Fixed in cactus_init.cpp: cactus_reset() now clears it alongside processed_tokens / processed_images.

Verified end-to-end with tests/test_gemma4_audio_image_audio plus chat manual runs (primary, /clear, /image retype, reset, code-example stress).
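The clamp described in point 2 can be sketched as a small helper. This is an illustrative extraction, assuming the count arithmetic the comment describes; the surrounding call into remove_token_range() and the accessor Model::get_cache_size() are named per the PR description but the structure here is mine.

```cpp
#include <algorithm>
#include <cstddef>

// processed_tokens can run one token ahead of the KV cache, because the last
// sampled token has not been forwarded yet. An unclamped removal count then
// exceeds current_seq_len and remove_token_range() rejects the request,
// silently no-op'ing. Clamp the count to what the cache actually holds.
size_t clamped_trim_count(size_t processed_count, size_t kv_len, size_t common) {
    size_t desired = processed_count - common;            // naive count
    size_t available = (kv_len > common) ? kv_len - common : 0;
    return std::min(desired, available);                  // never exceeds cache
}
```

With processed_tokens one ahead (e.g. 10 processed vs. a KV length of 9), the clamp shaves the count by one so the trim actually lands.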

- Cache audio soft-token counts per user turn in the handle so prior-turn
  audio placeholders render consistently without reloading audio.
- Simplify Gemma4MmModel::decode_multimodal: use kv_cache_.current_seq_len
  as the single source of truth, dispatch to forward_multimodal when the
  delta contains image or audio placeholder tokens.
- Recover BPE whitespace-only merges (e.g. "\n" + "\n") from the
  vocabulary — merges.txt uses \n as its line terminator so it cannot
  encode these rules, but without them encode(decode(tokens)) != tokens
  and the multimodal delta computation gets misaligned.
- chat.cpp: attach current_image only once per /image, not on every
  subsequent turn, so the cache isn't restuffed with duplicate vision
  placeholders (which biased the model toward describing the image).
- cactus_complete: trim the cache to the longest common prefix when the
  new prompt diverges from the cached tokens, so a residual tokenizer
  roundtrip failure gracefully recovers instead of silently corrupting.
- Add tests/test_gemma4_audio_image_audio.cpp reproducing the
  audio -> image+describe -> audio scenario end to end.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
@HenryNdubuaku HenryNdubuaku merged commit 6f1c63d into main Apr 21, 2026
3 of 6 checks passed