Conversation
lgtm @HenryNdubuaku |
Pull request overview
This PR fixes Gemma 4 multi-turn multimodal chat behavior, specifically ensuring that audio requested after an image turn is correctly routed to the model (instead of repeatedly biasing toward image description), and improves tokenizer round-trip stability for whitespace-heavy text.
Changes:
- Prevents `chat.cpp` from re-attaching the same image on every subsequent user turn by committing the current image after first use.
- Simplifies Gemma4 multimodal decode cache bookkeeping by using KV cache sequence length as the source of truth and only invoking multimodal forward when the delta contains media placeholders.
- Restores missing BPE whitespace-only merges by synthesizing merge rules from vocabulary tokens, improving encode/decode round-trip behavior.
- Adds per-user-turn audio soft-token count caching on the model handle and introduces KV-cache divergence recovery by trimming to the longest common prefix.
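The divergence-recovery idea in the last change can be sketched in a few lines. This is an illustrative mock, not the cactus API: `KvCache`, `trim_to`, and `recover_from_divergence` are hypothetical names standing in for whatever the real handle exposes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: trim a KV cache back to the longest common prefix of
// the cached token ids and the newly rendered prompt, so a tokenizer
// roundtrip mismatch degrades to a partial re-prefill instead of silently
// corrupting the cache. Names here are illustrative, not the cactus API.
struct KvCache {
    std::vector<int> tokens;  // token ids currently materialized in the cache
    void trim_to(size_t n) { tokens.resize(n); }
    size_t current_seq_len() const { return tokens.size(); }
};

size_t longest_common_prefix(const std::vector<int>& a, const std::vector<int>& b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}

// Returns the index from which the new prompt must be (re)prefilled.
size_t recover_from_divergence(KvCache& cache, const std::vector<int>& new_prompt) {
    size_t lcp = longest_common_prefix(cache.tokens, new_prompt);
    if (lcp < cache.current_seq_len()) cache.trim_to(lcp);  // drop the diverged tail
    return lcp;
}
```

The key property is that recovery is monotone: the worst case (no common prefix) is a full re-prefill, never a cache that disagrees with the prompt.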
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/chat.cpp | Avoids re-sending the same image placeholder on every turn by tracking whether the image has been “committed”. |
| cactus/models/gemma4/model_gemma4_mm.cpp | Removes prefill_completed_/last_token_count_ and derives incremental decode behavior from kv_cache_.current_seq_len. |
| cactus/models/gemma4/model_gemma4.h | Drops multimodal decode bookkeeping fields from the model class. |
| cactus/ffi/cactus_utils.h | Extends the model handle with user_audio_counts to persist per-user audio placeholder lengths across turns. |
| cactus/ffi/cactus_complete.cpp | Clears user_audio_counts on cache reset, restores historical audio placeholder counts into prompts, and trims KV cache on prompt/token divergence for audio/mixed-media. |
| cactus/engine/engine_bpe.cpp | Synthesizes whitespace-only merge rules from vocabulary tokens to recover merges that can’t be represented in merges.txt. |
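The `engine_bpe.cpp` change can be illustrated with a small sketch. This is not the actual cactus code; `synthesize_whitespace_merges` and the `std::map` vocabulary shape are assumptions chosen for a self-contained example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch (not the cactus engine_bpe.cpp implementation):
// merges.txt uses '\n' both as its line terminator and as potential merge
// content, so a rule like "\n" + "\n" -> "\n\n" cannot be written in that
// file. After loading, synthesize those rules from pure-whitespace tokens.
static bool is_pure_whitespace(const std::string& s) {
    if (s.empty()) return false;
    for (char c : s)
        if (c != ' ' && c != '\n' && c != '\t' && c != '\r') return false;
    return true;
}

// vocab maps token string -> id. Returns synthesized (left, right) merge
// pairs, one per whitespace token whose halves are themselves vocab tokens.
std::vector<std::pair<std::string, std::string>>
synthesize_whitespace_merges(const std::map<std::string, int>& vocab) {
    std::vector<std::pair<std::string, std::string>> merges;
    for (const auto& [tok, id] : vocab) {
        (void)id;
        if (!is_pure_whitespace(tok) || tok.size() < 2) continue;
        for (size_t split = 1; split < tok.size(); ++split) {
            std::string left = tok.substr(0, split), right = tok.substr(split);
            if (vocab.count(left) && vocab.count(right)) {
                merges.emplace_back(left, right);
                break;  // one rule per token is enough to make it reachable
            }
        }
    }
    return merges;
}
```

A real implementation would also have to assign these synthesized rules a rank consistent with the loaded merge table; that detail is omitted here.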
Force-pushed `b1e9a80` to `69c7146`.
Thanks for the review — all three critiques were valid and I pushed fixes (force-pushed, still DCO-signed).
Verified end-to-end with
- Cache audio soft-token counts per user turn in the handle so prior-turn audio placeholders render consistently without reloading audio.
- Simplify `Gemma4MmModel::decode_multimodal`: use `kv_cache_.current_seq_len` as the single source of truth, dispatch to `forward_multimodal` when the delta contains image or audio placeholder tokens.
- Recover BPE whitespace-only merges (e.g. `"\n" + "\n"`) from the vocabulary — merges.txt uses `\n` as its line terminator so it cannot encode these rules, but without them `encode(decode(tokens)) != tokens` and the multimodal delta computation gets misaligned.
- chat.cpp: attach `current_image` only once per `/image`, not on every subsequent turn, so the cache isn't restuffed with duplicate vision placeholders (which biased the model toward describing the image).
- cactus_complete: trim the cache to the longest common prefix when the new prompt diverges from the cached tokens, so a residual tokenizer roundtrip failure gracefully recovers instead of silently corrupting.
- Add `tests/test_gemma4_audio_image_audio.cpp` reproducing the audio -> image+describe -> audio scenario end to end.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
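The decode dispatch rule from the second bullet can be sketched as follows. The placeholder sentinel values and function name are assumptions for illustration, not Gemma's real token ids or the cactus signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of delta-based multimodal dispatch: kv_len (the model's
// kv_cache_.current_seq_len) is the single source of truth for how much of
// the prompt is already prefilled; the multimodal forward pass is needed
// only when the *new* suffix contains media placeholder tokens.
constexpr int kImagePlaceholder = -2;  // assumed sentinel ids for this sketch
constexpr int kAudioPlaceholder = -3;  // (not Gemma's real placeholder ids)

bool delta_needs_multimodal(const std::vector<int>& prompt, size_t kv_len) {
    for (size_t i = kv_len; i < prompt.size(); ++i)
        if (prompt[i] == kImagePlaceholder || prompt[i] == kAudioPlaceholder)
            return true;  // new media in the delta -> forward_multimodal path
    return false;         // text-only delta -> plain incremental decode
}
```

Because the check starts at `kv_len`, media that was already prefilled in an earlier turn no longer forces the multimodal path on later text-only turns.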
Force-pushed `69c7146` to `16b3aba`.
Summary
Fixes multi-turn multimodal chat for Gemma 4 so audio asked after an image turn actually reaches the model (earlier it kept describing the image instead of answering).
What changed
- `decode_multimodal` cleanup. Dropped the `prefill_completed_`/`last_token_count_` bookkeeping (which the text/image decode path was silently invalidating). Now uses `kv_cache_.current_seq_len` as the single source of truth and dispatches to `forward_multimodal` only when the delta actually contains image or audio placeholder tokens.
- `merges.txt` uses `\n` as both line terminator and merge content, so merges like `\n + \n = \n\n` can't be encoded in that format and were silently dropped. Recover them by scanning the vocabulary after load for pure-whitespace tokens and synthesizing the corresponding merge rules. This restores `encode(decode(tokens)) == tokens` for the common markdown/paragraph cases.
- `current_image` was being re-attached to every subsequent user turn, creating a duplicate image placeholder block in the cache and biasing the model toward describing the image. Now it's only attached on the first turn after `/image`.
- Residual roundtrip mismatches are logged at `WARN` so the remaining roundtrip work remains visible.
- `tests/test_gemma4_audio_image_audio.cpp` reproduces the bug end-to-end: feeds `who_are_you.mp3` (audio-only), then `banner.jpg` + "describe this", then `2+2.mp3` with the image still attached, and asserts turn 3's response differs from turn 2's.

Test plan
- `tests/build/test_gemma4_audio_image_audio` passes (turn 3 answers "2 plus 2 equals 4.")
- `chat` manually: primary audio → image+describe → audio sequence answers "4" in 3/3 runs
- `chat` manually: audio → image+describe → text follow-up still describes correctly
- `chat` manually: audio-only multi-turn unchanged ("Four.")
- `chat` manually: `/clear` between image and audio recovers cleanly, no `WARN` log

Run locally: