Conversation
lgtm @HenryNdubuaku |
Pull request overview
This PR fixes Gemma 4 multi-turn multimodal chat behavior, specifically ensuring that audio requested after an image turn is correctly routed to the model (instead of repeatedly biasing toward image description), and improves tokenizer round-trip stability for whitespace-heavy text.
Changes:
- Prevents `chat.cpp` from re-attaching the same image on every subsequent user turn by committing the current image after first use.
- Simplifies Gemma4 multimodal decode cache bookkeeping by using KV cache sequence length as the source of truth and only invoking multimodal forward when the delta contains media placeholders.
- Restores missing BPE whitespace-only merges by synthesizing merge rules from vocabulary tokens, improving encode/decode round-trip behavior.
- Adds per-user-turn audio soft-token count caching on the model handle and introduces KV-cache divergence recovery by trimming to the longest common prefix.
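The divergence-recovery idea in the last change can be sketched in a few lines. This is an illustrative mock, not the cactus API: `KvCache`, `trim_to`, and `recover_from_divergence` are hypothetical names standing in for whatever the real handle exposes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: trim a KV cache back to the longest common prefix of
// the cached token ids and the newly rendered prompt, so a tokenizer
// roundtrip mismatch degrades to a partial re-prefill instead of silently
// corrupting the cache. Names here are illustrative, not the cactus API.
struct KvCache {
    std::vector<int> tokens;  // token ids currently materialized in the cache
    void trim_to(size_t n) { tokens.resize(n); }
    size_t current_seq_len() const { return tokens.size(); }
};

size_t longest_common_prefix(const std::vector<int>& a, const std::vector<int>& b) {
    size_t n = 0;
    while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
    return n;
}

// Returns the index from which the new prompt must be (re)prefilled.
size_t recover_from_divergence(KvCache& cache, const std::vector<int>& new_prompt) {
    size_t lcp = longest_common_prefix(cache.tokens, new_prompt);
    if (lcp < cache.current_seq_len()) cache.trim_to(lcp);  // drop the diverged tail
    return lcp;
}
```

The key property is that recovery is monotone: the worst case (no common prefix) is a full re-prefill, never a cache that disagrees with the prompt.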
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/chat.cpp | Avoids re-sending the same image placeholder on every turn by tracking whether the image has been “committed”. |
| cactus/models/gemma4/model_gemma4_mm.cpp | Removes prefill_completed_/last_token_count_ and derives incremental decode behavior from kv_cache_.current_seq_len. |
| cactus/models/gemma4/model_gemma4.h | Drops multimodal decode bookkeeping fields from the model class. |
| cactus/ffi/cactus_utils.h | Extends the model handle with user_audio_counts to persist per-user audio placeholder lengths across turns. |
| cactus/ffi/cactus_complete.cpp | Clears user_audio_counts on cache reset, restores historical audio placeholder counts into prompts, and trims KV cache on prompt/token divergence for audio/mixed-media. |
| cactus/engine/engine_bpe.cpp | Synthesizes whitespace-only merge rules from vocabulary tokens to recover merges that can’t be represented in merges.txt. |
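The `engine_bpe.cpp` change can be illustrated with a small sketch. This is not the actual cactus code; `synthesize_whitespace_merges` and the `std::map` vocabulary shape are assumptions chosen for a self-contained example.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch (not the cactus engine_bpe.cpp implementation):
// merges.txt uses '\n' both as its line terminator and as potential merge
// content, so a rule like "\n" + "\n" -> "\n\n" cannot be written in that
// file. After loading, synthesize those rules from pure-whitespace tokens.
static bool is_pure_whitespace(const std::string& s) {
    if (s.empty()) return false;
    for (char c : s)
        if (c != ' ' && c != '\n' && c != '\t' && c != '\r') return false;
    return true;
}

// vocab maps token string -> id. Returns synthesized (left, right) merge
// pairs, one per whitespace token whose halves are themselves vocab tokens.
std::vector<std::pair<std::string, std::string>>
synthesize_whitespace_merges(const std::map<std::string, int>& vocab) {
    std::vector<std::pair<std::string, std::string>> merges;
    for (const auto& [tok, id] : vocab) {
        (void)id;
        if (!is_pure_whitespace(tok) || tok.size() < 2) continue;
        for (size_t split = 1; split < tok.size(); ++split) {
            std::string left = tok.substr(0, split), right = tok.substr(split);
            if (vocab.count(left) && vocab.count(right)) {
                merges.emplace_back(left, right);
                break;  // one rule per token is enough to make it reachable
            }
        }
    }
    return merges;
}
```

A real implementation would also have to assign these synthesized rules a rank consistent with the loaded merge table; that detail is omitted here.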
Force-pushed `b1e9a80` to `69c7146`.
Thanks for the review — all three critiques were valid and I pushed fixes (force-pushed, still DCO-signed).
Verified end-to-end with
- Cache audio soft-token counts per user turn in the handle so prior-turn audio placeholders render consistently without reloading audio.
- Simplify `Gemma4MmModel::decode_multimodal`: use `kv_cache_.current_seq_len` as the single source of truth, dispatch to `forward_multimodal` when the delta contains image or audio placeholder tokens.
- Recover BPE whitespace-only merges (e.g. `"\n" + "\n"`) from the vocabulary — merges.txt uses `\n` as its line terminator so it cannot encode these rules, but without them `encode(decode(tokens)) != tokens` and the multimodal delta computation gets misaligned.
- chat.cpp: attach `current_image` only once per `/image`, not on every subsequent turn, so the cache isn't restuffed with duplicate vision placeholders (which biased the model toward describing the image).
- cactus_complete: trim the cache to the longest common prefix when the new prompt diverges from the cached tokens, so a residual tokenizer roundtrip failure gracefully recovers instead of silently corrupting.
- Add `tests/test_gemma4_audio_image_audio.cpp` reproducing the audio -> image+describe -> audio scenario end to end.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
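The decode dispatch rule from the second bullet can be sketched as follows. The placeholder sentinel values and function name are assumptions for illustration, not Gemma's real token ids or the cactus signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of delta-based multimodal dispatch: kv_len (the model's
// kv_cache_.current_seq_len) is the single source of truth for how much of
// the prompt is already prefilled; the multimodal forward pass is needed
// only when the *new* suffix contains media placeholder tokens.
constexpr int kImagePlaceholder = -2;  // assumed sentinel ids for this sketch
constexpr int kAudioPlaceholder = -3;  // (not Gemma's real placeholder ids)

bool delta_needs_multimodal(const std::vector<int>& prompt, size_t kv_len) {
    for (size_t i = kv_len; i < prompt.size(); ++i)
        if (prompt[i] == kImagePlaceholder || prompt[i] == kAudioPlaceholder)
            return true;  // new media in the delta -> forward_multimodal path
    return false;         // text-only delta -> plain incremental decode
}
```

Because the check starts at `kv_len`, media that was already prefilled in an earlier turn no longer forces the multimodal path on later text-only turns.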
Force-pushed `69c7146` to `16b3aba`.
Summary
Fixes multi-turn multimodal chat for Gemma 4 so audio asked after an image turn actually reaches the model (earlier it kept describing the image instead of answering).
What changed
- `decode_multimodal` cleanup. Dropped the `prefill_completed_`/`last_token_count_` bookkeeping (which the text/image decode path was silently invalidating). Now uses `kv_cache_.current_seq_len` as the single source of truth and dispatches to `forward_multimodal` only when the delta actually contains image or audio placeholder tokens.
- `merges.txt` uses `\n` as both line terminator and merge content, so merges like `\n + \n = \n\n` can't be encoded in that format and were silently dropped. Recover them by scanning the vocabulary after load for pure-whitespace tokens and synthesizing the corresponding merge rules. This restores `encode(decode(tokens)) == tokens` for the common markdown/paragraph cases.
- `current_image` was being re-attached to every subsequent user turn, creating a duplicate image placeholder block in the cache and biasing the model toward describing the image. Now it's only attached on the first turn after `/image`.
- Residual roundtrip mismatches are logged at `WARN` so the remaining roundtrip work remains visible.
- `tests/test_gemma4_audio_image_audio.cpp` reproduces the bug end-to-end: feeds `who_are_you.mp3` (audio-only), then `banner.jpg` + "describe this", then `2+2.mp3` with the image still attached, and asserts turn 3's response differs from turn 2's.

Test plan
- `tests/build/test_gemma4_audio_image_audio` passes (turn 3 answers "2 plus 2 equals 4.")
- `chat` manually: primary audio → image+describe → audio sequence answers "4" in 3/3 runs
- `chat` manually: audio → image+describe → text follow-up still describes correctly
- `chat` manually: audio-only multi-turn unchanged ("Four.")
- `chat` manually: `/clear` between image and audio recovers cleanly, no `WARN` log

Run locally: