Open
Conversation
added 2 commits
April 1, 2026 16:36
…tation

Replace the hardcoded IMAGE_EMBED_SIZE constant (1921, Phi3.5-V specific) with dynamic per-model computation:

- Add computeImageEmbedSize() that calculates embed size per model type (Phi3-V from crop shape, others from mm_tokens_per_image in model_config)
- Add BOI/EOI token wrapping around image embeddings for models that require it (supports both boi_token_index and vision_start_token_id)
- Expose model_type and model_config fields in ChatConfig
- Make getInputData() async with parallel image dimension preloading
- Pass dynamic getImageEmbedSize callback to getChunkedPrefillInputData
Add qwen3_5_v model type handling:

- calculateResizeShape: fixed square resize to image_size from model_config
- calculateCropShape: single tile (no tiling)
- computeImageEmbedSize: (image_size / patch_size / spatial_merge_size)^2 = 196
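The qwen3_5_v embed-size formula above can be sketched as follows. The config field names (image_size, patch_size, spatial_merge_size) come from the commit message; the concrete example values (392 / 14 / 2) are illustrative assumptions chosen to yield the 196 tokens mentioned above, not confirmed model_config contents.

```typescript
// Minimal sketch of the qwen3_5_v branch of computeImageEmbedSize().
// Field names follow the commit message; values are assumptions.
interface VisionModelConfig {
  image_size: number;          // side length of the fixed square resize
  patch_size: number;          // ViT patch side length
  spatial_merge_size: number;  // patches merged per side before the LLM
}

function qwenImageEmbedSize(cfg: VisionModelConfig): number {
  // Each side yields image_size / patch_size patches; adjacent patches
  // are then merged spatial_merge_size at a time. Square for 2D.
  const tokensPerSide =
    cfg.image_size / cfg.patch_size / cfg.spatial_merge_size;
  return tokensPerSide * tokensPerSide;
}

// Example: 392 / 14 / 2 = 14 tokens per side, 14^2 = 196 embeddings.
const embedSize = qwenImageEmbedSize({
  image_size: 392,
  patch_size: 14,
  spatial_merge_size: 2,
});
console.log(embedSize); // 196
```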
Code Review
This pull request introduces dynamic image embedding size calculations to support various vision models (e.g., Phi3-V, Qwen3.5-V) by replacing the hardcoded IMAGE_EMBED_SIZE constant with a model-specific computeImageEmbedSize method. It updates the LLMChatPipeline to preload image dimensions, correctly handles BOI/EOI token wrapping, and refactors getChunkedPrefillInputData to accept a dynamic embedding size function. The reviewer suggested that the getEmbedSize closure within getInputData could be cleaner if refactored into a private method.
Comment on lines +2074 to +2080
const getEmbedSize = (image: ImageURL): number => {
  const dims = imageDimensions.get(image.url);
  if (!dims) {
    throw new Error("InternalError: image dimensions not preloaded");
  }
  return this.computeImageEmbedSize(dims[0], dims[1]);
};
mlc-llm registers the model as "qwen3_5_vision" but web-llm was checking for "qwen3_5_v", causing resize/crop/embed dispatch to miss and fall through to the mm_tokens_per_image error.
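The name-mismatch fix can be illustrated with a dispatch sketch. The function and return labels below are hypothetical; in web-llm the real dispatch is spread across calculateResizeShape, calculateCropShape, and computeImageEmbedSize rather than one switch.

```typescript
// Sketch of the model-type dispatch with the corrected name. mlc-llm
// registers "qwen3_5_vision"; matching the stale "qwen3_5_v" string
// fell through to the generic mm_tokens_per_image path and errored.
function selectEmbedStrategy(
  modelType: string
): "phi3_v" | "qwen3_5_vision" | "generic" {
  switch (modelType) {
    case "phi3_v":
      return "phi3_v"; // embed size derived from the crop shape
    case "qwen3_5_vision": // the name mlc-llm actually registers
      return "qwen3_5_vision";
    default:
      return "generic"; // falls back to mm_tokens_per_image
  }
}
```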
Add qwen3_5_v model type handling for vision inference:

- computeImageEmbedSize: (image_size / patch_size / spatial_merge_size)^2 = 196
- calculateResizeShape: fixed square resize to image_size from model_config
- calculateCropShape: single tile (no tiling)

Stacked on #804; merge that first, then this diff will be clean.
See: mlc-ai/mlc-llm#3471