
[VLM] Add Qwen3.5 Vision support#806

Open
gnguralnick wants to merge 3 commits into mlc-ai:main from gnguralnick:qwen35v-vision

Conversation

@gnguralnick

Add qwen3_5_v model type handling for vision inference:

  • computeImageEmbedSize: (image_size / patch_size / spatial_merge_size)^2 = 196
  • calculateResizeShape: fixed square resize to image_size from model_config
  • calculateCropShape: single tile (no tiling)
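The embed-size arithmetic in the first bullet can be sketched as follows. The field names mirror the `model_config` keys this PR reads; the concrete values in the usage comment (image_size 392, patch_size 14, spatial_merge_size 2) are illustrative assumptions chosen to reproduce 196, not values confirmed by this diff.

```typescript
// Sketch of the qwen3_5_v embed-size formula from this PR:
// (image_size / patch_size / spatial_merge_size)^2 = 196.
interface Qwen35VVisionConfig {
  image_size: number;          // side length of the fixed square resize
  patch_size: number;          // ViT patch size
  spatial_merge_size: number;  // patches merged per output token, per side
}

function computeQwen35VEmbedSize(cfg: Qwen35VVisionConfig): number {
  const tokensPerSide =
    cfg.image_size / cfg.patch_size / cfg.spatial_merge_size;
  return tokensPerSide * tokensPerSide;
}

// e.g. { image_size: 392, patch_size: 14, spatial_merge_size: 2 } -> 196
```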

Stacked on #804 — merge that first, then this diff will be clean.

See: mlc-ai/mlc-llm#3471

Gabriel Guralnick added 2 commits April 1, 2026 16:36

Replace the hardcoded IMAGE_EMBED_SIZE constant (1921, Phi3.5-V specific)
with dynamic per-model computation:

- Add computeImageEmbedSize() that calculates embed size per model type
  (Phi3-V from crop shape, others from mm_tokens_per_image in model_config)
- Add BOI/EOI token wrapping around image embeddings for models that
  require it (supports both boi_token_index and vision_start_token_id)
- Expose model_type and model_config fields in ChatConfig
- Make getInputData() async with parallel image dimension preloading
- Pass dynamic getImageEmbedSize callback to getChunkedPrefillInputData
Add qwen3_5_v model type handling:
- calculateResizeShape: fixed square resize to image_size from model_config
- calculateCropShape: single tile (no tiling)
- computeImageEmbedSize: (image_size/patch_size/spatial_merge_size)^2 = 196
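The BOI/EOI wrapping described above might look roughly like this sketch. The config field names follow the two conventions the commit mentions (`boi_token_index` and `vision_start_token_id`); the EOI-side counterparts are assumed to exist symmetrically and are not confirmed by this diff.

```typescript
// Hedged sketch: wrap image embedding tokens in BOI/EOI markers when the
// model config provides them, supporting either naming convention.
interface VisionTokenConfig {
  boi_token_index?: number;
  eoi_token_index?: number;        // assumed counterpart to boi_token_index
  vision_start_token_id?: number;
  vision_end_token_id?: number;    // assumed counterpart to vision_start_token_id
}

function wrapImageTokens(
  cfg: VisionTokenConfig,
  imageTokens: number[],
): number[] {
  const boi = cfg.boi_token_index ?? cfg.vision_start_token_id;
  const eoi = cfg.eoi_token_index ?? cfg.vision_end_token_id;
  // Models that need no wrapping simply omit these fields.
  if (boi === undefined || eoi === undefined) return imageTokens;
  return [boi, ...imageTokens, eoi];
}
```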

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces dynamic image embedding size calculations to support various vision models (e.g., Phi3-V, Qwen3.5-V) by replacing the hardcoded IMAGE_EMBED_SIZE constant with a model-specific computeImageEmbedSize method. It updates the LLMChatPipeline to preload image dimensions, correctly handles BOI/EOI token wrapping, and refactors getChunkedPrefillInputData to accept a dynamic embedding size function. The reviewer suggested that the getEmbedSize closure within getInputData could be cleaner if refactored into a private method.

Comment on lines +2074 to +2080
const getEmbedSize = (image: ImageURL): number => {
  const dims = imageDimensions.get(image.url);
  if (!dims) {
    throw new Error("InternalError: image dimensions not preloaded");
  }
  return this.computeImageEmbedSize(dims[0], dims[1]);
};


Severity: medium

The getEmbedSize function is defined inside getInputData and captures imageDimensions. It would be cleaner to define this as a private method or ensure the closure is necessary. Given the complexity, consider if this logic can be simplified.
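One way to realize the suggestion, assuming the preloaded dimensions were stored on the pipeline instead of being captured in a local `Map` (`ImageURL` and `computeImageEmbedSize` are stand-ins for names in this diff, and the embed-size math below is purely illustrative):

```typescript
// Hedged sketch of the reviewer's suggested refactor: the closure becomes
// a private method reading a pipeline-level dimensions cache.
interface ImageURL {
  url: string;
}

class PipelineSketch {
  // Populated by the async preloading step before prefill (assumption).
  private imageDimensions = new Map<string, [number, number]>();

  // Placeholder math; the real method dispatches on model_type.
  private computeImageEmbedSize(width: number, height: number): number {
    return Math.ceil(width / 14) * Math.ceil(height / 14);
  }

  private getEmbedSize(image: ImageURL): number {
    const dims = this.imageDimensions.get(image.url);
    if (!dims) {
      throw new Error("InternalError: image dimensions not preloaded");
    }
    return this.computeImageEmbedSize(dims[0], dims[1]);
  }
}
```

This keeps `getInputData` shorter and makes the "dimensions must be preloaded" invariant a property of the class rather than of one call site.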

@gnguralnick gnguralnick marked this pull request as ready for review April 1, 2026 23:56
mlc-llm registers the model as "qwen3_5_vision" but web-llm was
checking for "qwen3_5_v", causing resize/crop/embed dispatch to miss
and fall through to the mm_tokens_per_image error.
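The fix amounts to matching the string mlc-llm actually registers. A minimal illustration of the crop-shape dispatch (the function name is a stand-in, and the error message is paraphrased from the behavior described above, not quoted from this diff):

```typescript
// Illustrative dispatch: "qwen3_5_vision" is the model_type string that
// mlc-llm registers; the earlier "qwen3_5_v" check never matched, so
// requests fell through to the error path.
function calculateCropShapeSketch(modelType: string): [number, number] {
  switch (modelType) {
    case "qwen3_5_vision":
      return [1, 1]; // single tile, no tiling (per this PR)
    default:
      throw new Error(
        `Cannot determine mm_tokens_per_image for model type: ${modelType}`,
      );
  }
}
```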
