
[VLM] Add Qwen3.5 Vision support#806

Open
gnguralnick wants to merge 3 commits into mlc-ai:main from gnguralnick:qwen35v-vision

Conversation

@gnguralnick

Add qwen3_5_v model type handling for vision inference:

  • computeImageEmbedSize: (image_size / patch_size / spatial_merge_size)^2 = 196
  • calculateResizeShape: fixed square resize to image_size from model_config
  • calculateCropShape: single tile (no tiling)
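The embed-size arithmetic in the first bullet can be sketched as follows. The field names mirror the `model_config` keys this PR reads; the concrete values in the usage comment (image_size 392, patch_size 14, spatial_merge_size 2) are illustrative assumptions chosen to reproduce 196, not values confirmed by this diff.

```typescript
// Sketch of the qwen3_5_v embed-size formula from this PR:
// (image_size / patch_size / spatial_merge_size)^2 = 196.
interface Qwen35VVisionConfig {
  image_size: number;          // side length of the fixed square resize
  patch_size: number;          // ViT patch size
  spatial_merge_size: number;  // patches merged per output token, per side
}

function computeQwen35VEmbedSize(cfg: Qwen35VVisionConfig): number {
  const tokensPerSide =
    cfg.image_size / cfg.patch_size / cfg.spatial_merge_size;
  return tokensPerSide * tokensPerSide;
}

// e.g. { image_size: 392, patch_size: 14, spatial_merge_size: 2 } -> 196
```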

Stacked on #804 — merge that first, then this diff will be clean.

See: mlc-ai/mlc-llm#3471

Gabriel Guralnick added 2 commits April 1, 2026 16:36

Replace the hardcoded IMAGE_EMBED_SIZE constant (1921, Phi3.5-V specific)
with dynamic per-model computation:

- Add computeImageEmbedSize() that calculates embed size per model type
  (Phi3-V from crop shape, others from mm_tokens_per_image in model_config)
- Add BOI/EOI token wrapping around image embeddings for models that
  require it (supports both boi_token_index and vision_start_token_id)
- Expose model_type and model_config fields in ChatConfig
- Make getInputData() async with parallel image dimension preloading
- Pass dynamic getImageEmbedSize callback to getChunkedPrefillInputData
Add qwen3_5_v model type handling:
- calculateResizeShape: fixed square resize to image_size from model_config
- calculateCropShape: single tile (no tiling)
- computeImageEmbedSize: (image_size/patch_size/spatial_merge_size)^2 = 196
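The BOI/EOI wrapping described above might look roughly like this sketch. The config field names follow the two conventions the commit mentions (`boi_token_index` and `vision_start_token_id`); the EOI-side counterparts are assumed to exist symmetrically and are not confirmed by this diff.

```typescript
// Hedged sketch: wrap image embedding tokens in BOI/EOI markers when the
// model config provides them, supporting either naming convention.
interface VisionTokenConfig {
  boi_token_index?: number;
  eoi_token_index?: number;        // assumed counterpart to boi_token_index
  vision_start_token_id?: number;
  vision_end_token_id?: number;    // assumed counterpart to vision_start_token_id
}

function wrapImageTokens(
  cfg: VisionTokenConfig,
  imageTokens: number[],
): number[] {
  const boi = cfg.boi_token_index ?? cfg.vision_start_token_id;
  const eoi = cfg.eoi_token_index ?? cfg.vision_end_token_id;
  // Models that need no wrapping simply omit these fields.
  if (boi === undefined || eoi === undefined) return imageTokens;
  return [boi, ...imageTokens, eoi];
}
```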

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces dynamic image embedding size calculations to support various vision models (e.g., Phi3-V, Qwen3.5-V) by replacing the hardcoded IMAGE_EMBED_SIZE constant with a model-specific computeImageEmbedSize method. It updates the LLMChatPipeline to preload image dimensions, correctly handles BOI/EOI token wrapping, and refactors getChunkedPrefillInputData to accept a dynamic embedding size function. The reviewer suggested that the getEmbedSize closure within getInputData could be cleaner if refactored into a private method.

Comment on lines +2074 to +2080
const getEmbedSize = (image: ImageURL): number => {
  const dims = imageDimensions.get(image.url);
  if (!dims) {
    throw new Error("InternalError: image dimensions not preloaded");
  }
  return this.computeImageEmbedSize(dims[0], dims[1]);
};


Severity: medium

The getEmbedSize function is defined inside getInputData and captures imageDimensions. It would be cleaner to define this as a private method or ensure the closure is necessary. Given the complexity, consider if this logic can be simplified.
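One way to realize the suggestion, assuming the preloaded dimensions were stored on the pipeline instead of being captured in a local `Map` (`ImageURL` and `computeImageEmbedSize` are stand-ins for names in this diff, and the embed-size math below is purely illustrative):

```typescript
// Hedged sketch of the reviewer's suggested refactor: the closure becomes
// a private method reading a pipeline-level dimensions cache.
interface ImageURL {
  url: string;
}

class PipelineSketch {
  // Populated by the async preloading step before prefill (assumption).
  private imageDimensions = new Map<string, [number, number]>();

  // Placeholder math; the real method dispatches on model_type.
  private computeImageEmbedSize(width: number, height: number): number {
    return Math.ceil(width / 14) * Math.ceil(height / 14);
  }

  private getEmbedSize(image: ImageURL): number {
    const dims = this.imageDimensions.get(image.url);
    if (!dims) {
      throw new Error("InternalError: image dimensions not preloaded");
    }
    return this.computeImageEmbedSize(dims[0], dims[1]);
  }
}
```

This keeps `getInputData` shorter and makes the "dimensions must be preloaded" invariant a property of the class rather than of one call site.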

@gnguralnick gnguralnick marked this pull request as ready for review April 1, 2026 23:56
mlc-llm registers the model as "qwen3_5_vision" but web-llm was
checking for "qwen3_5_v", causing resize/crop/embed dispatch to miss
and fall through to the mm_tokens_per_image error.
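The fix amounts to matching the string mlc-llm actually registers. A minimal illustration of the crop-shape dispatch (the function name is a stand-in, and the error message is paraphrased from the behavior described above, not quoted from this diff):

```typescript
// Illustrative dispatch: "qwen3_5_vision" is the model_type string that
// mlc-llm registers; the earlier "qwen3_5_v" check never matched, so
// requests fell through to the error path.
function calculateCropShapeSketch(modelType: string): [number, number] {
  switch (modelType) {
    case "qwen3_5_vision":
      return [1, 1]; // single tile, no tiling (per this PR)
    default:
      throw new Error(
        `Cannot determine mm_tokens_per_image for model type: ${modelType}`,
      );
  }
}
```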
