
Add GLM-4 reasoning parser and fix think tag / prefix cache bugs#295

Draft
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feat/glm4-reasoning-parser

Conversation

@janhilgard
Collaborator

Summary

  • Add Glm4ReasoningParser (--reasoning-parser glm4) for GLM-4 models (GLM-4.5-Air, GLM-4.7) that use <think>...</think> tags but don't inject <think> into the prompt. Unlike Qwen3, where missing tags mean truncated reasoning, missing tags in GLM-4 output mean the whole output is normal content. A streaming override emits pre-think deltas as content.
  • Fix duplicate </think> tag in BaseThinkingReasoningParser: some models generate <think></think></think> with an extra closing tag that leaked into content.
  • Fix reasoning extraction with tool calls: always parse reasoning from original output.text so <think> content is preserved even when tool calls are present.
  • Remove <tool_call> from SPECIAL_TOKENS_PATTERN: these are structural tags used by GLM-4's tool calling format (--tool-call-parser glm47) and must not be stripped.
  • Fix prefix cache corruption on exact match: trim stored cache by 1 in _prompt_cache_save so the last prompt token is reprocessed at the correct position on cache hit.
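
The no-tags rule in the first bullet can be sketched as follows. This is an illustrative assumption, not the actual Glm4ReasoningParser code: the function name and regex are invented for the example.

```python
import re

# Hypothetical sketch of the GLM-4 extraction rule: absent <think> tags
# mean the whole output is normal content, not truncated reasoning.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def extract_reasoning_glm4(text: str):
    """Split GLM-4 output into (reasoning, content)."""
    m = THINK_RE.search(text)
    if m is None:
        return None, text            # no tags: plain content (GLM-4 rule)
    reasoning = m.group(1)
    content = text[m.end():].lstrip("\n")
    return reasoning, content
```

For contrast, a Qwen3-style parser would treat the tagless case as reasoning that was cut off before the closing tag.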

Test plan

  • Basic text generation (reasoning + content correctly separated)
  • Tool calling with reasoning (tool_calls + reasoning_content in response)
  • Streaming (reasoning chunks → content chunks transition)
  • Prefix cache exact hit (identical output on repeated prompts)
  • KV cache quantization Q8 (no degradation)
  • Verified Qwen3.5-122B (MLLM path) unaffected by scheduler.py fix

Tested with GLM-4.5-Air (106B/12B MoE, mixed 6/8-bit) on Apple M3 Ultra, ~37 tok/s.

🤖 Generated with Claude Code

- Add Glm4ReasoningParser for GLM-4 models (GLM-4.5-Air, GLM-4.7)
  that use <think>...</think> tags but don't inject <think> in the
  prompt — output without tags is normal content, not truncated
  reasoning. Streaming override emits pre-think deltas as content.
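
  The streaming override described above can be sketched like this. The class and method names are assumptions for illustration, and the sketch assumes each think tag arrives unsplit in a single delta (which holds when <think> is a single special token):

  ```python
  class PreThinkStreamRouter:
      """Route streamed deltas to 'content' until a <think> tag appears,
      then to 'reasoning' until it closes. Minimal sketch, names assumed."""

      def __init__(self):
          self.in_think = False

      def route(self, delta: str):
          # Tags themselves are swallowed; everything before <think> is
          # content, because GLM-4 may skip the reasoning block entirely.
          if delta == "<think>":
              self.in_think = True
              return None
          if delta == "</think>":
              self.in_think = False
              return None
          return ("reasoning", delta) if self.in_think else ("content", delta)
  ```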

- Fix duplicate </think> tag handling in BaseThinkingReasoningParser:
  some models (e.g. GLM-4.5-Air) generate <think></think></think>
  with an extra closing tag that leaked into content.
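
  The stray-tag cleanup can be sketched as follows (a hypothetical helper, not the actual BaseThinkingReasoningParser code):

  ```python
  def strip_stray_close_tags(content: str, end_tag: str = "</think>") -> str:
      # Some models emit an extra closing tag after the think block ends;
      # drop any leading stray closers so they don't leak into content.
      while content.lstrip().startswith(end_tag):
          content = content.lstrip()[len(end_tag):]
      return content.lstrip("\n")
  ```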

- Fix reasoning extraction with tool calls in server.py: always
  parse reasoning from original output.text so <think> content is
  preserved even when tool calls are present.
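
  The ordering fix amounts to the following. This is a sketch under assumed parser interfaces, not the actual server.py code:

  ```python
  def build_message(raw_text, parse_reasoning, parse_tool_calls):
      # Extract reasoning from the ORIGINAL model output first, so the
      # <think> block survives even when a tool-call parser consumes
      # the remainder of the text.
      reasoning, content = parse_reasoning(raw_text)
      tool_calls, content = parse_tool_calls(content)
      return {
          "reasoning_content": reasoning,
          "content": content or None,
          "tool_calls": tool_calls,
      }
  ```

  Parsing tool calls first would hand the tool-call parser text that still contains the think block, or discard the reasoning along with the consumed text.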

- Remove <tool_call> tags from SPECIAL_TOKENS_PATTERN in utils.py:
  these are structural tags used by GLM-4's tool calling format
  and must not be stripped as special tokens.
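
  Illustration of the pattern change (the real SPECIAL_TOKENS_PATTERN in utils.py may differ; these regexes are assumptions):

  ```python
  import re

  # Before: <tool_call> tags were stripped along with special tokens.
  OLD = re.compile(r"<\|[^|]+\|>|</?tool_call>")
  # After: only <|...|> special tokens are stripped; <tool_call> tags are
  # structural markers for GLM-4's tool-call format and must survive.
  NEW = re.compile(r"<\|[^|]+\|>")

  text = '<|im_end|><tool_call>{"name": "get_time"}</tool_call>'
  ```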

- Fix prefix cache corruption on exact match in scheduler.py:
  trim stored cache by 1 so the last prompt token is reprocessed
  at the correct position on cache hit.
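
  The trim-by-one idea, sketched with an assumed signature (not the actual scheduler.py code):

  ```python
  def _prompt_cache_save(prompt_cache: dict, tokens: list) -> None:
      # Store one token fewer than the full prompt, so an exact repeat of
      # the prompt still leaves the last token to be fed through the model
      # at the correct position instead of reusing a stale cached state.
      if len(tokens) > 1:
          prompt_cache[tuple(tokens[:-1])] = len(tokens) - 1  # cached length

  def tokens_to_process(prompt_len: int, cached_len: int) -> int:
      # With the trim above, a cache hit always leaves >= 1 token, which
      # is what the model needs to emit logits for the next position.
      return prompt_len - min(cached_len, prompt_len - 1)
  ```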

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard force-pushed the feat/glm4-reasoning-parser branch from 7b05a96 to c32db4e on April 12, 2026 20:56