
Add GLM-4 reasoning parser and fix think tag / prefix cache bugs#295

Draft
janhilgard wants to merge 1 commit into waybarrios:main from janhilgard:feat/glm4-reasoning-parser

Conversation

@janhilgard
Collaborator

Summary

  • Add Glm4ReasoningParser (--reasoning-parser glm4) for GLM-4 models (GLM-4.5-Air, GLM-4.7) that use <think>...</think> tags but don't inject <think> into the prompt. Unlike Qwen3, where missing tags mean truncated reasoning, missing tags in GLM-4 output mean the whole output is normal content. A streaming override emits pre-think deltas as content.
  • Fix duplicate </think> tag in BaseThinkingReasoningParser: some models generate <think></think></think> with an extra closing tag that leaked into content.
  • Fix reasoning extraction with tool calls: always parse reasoning from original output.text so <think> content is preserved even when tool calls are present.
  • Remove <tool_call> from SPECIAL_TOKENS_PATTERN: these are structural tags used by GLM-4's tool calling format (--tool-call-parser glm47) and must not be stripped.
  • Fix prefix cache corruption on exact match: trim stored cache by 1 in _prompt_cache_save so the last prompt token is reprocessed at the correct position on cache hit.
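
The no-tags rule in the first bullet can be sketched as follows. This is an illustrative assumption, not the actual Glm4ReasoningParser code: the function name and regex are invented for the example.

```python
import re

# Hypothetical sketch of the GLM-4 extraction rule: absent <think> tags
# mean the whole output is normal content, not truncated reasoning.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def extract_reasoning_glm4(text: str):
    """Split GLM-4 output into (reasoning, content)."""
    m = THINK_RE.search(text)
    if m is None:
        return None, text            # no tags: plain content (GLM-4 rule)
    reasoning = m.group(1)
    content = text[m.end():].lstrip("\n")
    return reasoning, content
```

For contrast, a Qwen3-style parser would treat the tagless case as reasoning that was cut off before the closing tag.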

Test plan

  • Basic text generation (reasoning + content correctly separated)
  • Tool calling with reasoning (tool_calls + reasoning_content in response)
  • Streaming (reasoning chunks → content chunks transition)
  • Prefix cache exact hit (identical output on repeated prompts)
  • KV cache quantization Q8 (no degradation)
  • Verified Qwen3.5-122B (MLLM path) unaffected by scheduler.py fix

Tested with GLM-4.5-Air (106B/12B MoE, mixed 6/8-bit) on Apple M3 Ultra, ~37 tok/s.

🤖 Generated with Claude Code

- Add Glm4ReasoningParser for GLM-4 models (GLM-4.5-Air, GLM-4.7)
  that use <think>...</think> tags but don't inject <think> in the
  prompt — output without tags is normal content, not truncated
  reasoning. Streaming override emits pre-think deltas as content.
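
  The streaming override described above can be sketched like this. The class and method names are assumptions for illustration, and the sketch assumes each think tag arrives unsplit in a single delta (which holds when <think> is a single special token):

  ```python
  class PreThinkStreamRouter:
      """Route streamed deltas to 'content' until a <think> tag appears,
      then to 'reasoning' until it closes. Minimal sketch, names assumed."""

      def __init__(self):
          self.in_think = False

      def route(self, delta: str):
          # Tags themselves are swallowed; everything before <think> is
          # content, because GLM-4 may skip the reasoning block entirely.
          if delta == "<think>":
              self.in_think = True
              return None
          if delta == "</think>":
              self.in_think = False
              return None
          return ("reasoning", delta) if self.in_think else ("content", delta)
  ```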

- Fix duplicate </think> tag handling in BaseThinkingReasoningParser:
  some models (e.g. GLM-4.5-Air) generate <think></think></think>
  with an extra closing tag that leaked into content.
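
  The stray-tag cleanup can be sketched as follows (a hypothetical helper, not the actual BaseThinkingReasoningParser code):

  ```python
  def strip_stray_close_tags(content: str, end_tag: str = "</think>") -> str:
      # Some models emit an extra closing tag after the think block ends;
      # drop any leading stray closers so they don't leak into content.
      while content.lstrip().startswith(end_tag):
          content = content.lstrip()[len(end_tag):]
      return content.lstrip("\n")
  ```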

- Fix reasoning extraction with tool calls in server.py: always
  parse reasoning from original output.text so <think> content is
  preserved even when tool calls are present.
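
  The ordering fix amounts to the following. This is a sketch under assumed parser interfaces, not the actual server.py code:

  ```python
  def build_message(raw_text, parse_reasoning, parse_tool_calls):
      # Extract reasoning from the ORIGINAL model output first, so the
      # <think> block survives even when a tool-call parser consumes
      # the remainder of the text.
      reasoning, content = parse_reasoning(raw_text)
      tool_calls, content = parse_tool_calls(content)
      return {
          "reasoning_content": reasoning,
          "content": content or None,
          "tool_calls": tool_calls,
      }
  ```

  Parsing tool calls first would hand the tool-call parser text that still contains the think block, or discard the reasoning along with the consumed text.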

- Remove <tool_call> tags from SPECIAL_TOKENS_PATTERN in utils.py:
  these are structural tags used by GLM-4's tool calling format
  and must not be stripped as special tokens.
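
  Illustration of the pattern change (the real SPECIAL_TOKENS_PATTERN in utils.py may differ; these regexes are assumptions):

  ```python
  import re

  # Before: <tool_call> tags were stripped along with special tokens.
  OLD = re.compile(r"<\|[^|]+\|>|</?tool_call>")
  # After: only <|...|> special tokens are stripped; <tool_call> tags are
  # structural markers for GLM-4's tool-call format and must survive.
  NEW = re.compile(r"<\|[^|]+\|>")

  text = '<|im_end|><tool_call>{"name": "get_time"}</tool_call>'
  ```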

- Fix prefix cache corruption on exact match in scheduler.py:
  trim stored cache by 1 so the last prompt token is reprocessed
  at the correct position on cache hit.
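
  The trim-by-one idea, sketched with an assumed signature (not the actual scheduler.py code):

  ```python
  def _prompt_cache_save(prompt_cache: dict, tokens: list) -> None:
      # Store one token fewer than the full prompt, so an exact repeat of
      # the prompt still leaves the last token to be fed through the model
      # at the correct position instead of reusing a stale cached state.
      if len(tokens) > 1:
          prompt_cache[tuple(tokens[:-1])] = len(tokens) - 1  # cached length

  def tokens_to_process(prompt_len: int, cached_len: int) -> int:
      # With the trim above, a cache hit always leaves >= 1 token, which
      # is what the model needs to emit logits for the next position.
      return prompt_len - min(cached_len, prompt_len - 1)
  ```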

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@janhilgard force-pushed the feat/glm4-reasoning-parser branch from 7b05a96 to c32db4e on April 12, 2026 20:56