fix: overhaul GLM-4.7-Flash streaming tool calls and add GLM4 reasoning parser #246

Open
b2ornot2b wants to merge 4 commits into waybarrios:main from b2ornot2b:fix/glm4-streaming-tool-calls

@b2ornot2b

Summary

This PR fixes the broken streaming tool-calling implementation for GLM-4.7-Flash and introduces a dedicated GLM4ReasoningParser to handle the model's reasoning/thinking behavior.

The Problem

The previous implementation failed in streaming mode due to:

  1. Server Loop Skipping: The streaming loop in server.py had a greedy continue that skipped tool parsing whenever the reasoning parser returned None (e.g., during transitions).
  2. Token Fragmentation: Fragile string matching ("</tool_call>" in delta_text) failed when BPE tokens arrived in separate chunks.
  3. State Management: Lack of persistent state in the tool parser caused loss of metadata (id, type, name) in final chunks.
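
The token fragmentation failure in item 2 can be reproduced in isolation. The sketch below is illustrative only: `naive_detect` and `buffered_detect` are hypothetical helper names, not functions from this PR, and the chunk boundaries are invented to show a closing tag split across deltas.

```python
END_TAG = "</tool_call>"


def naive_detect(chunks):
    """Per-chunk substring check: misses a tag that is split across chunks."""
    return any(END_TAG in chunk for chunk in chunks)


def buffered_detect(chunks):
    """Accumulate the stream and search the buffer instead of each delta."""
    buffer = ""
    last_parsed_index = 0
    for chunk in chunks:
        buffer += chunk
        # Back up far enough to catch a tag that straddles the previous boundary.
        start = max(0, last_parsed_index - len(END_TAG) + 1)
        if buffer.find(END_TAG, start) != -1:
            return True
        last_parsed_index = len(buffer)
    return False


# Hypothetical chunking: the closing tag arrives in two pieces.
chunks = ["<tool_call>get_weather", "</tool", "_call>"]
print(naive_detect(chunks), buffered_detect(chunks))
```

With the tag split over two deltas, the per-chunk check never sees the full string, while the buffered search does.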

The Solution

  1. Server Overhaul (server.py): Removed the greedy continue and implemented strict fall-through logic. The tool parser is now always evaluated, ensuring transitions from reasoning to tool calling are captured.
  2. State-Machine Tool Parser (glm47_tool_parser.py): Replaced string matching with an index-driven state machine. It tracks last_parsed_index and maintains explicit states (PARSING_NAME, PARSING_ARGUMENTS), allowing it to handle fragmented tokens and stream arguments incrementally.
  3. GLM4 Reasoning Parser: Added a dedicated parser to correctly handle GLM-4's specific reasoning tags and prompt injection behavior.
  4. Zero-Argument Fix: Fixed a crash when tool calls contained no arguments.
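
To make the index-driven approach concrete, here is a minimal sketch of such a state machine. It is not the code in glm47_tool_parser.py: the class name `StreamingToolParser`, the `IDLE` state, and the simplified wire format (`<tool_call>NAME`, a newline, raw argument text, `</tool_call>`) are assumptions for illustration. For brevity it buffers arguments until the closing tag rather than streaming them incrementally.

```python
from enum import Enum, auto

START, END = "<tool_call>", "</tool_call>"


class State(Enum):
    IDLE = auto()
    PARSING_NAME = auto()
    PARSING_ARGUMENTS = auto()


class StreamingToolParser:
    """Hypothetical sketch: parses an assumed, simplified tool-call format."""

    def __init__(self):
        self.buffer = ""
        self.last_parsed_index = 0  # everything before this is consumed
        self.state = State.IDLE
        self.current_name = None

    def feed(self, delta_text):
        """Append a delta and return the tool calls completed by it."""
        self.buffer += delta_text
        completed = []
        while self._step(completed):
            pass
        return completed

    def _finish(self, arguments, consumed, completed):
        completed.append({"name": self.current_name, "arguments": arguments})
        self.last_parsed_index += consumed
        self.current_name = None
        self.state = State.IDLE

    def _step(self, completed):
        """Advance by one transition; return False when more input is needed."""
        rest = self.buffer[self.last_parsed_index:]
        end = rest.find(END)
        if self.state is State.IDLE:
            start = rest.find(START)
            if start == -1:
                return False
            self.last_parsed_index += start + len(START)
            self.state = State.PARSING_NAME
            return True
        if self.state is State.PARSING_NAME:
            newline = rest.find("\n")
            if end != -1 and (newline == -1 or end < newline):
                # Zero-argument call: the name is followed directly by END.
                self.current_name = rest[:end].strip()
                self._finish("", end + len(END), completed)
                return True
            if newline == -1:
                return False
            self.current_name = rest[:newline].strip()
            self.last_parsed_index += newline + 1
            self.state = State.PARSING_ARGUMENTS
            return True
        # State.PARSING_ARGUMENTS: wait until the full closing tag is buffered.
        if end == -1:
            return False
        self._finish(rest[:end].strip(), end + len(END), completed)
        return True


def run(chunks):
    parser = StreamingToolParser()
    calls = []
    for chunk in chunks:
        calls.extend(parser.feed(chunk))
    return calls
```

Because the parser only advances `last_parsed_index` after a complete delimiter is in the buffer, a tag split across deltas simply leaves the state unchanged until the rest arrives. The name and state persist on the instance, so they are still available when the final chunk closes the call.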

Key Changes

  • vllm_mlx/server.py: Modified streaming loop for reliable tool call detection.
  • vllm_mlx/tool_parsers/glm47_tool_parser.py: Complete rewrite using a robust state machine.
  • vllm_mlx/reasoning/glm4_parser.py: New parser for GLM-4 reasoning.
  • vllm_mlx/reasoning/__init__.py: Registered the new parser.
  • tests/test_tool_parsers.py & tests/test_reasoning_parser.py: Added unit tests for the new logic.

Verification

  • Verified with mlx-community/GLM-4.7-Flash-bf16.
  • Both non-streaming and streaming tool calls (including parallel calls) now work correctly.
  • Reasoning content is correctly identified and separated from tool calls.

The function was missing a return statement after successfully loading
the model with mlx_lm.load(), causing None to be returned and
unpacking to fail with TypeError.
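
A minimal reproduction of that failure mode, as a sketch: the function names `load_model_buggy` and `load_model_fixed` are hypothetical (the commit message does not name the function), and the placeholder objects stand in for the values returned by mlx_lm.load().

```python
def load_model_buggy(path):
    # Stand-ins for the (model, tokenizer) pair that a real loader would produce.
    model, tokenizer = object(), object()
    # Bug: no return statement, so the caller receives None.


def load_model_fixed(path):
    model, tokenizer = object(), object()
    return model, tokenizer


def unpack(loader):
    """Return the exception raised while unpacking, or None on success."""
    try:
        model, tokenizer = loader("some/model/path")
    except TypeError as exc:
        return exc
    return None


print(repr(unpack(load_model_buggy)))
print(repr(unpack(load_model_fixed)))
```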
Collaborator

@Thump604 Thump604 left a comment

Thanks for fixing GLM-4.7 streaming tool calls. The state-machine approach is the right call, and the server.py restructuring addresses a real problem (tool parsing unreachable when reasoning parser is active). A few things to address before this is ready:

  1. Dead code in glm47_tool_parser.py: After the return None at line 705 (end of the while loop), there are ~200 lines of unreachable code (a second state-machine implementation). This includes a recursive call and references to self.current_arguments which is never initialized. Looks like an earlier draft that was left in. Please remove it.

  2. server.py tool parsing input: The tool parser receives delta_text as input (line ~2062), but when a reasoning parser is active, delta_text still contains raw text with <think>/</think> tags. The tool parser should receive the content portion (after reasoning extraction), not the raw delta. Otherwise the tool parser may misparse reasoning tags.

  3. __all__ in reasoning/__init__.py: Qwen3ReasoningParser and DeepSeekR1ReasoningParser are added to __all__ but not imported at module level. This will break from vllm_mlx.reasoning import Qwen3ReasoningParser. Either add the imports or remove them from __all__.

  4. GLM4ReasoningParser is a no-op: The extract_reasoning override is identical to the base class (BaseThinkingReasoningParser) implementation. If the only purpose is to register under the name "glm4", you could just do register_parser("glm4", BaseThinkingReasoningParser) or create a minimal subclass without overriding anything. Not a blocker, but simplifies things.
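
For item 3, a mismatch between `__all__` and the names a module actually defines can be caught with a small check. This is a sketch under assumptions: `missing_exports` is a hypothetical helper, and `fake_reasoning` is a stand-in module built in memory, not vllm_mlx.reasoning itself. If the real package resolves names lazily (for example through a module-level `__getattr__`), the result may differ.

```python
import sys
import types


def missing_exports(module):
    """Return names listed in __all__ that the module does not define."""
    return [
        name
        for name in getattr(module, "__all__", [])
        if not hasattr(module, name)
    ]


# Hypothetical stand-in: one exported name is defined, one is not.
fake = types.ModuleType("fake_reasoning")
fake.GLM4ReasoningParser = type("GLM4ReasoningParser", (), {})
fake.__all__ = ["GLM4ReasoningParser", "Qwen3ReasoningParser"]
sys.modules["fake_reasoning"] = fake


def star_import_error():
    """Return the exception raised by a star import, or None if it succeeds."""
    try:
        exec("from fake_reasoning import *", {})
    except AttributeError as exc:
        return exc
    return None


print(missing_exports(fake))
print(repr(star_import_error()))
```

A check like `missing_exports` could run as a unit test so that stale `__all__` entries fail CI rather than surfacing at import time.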

@waybarrios
Owner

@b2ornot2b hey, are you still working on this? There are a few things that need to be addressed before we can merge (dead code after line 705, __all__ exports, GLM4 parser simplification, etc. — see @Thump604's review above).

If you need help or don't have time to finish it, let me know — I'm happy to pick up the remaining changes and get this merged.

@b2ornot2b
Author

b2ornot2b commented Apr 11, 2026 via email
