fix: overhaul GLM-4.7-Flash streaming tool calls and add GLM4 reasoning parser#246
fix: overhaul GLM-4.7-Flash streaming tool calls and add GLM4 reasoning parser#246b2ornot2b wants to merge 4 commits intowaybarrios:mainfrom
Conversation
The function was missing a return statement after successfully loading the model with mlx_lm.load(), causing None to be returned and unpacking to fail with TypeError.
Thump604
left a comment
There was a problem hiding this comment.
Thanks for fixing GLM-4.7 streaming tool calls. The state-machine approach is the right call, and the server.py restructuring addresses a real problem (tool parsing unreachable when reasoning parser is active). A few things to address before this is ready:
-
Dead code in
glm47_tool_parser.py: After thereturn Noneat line 705 (end of the while loop), there are ~200 lines of unreachable code (a second state-machine implementation). This includes a recursive call and references toself.current_argumentswhich is never initialized. Looks like an earlier draft that was left in. Please remove it. -
server.pytool parsing input: The tool parser receivesdelta_textas input (line ~2062), but when a reasoning parser is active,delta_textstill contains raw text with<think>/</think>tags. The tool parser should receive thecontentportion (after reasoning extraction), not the raw delta. Otherwise the tool parser may misparse reasoning tags. -
__all__inreasoning/__init__.py:Qwen3ReasoningParserandDeepSeekR1ReasoningParserare added to__all__but not imported at module level. This will breakfrom vllm_mlx.reasoning import Qwen3ReasoningParser. Either add the imports or remove them from__all__. -
GLM4ReasoningParser is a no-op: The
extract_reasoningoverride is identical to the base class (BaseThinkingReasoningParser) implementation. If the only purpose is to register under the name"glm4", you could just doregister_parser("glm4", BaseThinkingReasoningParser)or create a minimal subclass without overriding anything. Not a blocker, but simplifies things.
|
@b2ornot2b hey, are you still working on this? There are a few things that need to be addressed before we can merge (dead code after line 705, If you need help or don't have time to finish it, let me know — I'm happy to pick up the remaining changes and get this merged. |
|
Would appreciate if you could pick up the remaining changes. Am travelling,
and will not be able to get back to this or my test setup.
…On Sat, 11 Apr 2026 at 7:48 AM, Wayner Barrios ***@***.***> wrote:
*waybarrios* left a comment (waybarrios/vllm-mlx#246)
<#246 (comment)>
@b2ornot2b <https://github.com/b2ornot2b> hey, are you still working on
this? There are a few things that need to be addressed before we can merge
(dead code after line 705, __all__ exports, GLM4 parser simplification,
etc. — see @Thump604 <https://github.com/Thump604>'s review above).
If you need help or don't have time to finish it, let me know — I'm happy
to pick up the remaining changes and get this merged.
—
Reply to this email directly, view it on GitHub
<#246 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAEXPXPP4IJHDZVMTZYHZO34VGTO5AVCNFSM6AAAAACXJ554J2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHM2DEMRXHAZTGMBXG4>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
Summary
This PR fixes the broken streaming tool-calling implementation for GLM-4.7-Flash and introduces a dedicated
GLM4ReasoningParserto handle the model's reasoning/thinking behavior.The Problem
The previous implementation failed in streaming mode due to:
server.pyhad a greedycontinuethat skipped tool parsing whenever the reasoning parser returnedNone(e.g., during transitions)."</tool_call>" in delta_text) failed when BPE tokens arrived in separate chunks.The Solution
server.py): Removed the greedycontinueand implemented strict fall-through logic. The tool parser is now always evaluated, ensuring transitions from reasoning to tool calling are captured.glm47_tool_parser.py): Replaced string matching with an index-driven state machine. It trackslast_parsed_indexand maintains explicit states (PARSING_NAME,PARSING_ARGUMENTS), allowing it to handle fragmented tokens and stream arguments incrementally.Key Changes
vllm_mlx/server.py: Modified streaming loop for reliable tool call detection.vllm_mlx/tool_parsers/glm47_tool_parser.py: Complete rewrite using a robust state machine.vllm_mlx/reasoning/glm4_parser.py: New parser for GLM-4 reasoning.vllm_mlx/reasoning/__init__.py: Registered the new parser.tests/test_tool_parsers.py&tests/test_reasoning_parser.py: Added unit tests for the new logic.Verification
mlx-community/GLM-4.7-Flash-bf16.