simple-engine: keep tool chat on the streaming execution path #222
krystophny wants to merge 4 commits into waybarrios:main
Conversation
@waybarrios sorry for the PR spam. We are intensively testing a lot of LLMs with codex and opencode locally on https://github.com/krystophny/fortbench and saw that vLLM-MLX does a much better job performance-wise than llama.cpp. This should be the last one; benchmarks are now running overnight.

@waybarrios, @krystophny: independent endorsement plus a coordination note. Coordination note: the accumulator pattern in this PR (non-stream wrapping streaming) is also useful for wiring SpecPrefill into the non-streaming path, since SpecPrefill currently engages on the streaming surface only. If both fixes land in the same direction (streaming as the single source of truth for SimpleEngine execution), the result is cleaner than fixing them separately. Recommend landing this PR first, since it is smaller and establishes the pattern.
Thump604 left a comment
The accumulator-over-streaming shape is the right call. I've been sketching the same structure for SimpleEngine.generate() (the /v1/completions engine path) as a follow-up, since the bug class is symmetric: streaming has feature X, non-streaming forgets X. Landing this one first means the generate-side PR is a smaller delta against an established pattern rather than introducing the pattern.
One small question, not blocking: should _stream_chat_tool_llm eventually become the shared path for all SimpleEngine chat, or do you see it staying tool-specific? Either works; I'm just trying to gauge whether I should extend my generate() refactor to go through a shared helper or keep it on its own branch.
Approving based on the design and the test coverage you added.
…er stream_generate
stream_generate() is the only code path that consumes per-request
SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and
routes through _stream_generate_specprefill() when engaged. The prior
direct self._model.generate() path silently dropped those overrides:
server.py's create_completion() extracts them from extra_body and
forwards to engine.generate(), engine.generate() forwards via **kwargs
to _model.generate(), but _model.generate() (mlx_lm.generate) does not
consume them. Non-streaming /v1/completions clients that sent
`{"extra_body": {"specprefill": true}}` had their overrides silently
no-op'd.
Fix: make SimpleEngine.generate() a thin accumulator that iterates
self.stream_generate() and returns the last GenerationOutput. Matches
the pattern PR waybarrios#222 established for tool-enabled chat(). Non-streaming
clients now get:
- SpecPrefill engagement when `specprefill=true` is set (top-level or
extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because
mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the
streaming path
- Same thread-safety (stream_generate holds self._generation_lock
around the MLX call)
Scope: only generate() changes. chat() stays on its current path;
extending chat() to the full accumulator pattern is a separate
follow-up on top of PR waybarrios#222.
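The accumulator shape the commit describes is small enough to sketch. Below is a minimal stand-in, assuming simplified GenerationOutput fields and a stream_generate async generator; the real vllm_mlx signatures differ:

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional

# Simplified stand-ins for the vllm_mlx types; field names are assumptions.
@dataclass
class GenerationOutput:
    text: str = ""
    finish_reason: str = "stop"

class FakeSimpleEngine:
    async def stream_generate(self, prompt: str, **kwargs) -> AsyncIterator[GenerationOutput]:
        # Stand-in for the real streaming path; in the real engine, SpecPrefill
        # routing, the generation lock, and token accounting all live here.
        yield GenerationOutput(text="Hel", finish_reason="")
        yield GenerationOutput(text="Hello world", finish_reason="stop")

    async def generate(self, prompt: str, **kwargs) -> GenerationOutput:
        # Thin accumulator: iterate the streaming path and keep the last chunk,
        # so every per-request kwarg (e.g. specprefill) takes the same route.
        final: Optional[GenerationOutput] = None
        async for chunk in self.stream_generate(prompt, **kwargs):
            final = chunk
        if final is None:
            # Empty stream: return a safe default instead of raising.
            return GenerationOutput(text="", finish_reason="stop")
        return final

out = asyncio.run(FakeSimpleEngine().generate("hi", specprefill=True))
print(out.text, out.finish_reason)  # -> Hello world stop
```

Because generate() only consumes the stream, any behavior added to stream_generate() (SpecPrefill, locking, token counts) is inherited by the non-stream path for free.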
Tests:
- New test_generate_accumulates_over_stream_generate stubs
stream_generate with an async generator, calls generate() with
per-request specprefill kwargs, and asserts:
* final output fields (text, tokens, prompt_tokens,
completion_tokens, finish_reason, finished) match the last yielded
chunk
* specprefill / specprefill_keep_pct were forwarded through to
stream_generate
- New test_generate_empty_stream_returns_safe_default covers the
empty-stream edge case (returns GenerationOutput(text="",
finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking
so test_lock_prevents_concurrent_generate still observes
serialization through the new accumulator path
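The stubbing approach in the test notes can be illustrated like this; a hedged sketch in which the engine, chunk fields, and fixture wiring are simplified stand-ins, not the real test code:

```python
import asyncio

class StubEngine:
    """Minimal stand-in for SimpleEngine with a stubbed streaming path."""
    def __init__(self):
        self.seen_kwargs = None

    async def stream_generate(self, prompt, **kwargs):
        self.seen_kwargs = kwargs  # record what generate() forwarded
        yield {"text": "a", "finished": False}
        yield {"text": "ab", "finished": True, "finish_reason": "stop"}

    async def generate(self, prompt, **kwargs):
        final = None
        async for chunk in self.stream_generate(prompt, **kwargs):
            final = chunk
        return final if final is not None else {"text": "", "finish_reason": "stop"}

def test_generate_accumulates_over_stream_generate():
    eng = StubEngine()
    out = asyncio.run(eng.generate("p", specprefill=True, specprefill_keep_pct=30))
    # Final output matches the last yielded chunk...
    assert out["text"] == "ab" and out["finish_reason"] == "stop"
    # ...and the SpecPrefill overrides were forwarded through.
    assert eng.seen_kwargs == {"specprefill": True, "specprefill_keep_pct": 30}

test_generate_accumulates_over_stream_generate()
print("ok")
```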
Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2
Ultra with a ~6K token prompt and extra_body.specprefill=true forcing
SpecPrefill below the 8192 threshold:
SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s
prompt_tokens reporting is now 6007 (was always 0 before).
Related: companion PR waybarrios#265 (CompletionRequest schema + server-side
extract_body -> gen_kwargs threading) which opens the wire from
/v1/completions to engine.generate(). This PR closes the wire on the
engine side.
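For context, the server-side half that the commit refers to (extracting overrides with a top-level-first, extra_body-fallback precedence) could look roughly like this; the helper name and precedence order are assumptions, not the actual server.py code:

```python
# Hypothetical sketch of extra_body -> gen_kwargs threading for
# /v1/completions; extract_specprefill_kwargs is an invented name.
def extract_specprefill_kwargs(request: dict) -> dict:
    extra = request.get("extra_body") or {}
    gen_kwargs = {}
    for key in ("specprefill", "specprefill_keep_pct"):
        # Top-level value wins; extra_body is the fallback.
        if key in request:
            gen_kwargs[key] = request[key]
        elif key in extra:
            gen_kwargs[key] = extra[key]
    return gen_kwargs

print(extract_specprefill_kwargs(
    {"prompt": "hi", "extra_body": {"specprefill": True}}))
# -> {'specprefill': True}
```

With this PR's accumulator on the engine side, whatever the real extraction helper produces now actually reaches the code path that consumes it.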
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's llm.py additions. The server.py request handlers still pass top_k, min_p, presence_penalty, repetition_penalty through to SimpleEngine, which forwards them via **kwargs to MLXLanguageModel.chat() (which accepts **kwargs), which then calls self.generate(..., **kwargs). But MLXLanguageModel.generate() and stream_generate() had been left with only (temperature, top_p, repetition_penalty) in their signatures, so any non-MLLM SimpleEngine request crashed with:

TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in the feature matrix regression sweep after the Session 87 cherry-picks of PRs waybarrios#248, waybarrios#229, waybarrios#218, waybarrios#222 landed. The cherry-picks did not cause this regression; they exposed it by finally exercising the LLM-path tests that no one had run since the rebase. Confirmed via stderr.log:

TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler, _create_logits_processors, generate, and stream_generate to match bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter and non-str prompt support on stream_generate. Adds **kwargs to both generate and stream_generate so future param additions degrade gracefully instead of crashing.

This is a runtime-local fix. The equivalent upstream fix lives in bdf7dcc, which was never upstreamed (confirmed via git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR to upstream could carry this forward.

Verification:
- bin/verify-patches: 33/33 clean
- Full feature matrix regression sweep pending re-run after this commit.
Related: runtime PR waybarrios#265 fixed the CompletionRequest schema side of the same bdf7dcc drop; this commit fixes the engine-model side.
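The **kwargs tolerance the fix describes can be sketched in isolation; a minimal stand-in, assuming a simplified MLXLanguageModel (the real class and its sampling defaults differ):

```python
# Hypothetical sketch: accepting **kwargs lets unknown sampling params
# (top_k, min_p, future additions) degrade gracefully instead of raising
# TypeError. Parameter names follow the commit text, not the real signature.
class MLXLanguageModel:
    def generate(self, prompt, temperature=0.7, top_p=1.0,
                 repetition_penalty=None, top_k=0, min_p=0.0,
                 presence_penalty=0.0, **kwargs):
        # Any param not in the explicit signature lands in **kwargs
        # instead of crashing the request.
        return f"generated({prompt!r}, top_k={top_k}, extra={sorted(kwargs)})"

model = MLXLanguageModel()
# Before the fix this call path raised:
#   TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
print(model.generate("hi", top_k=40, some_future_param=1))
```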
…er stream_generate

Closes a real /v1/completions contract gap: prior to this commit, server.py::create_completion extracted per-request specprefill / specprefill_keep_pct and passed them as gen_kwargs to engine.generate(), and SimpleEngine.generate() forwarded them via **kwargs to MLXLanguageModel.generate() (which now accepts **kwargs since a8f58bf restored the bdf7dcc sampling params). Both the server and the engine advertised per-request SpecPrefill on /v1/completions, but the overrides were silently dropped because the direct self._model.generate() path never consumes the specprefill kwargs; they just flowed into **kwargs and landed in mlx_lm.generate, which ignores them.

Fix: make generate() a thin accumulator over stream_generate(). stream_generate() is the only code path that actually pops `specprefill` / `specprefill_keep_pct` from kwargs, threshold-checks, and routes to _stream_generate_specprefill. By iterating it and returning the last GenerationOutput, non-streaming /v1/completions clients get the same SpecPrefill engagement as streaming clients.

Verified against a live Qwen3.5-4B SimpleEngine runtime (simple-spec profile) with a 27 KB prompt (~6007 tokens, under the 8192 threshold, so forcibly enabled via extra_body.specprefill=true):

SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s

prompt_tokens reporting is now accurate (was always 0 on the old direct LLM path because mlx_lm.generate() never sets it).

Scope: only generate() gets the accumulator treatment in this commit. chat() stays on the direct path for non-tool LLM cases; the tool-enabled accumulator from the PR waybarrios#222 cherry-pick handles the chat side. The full chat() accumulator refactor is a follow-up that will layer on top of PR waybarrios#222 once it merges upstream.

This commit is independently upstreamable as a follow-up to PR waybarrios#248 (Phase 4 decode fix). Worth filing as a small PR on waybarrios/vllm-mlx to close the same gap upstream.
* fix: unify tool-enabled simple chat on streaming path
* fix: preserve simple chat contracts on streaming path
* fix: keep tool chat on the streaming execution path
* fix: preserve streamed completion token counts
The try/except block computing `tokens` via tokenizer.encode() was unused -- the return statement already reads from final_output.tokens.
7019a4d to f9468bc
Force-pushed a refresh onto current upstream
Refresh confirmed on head 15e3ca5 against upstream main b4fa030. No logic change on the accumulator path; the small vllm_mlx/engine/simple.py delta (+26/0) is the expected shape for the rebase.

On the helper question: keeping _stream_chat_tool_llm tool-specific for now is the pragmatic call. The pattern worth standardizing is "non-stream wraps streaming for featureful paths," not "one shared helper for every chat path." The plain LLM non-tool branch can get its own accumulator independently once #218 lands and the chat_template_kwargs forwarding is stable, rather than trying to collapse everything into one helper in a single PR.

CI green across the full matrix. Prior APPROVED review at f9468bc stands for the refreshed head.
I rebased this onto current main locally and validated it cleanly. GitHub compare sees the restack branch correctly, but is misbehaving on this one from the CLI, so I'm opening the writable replacement branch via the API path instead and will link it here once GitHub returns the PR URL.

Local validation on the restack branch:
- python -m py_compile vllm_mlx/engine/simple.py tests/test_simple_engine.py
- python -m black --check --fast vllm_mlx/engine/simple.py tests/test_simple_engine.py
- pytest -q tests/test_simple_engine.py
Replacement restack is up at #285: https://github.com/waybarrios/vllm-mlx/pull/285

That branch carries the rebased tool-chat aggregation change on a writable branch and passed local validation here:
- python -m py_compile vllm_mlx/engine/simple.py tests/test_simple_engine.py
- python -m black --check --fast vllm_mlx/engine/simple.py tests/test_simple_engine.py
- pytest -q tests/test_simple_engine.py

I'm using the replacement branch because the original computor-org fork could not be updated directly from this auth context.
Closing as superseded by #285, which carries the rebased SimpleEngine tool-chat aggregation change on a writable branch and validated cleanly on current main. The original computor-org fork branch could not be updated directly from this auth context, so the merge path moved to the replacement PR. |
…er stream_generate (#266)
Summary

Make tool-enabled non-stream SimpleEngine.chat() aggregate the existing streaming chat path instead of calling the separate blocking chat path.

Why

On some local models, non-stream tool-enabled chat stalls while the streaming path completes normally. This change keeps one tool-capable execution path for simple-engine chat and still returns a normal non-stream GenerationOutput.

What changed

Non-stream tool-enabled chat() now aggregates stream_chat() output.

Status

Rebased on main (b4fa030) on 2026-04-09.

Files to review

vllm_mlx/engine/simple.py
tests/test_simple_engine.py

Validation

python -m pytest tests/test_simple_engine.py::TestSimpleEngineConcurrency::test_chat_with_tools_aggregates_streaming_path -q -> 1 passed
Full tests/test_simple_engine.py run on current upstream depends on the separate harness refresh in #226.
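The change summarized above can be sketched as follows; a hedged stand-in in which ChatChunk, stream_chat, and the tool_calls shape are assumptions, not the real vllm_mlx API:

```python
import asyncio
from dataclasses import dataclass, field

# Simplified stand-in for the streamed chat chunk type; fields are assumptions.
@dataclass
class ChatChunk:
    text: str = ""
    tool_calls: list = field(default_factory=list)
    finish_reason: str = ""

class FakeSimpleEngine:
    async def stream_chat(self, messages, tools=None, **kwargs):
        # Stand-in for the tool-capable streaming chat path.
        yield ChatChunk(text="calling tool")
        yield ChatChunk(text="calling tool",
                        tool_calls=[{"name": "search", "arguments": "{}"}],
                        finish_reason="tool_calls")

    async def chat(self, messages, tools=None, **kwargs):
        # Non-stream tool chat rides the streaming path and keeps the last
        # chunk, so both surfaces share one tool-capable execution path.
        final = ChatChunk(finish_reason="stop")
        async for chunk in self.stream_chat(messages, tools=tools, **kwargs):
            final = chunk
        return final

out = asyncio.run(FakeSimpleEngine().chat(
    [{"role": "user", "content": "hi"}],
    tools=[{"name": "search"}]))
print(out.finish_reason, len(out.tool_calls))  # -> tool_calls 1
```

The point of the shape is that the non-stream result is just the terminal state of the stream, so a stall that only the blocking path could hit no longer has a path to hit.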