simple-engine: keep tool chat on the streaming execution path #222
krystophny wants to merge 4 commits into waybarrios:main
Conversation
@waybarrios sorry for the PR spam. We are intensively testing a lot of LLMs with codex and opencode locally on https://github.com/krystophny/fortbench and saw that vLLM-MLX does a much better job performance-wise than llama.cpp. This should be the last one; benchmarks are now running overnight.

@waybarrios, @krystophny: independent endorsement plus a coordination note. Coordination note: the accumulator pattern in this PR (non-stream wrapping streaming) is also useful for wiring SpecPrefill into the non-streaming path, since SpecPrefill currently engages on the streaming surface only. If both fixes land in the same direction (streaming as the single source of truth for SimpleEngine execution), the result is cleaner than fixing them separately. Recommend landing this PR first, since it is smaller and establishes the pattern.
Thump604 left a comment
The accumulator-over-streaming shape is the right call. I've been sketching the same structure for SimpleEngine.generate() (the /v1/completions engine path) as a follow-up, since the bug class is symmetric: streaming has feature X, non-streaming forgets X. Landing this one first means the generate-side PR is a smaller delta against an established pattern rather than introducing the pattern.
One small question, not blocking: should _stream_chat_tool_llm eventually become the shared path for all SimpleEngine chat, or do you see it staying tool-specific? Either works; I'm just trying to gauge whether I should extend my generate() refactor to go through a shared helper or keep it on its own branch.
Approving based on the design and the test coverage you added.
…er stream_generate
stream_generate() is the only code path that consumes per-request
SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and
routes through _stream_generate_specprefill() when engaged. The prior
direct self._model.generate() path silently dropped those overrides:
server.py's create_completion() extracts them from extra_body and
forwards to engine.generate(), engine.generate() forwards via **kwargs
to _model.generate(), but _model.generate() (mlx_lm.generate) does not
consume them. Non-streaming /v1/completions clients that sent
`{"extra_body": {"specprefill": true}}` had their overrides silently
no-op'd.
Fix: make SimpleEngine.generate() a thin accumulator that iterates
self.stream_generate() and returns the last GenerationOutput. Matches
the pattern PR waybarrios#222 established for tool-enabled chat(). Non-streaming
clients now get:
- SpecPrefill engagement when `specprefill=true` is set (top-level or
extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because
mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the
streaming path
- Same thread-safety (stream_generate holds self._generation_lock
around the MLX call)
Scope: only generate() changes. chat() stays on its current path;
extending chat() to the full accumulator pattern is a separate
follow-up on top of PR waybarrios#222.
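The accumulator shape the commit describes is small enough to sketch. Below is a minimal stand-in, assuming simplified GenerationOutput fields and a stream_generate async generator; the real vllm_mlx signatures differ:

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Optional

# Simplified stand-ins for the vllm_mlx types; field names are assumptions.
@dataclass
class GenerationOutput:
    text: str = ""
    finish_reason: str = "stop"

class FakeSimpleEngine:
    async def stream_generate(self, prompt: str, **kwargs) -> AsyncIterator[GenerationOutput]:
        # Stand-in for the real streaming path; in the real engine, SpecPrefill
        # routing, the generation lock, and token accounting all live here.
        yield GenerationOutput(text="Hel", finish_reason="")
        yield GenerationOutput(text="Hello world", finish_reason="stop")

    async def generate(self, prompt: str, **kwargs) -> GenerationOutput:
        # Thin accumulator: iterate the streaming path and keep the last chunk,
        # so every per-request kwarg (e.g. specprefill) takes the same route.
        final: Optional[GenerationOutput] = None
        async for chunk in self.stream_generate(prompt, **kwargs):
            final = chunk
        if final is None:
            # Empty stream: return a safe default instead of raising.
            return GenerationOutput(text="", finish_reason="stop")
        return final

out = asyncio.run(FakeSimpleEngine().generate("hi", specprefill=True))
print(out.text, out.finish_reason)  # -> Hello world stop
```

Because generate() only consumes the stream, any behavior added to stream_generate() (SpecPrefill, locking, token counts) is inherited by the non-stream path for free.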
Tests:
- New test_generate_accumulates_over_stream_generate stubs
stream_generate with an async generator, calls generate() with
per-request specprefill kwargs, and asserts:
* final output fields (text, tokens, prompt_tokens,
completion_tokens, finish_reason, finished) match the last yielded
chunk
* specprefill / specprefill_keep_pct were forwarded through to
stream_generate
- New test_generate_empty_stream_returns_safe_default covers the
empty-stream edge case (returns GenerationOutput(text="",
finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking
so test_lock_prevents_concurrent_generate still observes
serialization through the new accumulator path
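The stubbing approach in the test notes can be illustrated like this; a hedged sketch in which the engine, chunk fields, and fixture wiring are simplified stand-ins, not the real test code:

```python
import asyncio

class StubEngine:
    """Minimal stand-in for SimpleEngine with a stubbed streaming path."""
    def __init__(self):
        self.seen_kwargs = None

    async def stream_generate(self, prompt, **kwargs):
        self.seen_kwargs = kwargs  # record what generate() forwarded
        yield {"text": "a", "finished": False}
        yield {"text": "ab", "finished": True, "finish_reason": "stop"}

    async def generate(self, prompt, **kwargs):
        final = None
        async for chunk in self.stream_generate(prompt, **kwargs):
            final = chunk
        return final if final is not None else {"text": "", "finish_reason": "stop"}

def test_generate_accumulates_over_stream_generate():
    eng = StubEngine()
    out = asyncio.run(eng.generate("p", specprefill=True, specprefill_keep_pct=30))
    # Final output matches the last yielded chunk...
    assert out["text"] == "ab" and out["finish_reason"] == "stop"
    # ...and the SpecPrefill overrides were forwarded through.
    assert eng.seen_kwargs == {"specprefill": True, "specprefill_keep_pct": 30}

test_generate_accumulates_over_stream_generate()
print("ok")
```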
Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2
Ultra with a ~6K token prompt and extra_body.specprefill=true forcing
SpecPrefill below the 8192 threshold:
SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s
prompt_tokens reporting is now 6007 (was always 0 before).
Related: companion PR waybarrios#265 (CompletionRequest schema + server-side
extract_body -> gen_kwargs threading) which opens the wire from
/v1/completions to engine.generate(). This PR closes the wire on the
engine side.
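For context, the server-side half that the commit refers to (extracting overrides with a top-level-first, extra_body-fallback precedence) could look roughly like this; the helper name and precedence order are assumptions, not the actual server.py code:

```python
# Hypothetical sketch of extra_body -> gen_kwargs threading for
# /v1/completions; extract_specprefill_kwargs is an invented name.
def extract_specprefill_kwargs(request: dict) -> dict:
    extra = request.get("extra_body") or {}
    gen_kwargs = {}
    for key in ("specprefill", "specprefill_keep_pct"):
        # Top-level value wins; extra_body is the fallback.
        if key in request:
            gen_kwargs[key] = request[key]
        elif key in extra:
            gen_kwargs[key] = extra[key]
    return gen_kwargs

print(extract_specprefill_kwargs(
    {"prompt": "hi", "extra_body": {"specprefill": True}}))
# -> {'specprefill': True}
```

With this PR's accumulator on the engine side, whatever the real extraction helper produces now actually reaches the code path that consumes it.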
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's llm.py additions. The server.py request handlers still pass top_k, min_p, presence_penalty, repetition_penalty through to SimpleEngine, which forwards them via **kwargs to MLXLanguageModel.chat() (which accepts **kwargs), which then calls self.generate(..., **kwargs). But MLXLanguageModel.generate() and stream_generate() had been left with only (temperature, top_p, repetition_penalty) in their signatures, so any non-MLLM SimpleEngine request crashed with:

TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in the feature matrix regression sweep after the Session 87 cherry-picks of PRs waybarrios#248, waybarrios#229, waybarrios#218, waybarrios#222 landed. The cherry-picks did not cause this regression; they exposed it by finally exercising the LLM-path tests that no one had run since the rebase. Confirmed via stderr.log:

TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler, _create_logits_processors, generate, and stream_generate to match bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter and non-str prompt support on stream_generate. Adds **kwargs to both generate and stream_generate so future param additions degrade gracefully instead of crashing.

This is a runtime-local fix. The equivalent upstream fix lives in bdf7dcc, which was never upstreamed (confirmed via git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR to upstream could carry this forward.

Verification:
- bin/verify-patches: 33/33 clean
- Full feature matrix regression sweep pending re-run after this commit.
Related: runtime PR waybarrios#265 fixed the CompletionRequest schema side of the same bdf7dcc drop; this commit fixes the engine-model side.
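The **kwargs tolerance the fix describes can be sketched in isolation; a minimal stand-in, assuming a simplified MLXLanguageModel (the real class and its sampling defaults differ):

```python
# Hypothetical sketch: accepting **kwargs lets unknown sampling params
# (top_k, min_p, future additions) degrade gracefully instead of raising
# TypeError. Parameter names follow the commit text, not the real signature.
class MLXLanguageModel:
    def generate(self, prompt, temperature=0.7, top_p=1.0,
                 repetition_penalty=None, top_k=0, min_p=0.0,
                 presence_penalty=0.0, **kwargs):
        # Any param not in the explicit signature lands in **kwargs
        # instead of crashing the request.
        return f"generated({prompt!r}, top_k={top_k}, extra={sorted(kwargs)})"

model = MLXLanguageModel()
# Before the fix this call path raised:
#   TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
print(model.generate("hi", top_k=40, some_future_param=1))
```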
…er stream_generate

Closes a real /v1/completions contract gap: prior to this commit, server.py::create_completion extracted per-request specprefill / specprefill_keep_pct and passed them as gen_kwargs to engine.generate(), and SimpleEngine.generate() forwarded them via **kwargs to MLXLanguageModel.generate() (which now accepts **kwargs since a8f58bf restored the bdf7dcc sampling params). Both the server and the engine advertised per-request SpecPrefill on /v1/completions, but the overrides were silently dropped because the direct self._model.generate() path never consumes the specprefill kwargs; they just flowed into **kwargs and landed in mlx_lm.generate, which ignores them.

Fix: make generate() a thin accumulator over stream_generate(). stream_generate() is the only code path that actually pops `specprefill` / `specprefill_keep_pct` from kwargs, threshold-checks, and routes to _stream_generate_specprefill. By iterating it and returning the last GenerationOutput, non-streaming /v1/completions clients get the same SpecPrefill engagement as streaming clients.

Verified against a live Qwen3.5-4B SimpleEngine runtime (simple-spec profile) with a 27 KB prompt (~6007 tokens, under the 8192 threshold, so forcibly enabled via extra_body.specprefill=true):

SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s

prompt_tokens reporting is now accurate (was always 0 on the old direct LLM path because mlx_lm.generate() never sets it).

Scope: only generate() gets the accumulator treatment in this commit. chat() stays on the direct path for non-tool LLM cases; the tool-enabled accumulator from the PR waybarrios#222 cherry-pick handles the chat side. The full chat() accumulator refactor is a follow-up that will layer on top of PR waybarrios#222 once it merges upstream.

This commit is independently upstreamable as a follow-up to PR waybarrios#248 (Phase 4 decode fix). Worth filing as a small PR on waybarrios/vllm-mlx to close the same gap upstream.
* fix: unify tool-enabled simple chat on streaming path
* fix: preserve simple chat contracts on streaming path
* fix: keep tool chat on the streaming execution path
* fix: preserve streamed completion token counts
The try/except block computing `tokens` via tokenizer.encode() was unused -- the return statement already reads from final_output.tokens.
7019a4d to f9468bc
Force-pushed a refresh onto current upstream
Refresh confirmed on head 15e3ca5 against upstream main b4fa030. No logic change on the accumulator path; the small vllm_mlx/engine/simple.py delta (+26/0) is the expected shape for the rebase.

On the helper question: keeping _stream_chat_tool_llm tool-specific for now is the pragmatic call. The pattern worth standardizing is "non-stream wraps streaming for featureful paths," not "one shared helper for every chat path." The plain LLM non-tool branch can get its own accumulator independently once #218 lands and the chat_template_kwargs forwarding is stable, rather than trying to collapse everything into one helper in a single PR.

CI green across the full matrix. Prior APPROVED review at f9468bc stands for the refreshed head.
I rebased this onto current main locally and validated it cleanly. GitHub compare sees the restack branch correctly, but is misbehaving on this one from the CLI, so I'm opening the writable replacement branch via the API path instead and will link it here once GitHub returns the PR URL.

Local validation on the restack branch:
- python -m py_compile vllm_mlx/engine/simple.py tests/test_simple_engine.py
- python -m black --check --fast vllm_mlx/engine/simple.py tests/test_simple_engine.py
- pytest -q tests/test_simple_engine.py
Replacement restack is up at #285: https://github.com/waybarrios/vllm-mlx/pull/285

That branch carries the rebased tool-chat aggregation change on a writable branch and passed local validation here:
- python -m py_compile vllm_mlx/engine/simple.py tests/test_simple_engine.py
- python -m black --check --fast vllm_mlx/engine/simple.py tests/test_simple_engine.py
- pytest -q tests/test_simple_engine.py

I'm using the replacement branch because the original computor-org fork could not be updated directly from this auth context.
Closing as superseded by #285, which carries the rebased SimpleEngine tool-chat aggregation change on a writable branch and validated cleanly on current main. The original computor-org fork branch could not be updated directly from this auth context, so the merge path moved to the replacement PR. |
…er stream_generate (#266)
Summary

Make tool-enabled non-stream SimpleEngine.chat() aggregate the existing streaming chat path instead of calling the separate blocking chat path.

Why

On some local models, non-stream tool-enabled chat stalls while the streaming path completes normally. This change keeps one tool-capable execution path for simple-engine chat and still returns a normal non-stream GenerationOutput.

What changed

Non-stream tool-enabled chat() now aggregates stream_chat() output.

Status

Rebased on main (b4fa030) on 2026-04-09.

Files to review

vllm_mlx/engine/simple.py
tests/test_simple_engine.py

Validation

python -m pytest tests/test_simple_engine.py::TestSimpleEngineConcurrency::test_chat_with_tools_aggregates_streaming_path -q -> 1 passed
Full tests/test_simple_engine.py run on current upstream depends on the separate harness refresh in #226.
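The change summarized above can be sketched as follows; a hedged stand-in in which ChatChunk, stream_chat, and the tool_calls shape are assumptions, not the real vllm_mlx API:

```python
import asyncio
from dataclasses import dataclass, field

# Simplified stand-in for the streamed chat chunk type; fields are assumptions.
@dataclass
class ChatChunk:
    text: str = ""
    tool_calls: list = field(default_factory=list)
    finish_reason: str = ""

class FakeSimpleEngine:
    async def stream_chat(self, messages, tools=None, **kwargs):
        # Stand-in for the tool-capable streaming chat path.
        yield ChatChunk(text="calling tool")
        yield ChatChunk(text="calling tool",
                        tool_calls=[{"name": "search", "arguments": "{}"}],
                        finish_reason="tool_calls")

    async def chat(self, messages, tools=None, **kwargs):
        # Non-stream tool chat rides the streaming path and keeps the last
        # chunk, so both surfaces share one tool-capable execution path.
        final = ChatChunk(finish_reason="stop")
        async for chunk in self.stream_chat(messages, tools=tools, **kwargs):
            final = chunk
        return final

out = asyncio.run(FakeSimpleEngine().chat(
    [{"role": "user", "content": "hi"}],
    tools=[{"name": "search"}]))
print(out.finish_reason, len(out.tool_calls))  # -> tool_calls 1
```

The point of the shape is that the non-stream result is just the terminal state of the stream, so a stall that only the blocking path could hit no longer has a path to hit.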