
simple-engine: keep tool chat on the streaming execution path#222

Closed
krystophny wants to merge 4 commits into waybarrios:main from computor-org:fix/simple-engine-tool-chat-upstream

Conversation

@krystophny
Contributor

@krystophny krystophny commented Mar 24, 2026

Summary

Make tool-enabled non-stream SimpleEngine.chat() aggregate the existing streaming chat path instead of calling the separate blocking chat path.

Why

On some local models, non-stream tool-enabled chat stalls while the streaming path completes normally. This change keeps one tool-capable execution path for simple-engine chat and still returns a normal non-stream GenerationOutput.

What changed

  • tool-enabled non-stream chat now aggregates stream_chat() output
  • preserve the streamed token ids from the final output
  • keep the streamed completion token count and final finish reason
  • add regression coverage for the tool-chat path

Status

  • refreshed onto current upstream main (b4fa030) on 2026-04-09
  • no logic changes beyond the base refresh

Files to review

  • vllm_mlx/engine/simple.py
  • tests/test_simple_engine.py

Validation

  • python -m pytest tests/test_simple_engine.py::TestSimpleEngineConcurrency::test_chat_with_tools_aggregates_streaming_path -q -> 1 passed
  • note: the broader async tests/test_simple_engine.py run on current upstream depends on the separate harness refresh in #226

@krystophny
Contributor Author

@waybarrios sorry for the PR spam. We are intensively testing a lot of LLMs with codex and opencode locally on https://github.com/krystophny/fortbench and saw that vLLM-MLX does a much better job performance-wise than llama.cpp. This should be the last one; benchmarks are now running overnight.

@Thump604
Collaborator

Thump604 commented Apr 8, 2026

@waybarrios, @krystophny: independent endorsement plus coordination note.

This PR makes tool-enabled non-stream SimpleEngine.chat() aggregate the existing streaming chat path instead of calling a separate blocking chat path. The accumulator pattern is the right shape: one source of truth for the chat execution, with non-stream wrapping streaming. Mergeable on current main.

Coordination note: the accumulator pattern in this PR (non-stream wrapping streaming) is also useful for wiring SpecPrefill into the non-streaming path, since SpecPrefill currently engages on the streaming surface only. If both fixes land in the same direction (streaming as the single source of truth for SimpleEngine execution), the result is cleaner than fixing them separately. Recommend this PR first since it is smaller and establishes the pattern.

Collaborator

@Thump604 Thump604 left a comment


The accumulator-over-streaming shape is the right call. I've been sketching the same structure for SimpleEngine.generate() (the /v1/completions engine path) as a follow-up, since the bug class is symmetric: streaming has feature X, non-streaming forgets X. Landing this one first means the generate-side PR is a smaller delta against an established pattern rather than introducing the pattern.

One small question, not blocking: should _stream_chat_tool_llm eventually become the shared path for all SimpleEngine chat, or do you see it staying tool-specific? Either works; I'm just trying to gauge whether I should extend my generate() refactor to go through a shared helper or keep it on its own branch.

Approving based on the design and the test coverage you added.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 8, 2026
…er stream_generate

stream_generate() is the only code path that consumes per-request
SpecPrefill overrides (`specprefill`, `specprefill_keep_pct`) and
routes through _stream_generate_specprefill() when engaged. The prior
direct self._model.generate() path silently dropped those overrides:
server.py's create_completion() extracts them from extra_body and
forwards to engine.generate(), engine.generate() forwards via **kwargs
to _model.generate(), but _model.generate() (mlx_lm.generate) does not
consume them. Non-streaming /v1/completions clients that sent
`{"extra_body": {"specprefill": true}}` had their overrides silently
no-op'd.

Fix: make SimpleEngine.generate() a thin accumulator that iterates
self.stream_generate() and returns the last GenerationOutput. Matches
the pattern PR waybarrios#222 established for tool-enabled chat(). Non-streaming
clients now get:

- SpecPrefill engagement when `specprefill=true` is set (top-level or
  extra_body fallback via whatever helper server.py uses)
- Accurate `prompt_tokens` reporting (the old path returned 0 because
  mlx_lm.generate never populates it)
- Chat-template and reasoning-parser behavior consistent with the
  streaming path
- Same thread-safety (stream_generate holds self._generation_lock
  around the MLX call)

Scope: only generate() changes. chat() stays on its current path;
extending chat() to the full accumulator pattern is a separate
follow-up on top of PR waybarrios#222.

Tests:
- New test_generate_accumulates_over_stream_generate stubs
  stream_generate with an async generator, calls generate() with
  per-request specprefill kwargs, and asserts:
  * final output fields (text, tokens, prompt_tokens,
    completion_tokens, finish_reason, finished) match the last yielded
    chunk
  * specprefill / specprefill_keep_pct were forwarded through to
    stream_generate
- New test_generate_empty_stream_returns_safe_default covers the
  empty-stream edge case (returns GenerationOutput(text="",
  finish_reason="stop") rather than raising)
- Existing mock_model fixture extended with stream_generate tracking
  so test_lock_prevents_concurrent_generate still observes
  serialization through the new accumulator path

Verified live against Qwen3.5-4B SimpleEngine + SpecPrefill on M2
Ultra with a ~6K token prompt and extra_body.specprefill=true forcing
SpecPrefill below the 8192 threshold:

  SpecPrefill: scored 6007 tokens in 5.3s, sparse prefill 1815/6007 (keep=30%) in 1.1s

prompt_tokens reporting is now 6007 (was always 0 before).

Related: companion PR waybarrios#265 (CompletionRequest schema + server-side
extract_body -> gen_kwargs threading) which opens the wire from
/v1/completions to engine.generate(). This PR closes the wire on the
engine side.
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 9, 2026
…l generate+stream_generate

Pre-existing regression from an earlier rebase that dropped bdf7dcc's
llm.py additions. The server.py request handlers still pass top_k,
min_p, presence_penalty, repetition_penalty through to SimpleEngine,
which forwards them via **kwargs to MLXLanguageModel.chat() (which
accepts **kwargs) which then calls self.generate(..., **kwargs). But
MLXLanguageModel.generate() and stream_generate() had been left with
only (temperature, top_p, repetition_penalty) in their signatures, so
any non-MLLM SimpleEngine request crashed with:

    TypeError: MLXLanguageModel.stream_generate() got an unexpected
    keyword argument 'top_k'

Observed as 0/6 on simple-base, simple-mtp, and simple-spec profiles in
the feature matrix regression sweep after the Session 87 cherry-picks
of PRs waybarrios#248, waybarrios#229, waybarrios#218, waybarrios#222 landed. The cherry-picks did not cause
this regression — they exposed it by finally running the LLM-path
tests that no one had exercised since the rebase happened. Confirmed
via stderr.log:

  TypeError: MLXLanguageModel.generate() got an unexpected keyword argument 'top_k'
  TypeError: MLXLanguageModel.stream_generate() got an unexpected keyword argument 'top_k'

Fix: restore the signatures and bodies of _create_sampler,
_create_logits_processors, generate, and stream_generate to match
bdf7dcc's original intent. Preserves PR waybarrios#248's prompt_cache parameter
and non-str prompt support on stream_generate. Adds **kwargs to both
generate and stream_generate so future param additions degrade
gracefully instead of crashing.
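The degrade-gracefully idea here, a trailing `**kwargs` so newer server-side sampling params do not crash the model call, looks roughly like this. The signature is illustrative only; the real MLXLanguageModel method signatures are not shown in this thread.

```python
# Illustrative signature only: the point is the trailing **kwargs, which
# lets newer server-side sampling params arrive without a TypeError.
def stream_generate(prompt, temperature=1.0, top_p=1.0, top_k=0,
                    min_p=0.0, presence_penalty=0.0,
                    repetition_penalty=1.0, **kwargs):
    # Unknown future params land in kwargs and are ignored rather than
    # crashing the request with "unexpected keyword argument".
    return {"temperature": temperature, "top_k": top_k, "min_p": min_p,
            "ignored": sorted(kwargs)}
```

Without the `**kwargs`, any caller that forwards a param the signature does not list (as server.py does here with `top_k`) raises a TypeError at call time, which is exactly the crash quoted above.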

This is a runtime-local fix. The equivalent upstream fix lives in
bdf7dcc which was never upstreamed (confirmed via
git merge-base --is-ancestor bdf7dcc upstream/main). A follow-up PR
to upstream could carry this forward.

Verification:
  bin/verify-patches: 33/33 clean
  Full feature matrix regression sweep pending re-run after this commit.

Related: runtime PR waybarrios#265 (waybarrios#265) fixed the
CompletionRequest schema side of the same bdf7dcc drop; this commit
fixes the engine-model side.
Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 9, 2026
…er stream_generate

Closes a real /v1/completions contract gap: prior to this commit,
server.py::create_completion extracted per-request specprefill /
specprefill_keep_pct and passed them as gen_kwargs to
engine.generate(), and SimpleEngine.generate() forwarded them via
**kwargs to MLXLanguageModel.generate() (which now accepts **kwargs
since a8f58bf restored the bdf7dcc sampling params). Both the server
and the engine advertised per-request SpecPrefill on /v1/completions,
but the overrides were silently dropped because the direct
self._model.generate() path never consumes the specprefill kwargs —
they just flowed into **kwargs and landed in mlx_lm.generate which
ignores them.

Fix: make generate() a thin accumulator over stream_generate().
stream_generate() is the only code path that actually pops
`specprefill` / `specprefill_keep_pct` from kwargs, threshold-checks,
and routes to _stream_generate_specprefill. By iterating it and
returning the last GenerationOutput, non-streaming /v1/completions
clients get the same SpecPrefill engagement as streaming clients.

Verified against a live Qwen3.5-4B SimpleEngine runtime (simple-spec
profile) with a 27 KB prompt (~6007 tokens, under the 8192 threshold
so forcibly enabled via extra_body.specprefill=true):

  SpecPrefill: scored 6007 tokens in 5.3s,
  sparse prefill 1815/6007 (keep=30%) in 1.1s

prompt_tokens reporting is now accurate (was always 0 on the old
direct LLM path because mlx_lm.generate() never sets it).

Scope: only generate() gets the accumulator treatment in this commit.
chat() stays on the direct path for non-tool LLM cases; the
tool-enabled accumulator from PR waybarrios#222 cherry-pick handles the chat
side. Full chat() accumulator refactor is a follow-up that will
layer on top of PR waybarrios#222 once it merges upstream.

This commit is independently upstreamable as a follow-up to PR waybarrios#248
(Phase 4 decode fix). Worth filing as a small PR on
waybarrios/vllm-mlx to close the same gap upstream.
* fix: unify tool-enabled simple chat on streaming path

* fix: preserve simple chat contracts on streaming path

* fix: keep tool chat on the streaming execution path

* fix: preserve streamed completion token counts
The try/except block computing `tokens` via tokenizer.encode() was
unused -- the return statement already reads from final_output.tokens.
@krystophny krystophny force-pushed the fix/simple-engine-tool-chat-upstream branch from 7019a4d to f9468bc on April 9, 2026 06:35
@krystophny
Contributor Author

Force-pushed a refresh onto current upstream main (b4fa030). No logic change beyond the base refresh. Re the helper question: short term I'd keep _stream_chat_tool_llm tool-specific. The pattern I want to standardize is 'non-stream wraps streaming for featureful paths', not forcing all chat paths through one shared helper yet. Validation: python -m pytest tests/test_simple_engine.py::TestSimpleEngineConcurrency::test_chat_with_tools_aggregates_streaming_path -q -> 1 passed.

@Thump604
Collaborator

Thump604 commented Apr 9, 2026

Refresh confirmed on head 15e3ca5 against upstream main b4fa030. No logic change on the accumulator path; the small vllm_mlx/engine/simple.py delta (+26/0) is the expected shape for the rebase.

On the helper question: keeping _stream_chat_tool_llm tool-specific for now is the pragmatic call. The pattern worth standardizing is "non-stream wraps streaming for featureful paths," not "one shared helper for every chat path." The plain LLM non-tool branch can get its own accumulator independently once #218 lands and the chat_template_kwargs forwarding is stable, rather than trying to collapse everything into one helper in a single PR.

CI green across the full matrix. Prior APPROVED review at f9468bc stands for the refreshed head.

Thump604 added a commit to Thump604/vllm-mlx that referenced this pull request Apr 11, 2026
…er stream_generate

(Commit message identical to the Apr 8 commit above.)
@Thump604
Collaborator

I rebased this onto current main locally and validated it cleanly. GitHub compare sees the restack branch correctly, but is misbehaving on this one from the CLI, so I'm opening the writable replacement branch via the API path instead and will link it here once GitHub returns the PR URL. Local validation on the restack branch:

  • python -m py_compile vllm_mlx/engine/simple.py tests/test_simple_engine.py
  • python -m black --check --fast vllm_mlx/engine/simple.py tests/test_simple_engine.py
  • pytest -q tests/test_simple_engine.py

@Thump604
Collaborator

Replacement restack is up at #285: https://github.com/waybarrios/vllm-mlx/pull/285

That branch carries the rebased tool-chat aggregation change on a writable branch and passed local validation here:

  • python -m py_compile vllm_mlx/engine/simple.py tests/test_simple_engine.py
  • python -m black --check --fast vllm_mlx/engine/simple.py tests/test_simple_engine.py
  • pytest -q tests/test_simple_engine.py

I'm using the replacement branch because the original computor-org fork could not be updated directly from this auth context.

@Thump604
Collaborator

Closing as superseded by #285, which carries the rebased SimpleEngine tool-chat aggregation change on a writable branch and validated cleanly on current main. The original computor-org fork branch could not be updated directly from this auth context, so the merge path moved to the replacement PR.

@Thump604 Thump604 closed this Apr 11, 2026
janhilgard pushed a commit that referenced this pull request Apr 12, 2026
…er stream_generate (#266)

(Commit message identical to the Apr 8 commit above.)
