ci: Add vLLM support to integration testing infrastructure #3128
Conversation
    and len(response_body) > 0
    and isinstance(response_body[0], ChatCompletionChunk)
):
    # We can't replay chat completions with the same id because we store them in a sqlite database with a unique constraint on the id.
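For context, a minimal sketch of the replay-time ID rewrite being discussed here, assuming the recorded chunks are openai `ChatCompletionChunk` pydantic models; `regenerate_chunk_ids` is an illustrative name, not the PR's actual code:

```python
import uuid

from openai.types.chat import ChatCompletionChunk


def regenerate_chunk_ids(response_body: list[ChatCompletionChunk]) -> list[ChatCompletionChunk]:
    # All chunks of one streamed completion share a single id, so mint one new id
    # and apply it to every chunk of the replayed response.
    new_id = "chatcmpl-" + uuid.uuid4().hex
    return [chunk.model_copy(update={"id": new_id}) for chunk in response_body]
```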
can you explain this situation in more detail? does this happen because we have both ollama and vllm in the same DB or some other reason?
No, we have some tests in the same test run using identical inference requests (both using vllm).
When this happens they use the same recorded request and get back the same recorded chat-id,
e.g. in the two variants of
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=vllm/meta-llama/Llama-3.2-1B-Instruct-inference:chat_completion:non_streaming_01] PASSED
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=vllm/meta-llama/Llama-3.2-1B-Instruct-inference:chat_completion:non_streaming_01] FAILED
The second one fails because the vllm provider already stored the first response in the DB with that ID, so for the second the server errors with something like:
INFO 2025-08-13 15:41:00,486 console_span_processor:62 telemetry: 14:41:00.412 [ERROR] Error executing endpoint
route='/v1/openai/v1/chat/completions' method='post': (sqlite3.IntegrityError) UNIQUE constraint failed: chat_completions.id
[SQL: INSERT INTO chat_completions (id, created, model, choices, input_messages, access_attributes, owner_principal) VALUES (?, ?, ?, ?, ?,
?, ?)]
[parameters: ('chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68', 1755095964, 'meta-llama/Llama-3.2-1B-Instruct', '[{"finish_reason": "stop",
"index": 0, "logprobs": null, "message": {"content": "Humans do not live on any planet. Humans live on Earth, which is the ... (223
characters truncated) ... "role": "assistant", "annotations": null, "audio": null, "function_call": null, "tool_calls": null,
"reasoning_content": null}, "stop_reason": null}]', '[{"role": "user", "content": "Which planet do humans live on?", "name": null}]', 'null',
None)]
@derekhiggins a bit confused -- this code is being changed during replay time, not recording time. The duplicate error would have happened during recording time right? How does this fix prevent that?
During replay, on the first test the recorded data is retrieved with an ID and inserted into the chat_completions table here (I think)
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/core/routers/inference.py#L530
Then the second test comes along and tries to do the same with its response (which is the same recorded data) and fails, because the response has the same ID as the previous test and the id is a primary key that must be unique
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/utils/inference/inference_store.py#L37
It looks like we don't hit this problem with ollama because the id is (re)created by the provider
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/remote/inference/ollama/ollama.py#L605
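To make the failure mode concrete, here is a self-contained sketch of the collision using plain sqlite3 (the schema is trimmed to the columns that matter; the real store goes through the sqlstore layer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chat_completions (id TEXT PRIMARY KEY, model TEXT)")
row = ("chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68", "meta-llama/Llama-3.2-1B-Instruct")

conn.execute("INSERT INTO chat_completions (id, model) VALUES (?, ?)", row)  # first test: ok
try:
    conn.execute("INSERT INTO chat_completions (id, model) VALUES (?, ?)", row)  # second test: same recorded id
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed: chat_completions.id
```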
I too have been seeing some weirdness with replay when it comes to chat completion ids, timestamps, etc. I wonder if something in the logic is slightly off here.
I think we should simply make inference_store robust to collisions with an ON DUPLICATE IGNORE kind of clause.
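For illustration, SQLite's spelling of that idea is `INSERT OR IGNORE` (or `INSERT ... ON CONFLICT DO NOTHING`); a sketch with plain sqlite3, noting the real change would have to be threaded through the sqlstore insert API rather than raw SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chat_completions (id TEXT PRIMARY KEY, model TEXT)")
row = ("chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68", "meta-llama/Llama-3.2-1B-Instruct")

# The duplicate insert becomes a no-op instead of raising IntegrityError.
conn.execute("INSERT OR IGNORE INTO chat_completions (id, model) VALUES (?, ?)", row)
conn.execute("INSERT OR IGNORE INTO chat_completions (id, model) VALUES (?, ?)", row)
print(conn.execute("SELECT COUNT(*) FROM chat_completions").fetchone()[0])  # -> 1
```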
I'm concerned this would result in LLS silently ignoring problems with the upstream API; wouldn't it be better to refuse to deal with duplicate IDs?
If we do this (i.e. ON DUPLICATE IGNORE), I don't see where it would be passed into the sqlstore. I guess it's a new param that would need to be added to the insert API?
Maybe Stack should generate an ID for our own purposes (not use the ID we get from inference providers)?
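A sketch of that idea; the helper name and where it would hook in are assumptions, the point being that the id Stack persists is minted locally rather than taken from the provider response:

```python
import uuid

from openai.types.chat import ChatCompletion


def with_stack_owned_id(completion: ChatCompletion) -> ChatCompletion:
    # Replace the provider-supplied id before the router hands the response to the
    # inference store, so duplicate provider ids (e.g. from replayed recordings)
    # can never violate the chat_completions primary key.
    return completion.model_copy(update={"id": f"chatcmpl-stack-{uuid.uuid4().hex}"})
```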
Yeah that's a reasonable solution @ehhuang
This seems like a good change. I've so far avoided any changes to the implementation of the provider to accommodate record/replay; can I tackle this as a separate piece of work (update the provider and remove this workaround), which I'll take on ASAP? If so I can open a new issue.
Force-pushed from 1f29aaa to 6a4da14
def patched_model_list(self, *args, **kwargs):
    # The original models.list() returns an AsyncPaginator that can be used with async for
    # We need to create a wrapper that preserves this behavior
    class PatchedAsyncPaginator:
is this still needed now that you are storing the chunks themselves? whichever way we replay back a streaming chat completion, that same way should work for this also?
Yes, this or something similar is needed; we need an object with __aiter__:
ERROR 2025-08-18 10:52:35,247 llama_stack.core.routing_tables.models:36 core: Model refresh failed for provider vllm: 'async for' requires an
object with __aiter__ method, got coroutine
self.client.models.list() needs to return something directly async-iterable so it can be used as async for model in client.models.list()
(it needs an object with __aiter__())
and we also need the object to support await client.models.list()
(it needs __await__())
I have tried multiple things with the _patched_inference_method but I haven't managed to re-use it.
The best I can do (at least that I can see) is the most recent version, which is a bit simpler.
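A minimal standalone sketch of what such a wrapper has to provide (class and variable names here are illustrative, not the PR's exact code):

```python
import asyncio


class AsyncIterableModelsWrapper:
    """Supports both `async for m in wrapper` and `page = await wrapper; async for m in page`."""

    def __init__(self, models: list[dict]):
        self._models = models

    def __aiter__(self):
        # Returning an async generator satisfies "async for m in client.models.list()".
        return self._iter()

    async def _iter(self):
        for model in self._models:
            yield model

    def __await__(self):
        # Awaiting the wrapper resolves to the wrapper itself, so
        # "await client.models.list()" also yields something async-iterable.
        async def _self():
            return self

        return _self().__await__()


async def _demo():
    recorded_models = [{"id": "meta-llama/Llama-3.2-1B-Instruct"}]
    async for m in AsyncIterableModelsWrapper(recorded_models):  # pattern 1
        print(m["id"])
    page = await AsyncIterableModelsWrapper(recorded_models)  # pattern 2
    async for m in page:
        print(m["id"])


if __name__ == "__main__":
    asyncio.run(_demo())
```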
Force-pushed from 19d1f6a to 338848c
When replaying recorded chat completion responses, the original chat IDs cause conflicts due to SQLite unique constraints. Generate new UUIDs for both ChatCompletion and ChatCompletionChunk objects to ensure each replayed response has a unique identifier. This fixes test failures when running integration tests in replay mode with recorded chat completion responses.
o Add patching for OpenAI AsyncModels.list method to inference recorder
o Create AsyncIterableModelsWrapper that supports both usage patterns:
  * Direct async iteration: async for m in client.models.list()
  * Await then iterate: res = await client.models.list(); async for m in res
o Update streaming detection to handle AsyncPage objects from models.list
o Preserve all existing recording/replay functionality for other endpoints
Signed-off-by: Derek Higgins <[email protected]>
Add vLLM provider support to integration test CI workflows alongside existing Ollama support. Configure provider-specific test execution where vLLM runs only inference-specific tests (excluding vision tests) while Ollama continues to run the full test suite. This enables comprehensive CI testing of both inference providers but keeps the vLLM footprint small; this can be expanded later if it proves not to be too disruptive.
Signed-off-by: Derek Higgins <[email protected]>
Force-pushed from 21f2737 to 0f0e9ca
@ashwinb With this we'll need to run the record tests for 2 providers, but they can't be run in parallel; it works if you run them sequentially. @ashwin, to avoid conflicts, what would you think about removing the index.sqlite file altogether?
o Introduce vLLM provider support to the record/replay testing framework
o Enable both recording and replay of vLLM API interactions alongside existing Ollama support.
The changes enable testing of vLLM functionality. vLLM tests focus on
inference capabilities, while Ollama continues to exercise the full API surface
including vision features.
Related: #2888