
ci: Add vLLM support to integration testing infrastructure #3128


Open
wants to merge 4 commits into main from vllm-ci-2

Conversation

derekhiggins
Contributor

o Introduces vLLM provider support to the record/replay testing framework
o Enables both recording and replay of vLLM API interactions alongside existing Ollama support.

The changes enable testing of vLLM functionality. vLLM tests focus on
inference capabilities, while Ollama continues to exercise the full API surface
including vision features.

Related: #2888

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 13, 2025
and len(response_body) > 0
and isinstance(response_body[0], ChatCompletionChunk)
):
# We can't replay chat completions with the same id because they are stored in a sqlite database with a unique constraint on the id.
Contributor

can you explain this situation in more detail? does this happen because we have both ollama and vllm in the same DB or some other reason?

Contributor Author

No, we have some tests in the same test run using identical inference requests (both using vllm).
When this happens they both use the same recorded request and get the same recorded chat-id,

e.g. in the two variants of

tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=vllm/meta-llama/Llama-3.2-1B-Instruct-inference:chat_completion:non_streaming_01] PASSED
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=vllm/meta-llama/Llama-3.2-1B-Instruct-inference:chat_completion:non_streaming_01] FAILED                      

The second one fails because the vllm provider stored the first response in the DB with its ID; for the second, the server errors with something like

INFO     2025-08-13 15:41:00,486 console_span_processor:62 telemetry:  14:41:00.412 [ERROR] Error executing endpoint                                  
         route='/v1/openai/v1/chat/completions' method='post': (sqlite3.IntegrityError) UNIQUE constraint failed: chat_completions.id                 
         [SQL: INSERT INTO chat_completions (id, created, model, choices, input_messages, access_attributes, owner_principal) VALUES (?, ?, ?, ?, ?,  
         ?, ?)]                                                                                                                                       
         [parameters: ('chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68', 1755095964, 'meta-llama/Llama-3.2-1B-Instruct', '[{"finish_reason": "stop",       
         "index": 0, "logprobs": null, "message": {"content": "Humans do not live on any planet. Humans live on Earth, which is the ... (223          
         characters truncated) ... "role": "assistant", "annotations": null, "audio": null, "function_call": null, "tool_calls": null,                
         "reasoning_content": null}, "stop_reason": null}]', '[{"role": "user", "content": "Which planet do humans live on?", "name": null}]', 'null',
         None)]                                                                                                                                       

Contributor

@derekhiggins a bit confused -- this code is being changed during replay time, not recording time. The duplicate error would have happened during recording time right? How does this fix prevent that?

Contributor Author

During replay, for the first test the recorded data is retrieved with an ID and inserted into the chat_completions table here (I think)
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/core/routers/inference.py#L530

Then the second test comes along and tries to do the same with its response (which is the same recorded data) and fails, because the response has the same ID as the previous test
and the id is a primary key that should be unique
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/utils/inference/inference_store.py#L37

Looks like we don't hit this problem in ollama because the id is (re)created by the provider
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/remote/inference/ollama/ollama.py#L605
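
For reference, a minimal standalone sketch of the failure; the table name, id value, and timestamp are taken from the error log above, and all other columns are omitted:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chat_completions (id TEXT PRIMARY KEY, created INTEGER)")

# First test replays the recording and stores the recorded response id.
conn.execute(
    "INSERT INTO chat_completions (id, created) VALUES (?, ?)",
    ("chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68", 1755095964),
)

# Second test replays the same recording, so the same id is inserted again and
# sqlite3.IntegrityError (UNIQUE constraint failed: chat_completions.id) is raised.
conn.execute(
    "INSERT INTO chat_completions (id, created) VALUES (?, ?)",
    ("chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68", 1755095964),
)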

Contributor

I too have been seeing some weirdness with replay when it comes to chat completion ids, timestamps, etc. I wonder if something in the logic is slightly off here.

Contributor

I think we should simply make inference_store robust to collisions with a ON DUPLICATE IGNORE kind of clause.
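
At the SQL level that could look like the following (a sketch assuming the SQLite backend; how it would be plumbed through the sqlstore API is a separate question):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chat_completions (id TEXT PRIMARY KEY, created INTEGER)")
conn.execute("INSERT INTO chat_completions (id, created) VALUES (?, ?)", ("chatcmpl-abc", 1))

# The duplicate insert is silently skipped instead of raising IntegrityError.
conn.execute(
    "INSERT INTO chat_completions (id, created) VALUES (?, ?) ON CONFLICT(id) DO NOTHING",
    ("chatcmpl-abc", 2),
)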

Contributor Author

I'm concerned this would result in LLS silently ignoring problems with the upstream API; wouldn't it be better to refuse to deal with duplicate IDs?

If doing this (i.e. ON DUPLICATE IGNORE), I don't see where it would be passed into the sqlstore. I guess it's a new param that would need to be added to the insert API?

Contributor

Maybe Stack should generate an ID for our own purposes (not use the ID we get from inference providers)?

Contributor

Yeah that's a reasonable solution @ehhuang

Contributor Author

This seems like a good change. I've so far avoided any changes to the implementation of the provider to accommodate record/replay. Can I tackle this as a separate piece of work (update the provider and remove this workaround), which I'll take on ASAP? If so I can open a new issue.

def patched_model_list(self, *args, **kwargs):
# The original models.list() returns an AsyncPaginator that can be used with async for
# We need to create a wrapper that preserves this behavior
class PatchedAsyncPaginator:
Contributor

is this still needed now that you are storing the chunks themselves? whichever way we replay back a streaming chat completion, that same way should work for this also?

Contributor Author

Yes, this or something similar is needed; we need an object with __aiter__

ERROR    2025-08-18 10:52:35,247 llama_stack.core.routing_tables.models:36 core: Model refresh failed for provider vllm: 'async for' requires an      
         object with __aiter__ method, got coroutine                                                                                                  

self.client.models.list() needs to return something directly async-iterable so it can be used with async for model in client.models.list() (it needs an object with __aiter__);
we also need the object to support await client.models.list() (it needs __await__)

I have tried multiple things with the _patched_inference_method but I haven't managed to re-use it

Best I can do (at least that I can see) is the most recent version which is a bit simpler
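
For illustration, a minimal sketch of a wrapper that satisfies both requirements (names are illustrative; the actual implementation in the PR may differ):

class AsyncIterableModelsWrapper:
    # Wraps recorded model objects so that both usage patterns work:
    #   async for m in client.models.list()
    #   models = await client.models.list(); async for m in models
    def __init__(self, models):
        self._models = models

    def __aiter__(self):
        return self._iterate()

    async def _iterate(self):
        for m in self._models:
            yield m

    def __await__(self):
        # Awaiting returns the wrapper itself, which can then be iterated
        # with `async for` because it defines __aiter__.
        async def _return_self():
            return self

        return _return_self().__await__()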

derekhiggins force-pushed the vllm-ci-2 branch 3 times, most recently from 19d1f6a to 338848c on August 20, 2025 at 14:47
When replaying recorded chat completion responses, the original chat IDs
cause conflicts due to SQLite unique constraints. Generate new UUIDs for
both ChatCompletion and ChatCompletionChunk objects to ensure each
replayed response has a unique identifier.

This fixes test failures when running integration tests in replay mode
with recorded chat completion responses.
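
A sketch of that id regeneration, assuming the recorder deserializes recordings into the OpenAI client types (the helper name and exact call site are illustrative):

import uuid

from openai.types.chat import ChatCompletion, ChatCompletionChunk


def _with_fresh_id(response_body):
    # Give each replayed response a new id so the inference store's unique
    # constraint on chat_completions.id is not violated across tests. All
    # chunks of a streamed response share the same regenerated id.
    new_id = f"chatcmpl-{uuid.uuid4().hex}"
    if isinstance(response_body, ChatCompletion):
        return response_body.model_copy(update={"id": new_id})
    if (
        isinstance(response_body, list)
        and len(response_body) > 0
        and isinstance(response_body[0], ChatCompletionChunk)
    ):
        return [chunk.model_copy(update={"id": new_id}) for chunk in response_body]
    return response_body
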
o Add patching for OpenAI AsyncModels.list method to inference recorder
o Create AsyncIterableModelsWrapper that supports both usage patterns:
  * Direct async iteration: async for m in client.models.list()
  * Await then iterate: res = await client.models.list(); async for m in res
o Update streaming detection to handle AsyncPage objects from models.list
o Preserve all existing recording/replay functionality for other endpoints

Signed-off-by: Derek Higgins <[email protected]>
Add vLLM provider support to integration test CI workflows alongside
existing Ollama support. Configure provider-specific test execution
where vLLM runs only inference-specific tests (excluding vision tests) while
Ollama continues to run the full test suite.

This enables comprehensive CI testing of both inference providers but
keeps the vLLM footprint small; this can be expanded later if it proves
not to be too disruptive.

Signed-off-by: Derek Higgins <[email protected]>
@derekhiggins
Contributor Author

derekhiggins commented Aug 20, 2025

@ashwinb With this we'll need to run the record tests for 2 providers, but they can't be run in parallel because

CONFLICT (content): Merge conflict in tests/integration/recordings/index.sqlite

it works if you run them sequentially

@ashwinb to avoid conflicts, what would you think about removing the index.sqlite file altogether?
From what I can see it is only used to get the path of the recording, and we can instead infer this from the request_hash.
There are probably other places this file is going to cause conflicts, so it's probably good to remove it anyway.
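
For example, a hypothetical layout where the recording location is derived purely from the request hash (the directory and file naming here are assumptions, not the current on-disk format):

from pathlib import Path


def recording_path(recordings_dir: str, request_hash: str) -> Path:
    # One file per recorded request, named by its hash, so no shared
    # index.sqlite is needed and parallel recording runs cannot conflict on it.
    return Path(recordings_dir) / "responses" / f"{request_hash}.json"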

Labels
CLA Signed This label is managed by the Meta Open Source bot.

4 participants