ci: Add vLLM support to integration testing infrastructure #3128
Conversation
    and len(response_body) > 0
    and isinstance(response_body[0], ChatCompletionChunk)
):
    # We can't replay chat completions with the same id because we store them in a sqlite database with a unique constraint on the id.
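For context, a minimal sketch of the replay-time ID rewrite being discussed here, assuming the recorded chunks are openai `ChatCompletionChunk` pydantic models; `regenerate_chunk_ids` is an illustrative name, not the PR's actual code:

```python
import uuid

from openai.types.chat import ChatCompletionChunk


def regenerate_chunk_ids(response_body: list[ChatCompletionChunk]) -> list[ChatCompletionChunk]:
    # All chunks of one streamed completion share a single id, so mint one new id
    # and apply it to every chunk of the replayed response.
    new_id = "chatcmpl-" + uuid.uuid4().hex
    return [chunk.model_copy(update={"id": new_id}) for chunk in response_body]
```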
can you explain this situation in more detail? does this happen because we have both ollama and vllm in the same DB or some other reason?
No, we have some tests in the same test run using identical inference requests (both using vllm).
When this happens they use the same recorded request and get back the same recorded chat-id,
e.g. in the two variants of
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=vllm/meta-llama/Llama-3.2-1B-Instruct-inference:chat_completion:non_streaming_01] PASSED
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=vllm/meta-llama/Llama-3.2-1B-Instruct-inference:chat_completion:non_streaming_01] FAILED
The second one fails because the vllm provider already stored the first response in the DB with that ID, so for the second the server errors with something like:
INFO 2025-08-13 15:41:00,486 console_span_processor:62 telemetry: 14:41:00.412 [ERROR] Error executing endpoint
route='/v1/openai/v1/chat/completions' method='post': (sqlite3.IntegrityError) UNIQUE constraint failed: chat_completions.id
[SQL: INSERT INTO chat_completions (id, created, model, choices, input_messages, access_attributes, owner_principal) VALUES (?, ?, ?, ?, ?,
?, ?)]
[parameters: ('chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68', 1755095964, 'meta-llama/Llama-3.2-1B-Instruct', '[{"finish_reason": "stop",
"index": 0, "logprobs": null, "message": {"content": "Humans do not live on any planet. Humans live on Earth, which is the ... (223
characters truncated) ... "role": "assistant", "annotations": null, "audio": null, "function_call": null, "tool_calls": null,
"reasoning_content": null}, "stop_reason": null}]', '[{"role": "user", "content": "Which planet do humans live on?", "name": null}]', 'null',
None)]
@derekhiggins a bit confused -- this code is being changed during replay time, not recording time. The duplicate error would have happened during recording time right? How does this fix prevent that?
During replay, on the first test the recorded data is retrieved with an ID and inserted into the chat_completions table here (I think)
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/core/routers/inference.py#L530
Then the second test comes along and tries to do the same with its response (which is the same recorded data) and fails, because the response has the same ID as the previous test and the id is a primary key that must be unique
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/utils/inference/inference_store.py#L37
It looks like we don't hit this problem with ollama because the id is (re)created by the provider
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/remote/inference/ollama/ollama.py#L605
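To make the failure mode concrete, here is a self-contained sketch of the collision using plain sqlite3 (the schema is trimmed to the columns that matter; the real store goes through the sqlstore layer):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chat_completions (id TEXT PRIMARY KEY, model TEXT)")
row = ("chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68", "meta-llama/Llama-3.2-1B-Instruct")

conn.execute("INSERT INTO chat_completions (id, model) VALUES (?, ?)", row)  # first test: ok
try:
    conn.execute("INSERT INTO chat_completions (id, model) VALUES (?, ?)", row)  # second test: same recorded id
except sqlite3.IntegrityError as e:
    print(e)  # UNIQUE constraint failed: chat_completions.id
```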
I too have been seeing some weirdness with replay when it comes to chat completion ids, timestamps, etc. I wonder if something in the logic is slightly off here.
I think we should simply make inference_store robust to collisions with an ON DUPLICATE IGNORE kind of clause.
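For illustration, SQLite's spelling of that idea is `INSERT OR IGNORE` (or `INSERT ... ON CONFLICT DO NOTHING`); a sketch with plain sqlite3, noting the real change would have to be threaded through the sqlstore insert API rather than raw SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chat_completions (id TEXT PRIMARY KEY, model TEXT)")
row = ("chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68", "meta-llama/Llama-3.2-1B-Instruct")

# The duplicate insert becomes a no-op instead of raising IntegrityError.
conn.execute("INSERT OR IGNORE INTO chat_completions (id, model) VALUES (?, ?)", row)
conn.execute("INSERT OR IGNORE INTO chat_completions (id, model) VALUES (?, ?)", row)
print(conn.execute("SELECT COUNT(*) FROM chat_completions").fetchone()[0])  # -> 1
```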
I'm concerned this would result in LLS silently ignoring problems with the upstream API; wouldn't it be better to refuse to deal with duplicate IDs?
If we do this (i.e. ON DUPLICATE IGNORE), I don't see where it would be passed into the sqlstore. I guess it's a new param that would need to be added to the insert API?
Maybe Stack should generate an ID for our own purposes (not use the ID we get from inference providers)?
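A sketch of that idea; the helper name and where it would hook in are assumptions, the point being that the id Stack persists is minted locally rather than taken from the provider response:

```python
import uuid

from openai.types.chat import ChatCompletion


def with_stack_owned_id(completion: ChatCompletion) -> ChatCompletion:
    # Replace the provider-supplied id before the router hands the response to the
    # inference store, so duplicate provider ids (e.g. from replayed recordings)
    # can never violate the chat_completions primary key.
    return completion.model_copy(update={"id": f"chatcmpl-stack-{uuid.uuid4().hex}"})
```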
Yeah that's a reasonable solution @ehhuang
This seems like a good change. I've so far avoided any changes to the implementation of the provider to accommodate record/replay; can I tackle this as a separate piece of work (update the provider and remove this workaround), which I'll take on ASAP? If so I can open a new issue.
Force-pushed from 1f29aaa to 6a4da14
def patched_model_list(self, *args, **kwargs):
    # The original models.list() returns an AsyncPaginator that can be used with async for
    # We need to create a wrapper that preserves this behavior
    class PatchedAsyncPaginator:
is this still needed now that you are storing the chunks themselves? whichever way we replay back a streaming chat completion, that same way should work for this also?
Yes, this or something similar is needed; we need an object with __aiter__:
ERROR 2025-08-18 10:52:35,247 llama_stack.core.routing_tables.models:36 core: Model refresh failed for provider vllm: 'async for' requires an
object with __aiter__ method, got coroutine
self.client.models.list() needs to return something directly async-iterable so it can be used as async for model in client.models.list()
(it needs an object with __aiter__())
and we also need the object to support await client.models.list()
(it needs __await__())
I have tried multiple things with the _patched_inference_method but I haven't managed to re-use it.
The best I can do (at least that I can see) is the most recent version, which is a bit simpler.
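A minimal standalone sketch of what such a wrapper has to provide (class and variable names here are illustrative, not the PR's exact code):

```python
import asyncio


class AsyncIterableModelsWrapper:
    """Supports both `async for m in wrapper` and `page = await wrapper; async for m in page`."""

    def __init__(self, models: list[dict]):
        self._models = models

    def __aiter__(self):
        # Returning an async generator satisfies "async for m in client.models.list()".
        return self._iter()

    async def _iter(self):
        for model in self._models:
            yield model

    def __await__(self):
        # Awaiting the wrapper resolves to the wrapper itself, so
        # "await client.models.list()" also yields something async-iterable.
        async def _self():
            return self

        return _self().__await__()


async def _demo():
    recorded_models = [{"id": "meta-llama/Llama-3.2-1B-Instruct"}]
    async for m in AsyncIterableModelsWrapper(recorded_models):  # pattern 1
        print(m["id"])
    page = await AsyncIterableModelsWrapper(recorded_models)  # pattern 2
    async for m in page:
        print(m["id"])


if __name__ == "__main__":
    asyncio.run(_demo())
```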
Force-pushed from 19d1f6a to 338848c
When replaying recorded chat completion responses, the original chat IDs cause conflicts due to SQLite unique constraints. Generate new UUIDs for both ChatCompletion and ChatCompletionChunk objects to ensure each replayed response has a unique identifier. This fixes test failures when running integration tests in replay mode with recorded chat completion responses.
o Add patching for OpenAI AsyncModels.list method to inference recorder
o Create AsyncIterableModelsWrapper that supports both usage patterns:
  * Direct async iteration: async for m in client.models.list()
  * Await then iterate: res = await client.models.list(); async for m in res
o Update streaming detection to handle AsyncPage objects from models.list
o Preserve all existing recording/replay functionality for other endpoints
Signed-off-by: Derek Higgins <[email protected]>
Add vLLM provider support to integration test CI workflows alongside existing Ollama support. Configure provider-specific test execution where vLLM runs only inference-specific tests (excluding vision tests) while Ollama continues to run the full test suite. This enables comprehensive CI testing of both inference providers but keeps the vLLM footprint small; this can be expanded later if it proves not to be too disruptive.
Signed-off-by: Derek Higgins <[email protected]>
Force-pushed from 21f2737 to 0f0e9ca
@ashwinb With this we'll need to run the record tests for 2 providers, but they can't be run in parallel; it works if you run them sequentially. @ashwin, to avoid conflicts, what would you think about removing the index.sqlite file altogether?
o Introduce vLLM provider support to the record/replay testing framework
o Enable both recording and replay of vLLM API interactions alongside existing Ollama support.
The changes enable testing of vLLM functionality. vLLM tests focus on
inference capabilities, while Ollama continues to exercise the full API surface
including vision features.
Related: #2888