ci: Add vLLM support to integration testing infrastructure #3128

Open. Wants to merge 4 commits into main.
7 changes: 4 additions & 3 deletions .github/actions/run-and-record-tests/action.yml
@@ -58,9 +58,9 @@ runs:
      git add tests/integration/recordings/

      if [ "${{ inputs.run-vision-tests }}" == "true" ]; then
-       git commit -m "Recordings update from CI (vision)"
+       git commit -m "Recordings update from CI (vision) (${{ inputs.provider }})"
      else
-       git commit -m "Recordings update from CI"
+       git commit -m "Recordings update from CI (${{ inputs.provider }})"
      fi

      git fetch origin ${{ github.ref_name }}
@@ -76,7 +76,8 @@ runs:
    if: ${{ always() }}
    shell: bash
    run: |
-     sudo docker logs ollama > ollama-${{ inputs.inference-mode }}.log || true
+     sudo docker logs ollama > ollama-${{ inputs.inference-mode }}.log 2>&1 || true
+     sudo docker logs vllm > vllm-${{ inputs.inference-mode }}.log 2>&1 || true

  - name: Upload logs
    if: ${{ always() }}
7 changes: 4 additions & 3 deletions .github/workflows/integration-tests.yml
@@ -21,7 +21,6 @@ on:
  schedule:
    # If changing the cron schedule, update the provider in the test-matrix job
    - cron: '0 0 * * *' # (test latest client) Daily at 12 AM UTC
-   - cron: '1 0 * * 0' # (test vllm) Weekly on Sunday at 1 AM UTC
  workflow_dispatch:
    inputs:
      test-all-client-versions:
@@ -47,7 +46,6 @@ concurrency:
  cancel-in-progress: true

jobs:

  run-replay-mode-tests:
    runs-on: ubuntu-latest
    name: ${{ format('Integration Tests ({0}, {1}, {2}, client={3}, vision={4})', matrix.client-type, matrix.provider, matrix.python-version, matrix.client-version, matrix.run-vision-tests) }}
@@ -57,11 +55,14 @@ jobs:
      matrix:
        client-type: [library, server]
        # Use vllm on weekly schedule, otherwise use test-provider input (defaults to ollama)
-       provider: ${{ (github.event.schedule == '1 0 * * 0') && fromJSON('["vllm"]') || fromJSON(format('["{0}"]', github.event.inputs.test-provider || 'ollama')) }}
+       provider: [ollama, vllm]
        # Use Python 3.13 only on nightly schedule (daily latest client test), otherwise use 3.12
        python-version: ${{ github.event.schedule == '0 0 * * *' && fromJSON('["3.12", "3.13"]') || fromJSON('["3.12"]') }}
        client-version: ${{ (github.event.schedule == '0 0 * * *' || github.event.inputs.test-all-client-versions == 'true') && fromJSON('["published", "latest"]') || fromJSON('["latest"]') }}
        run-vision-tests: [true, false]
+       exclude:
+         - provider: vllm
+           run-vision-tests: true

    steps:
      - name: Checkout repository
59 changes: 57 additions & 2 deletions llama_stack/testing/inference_recorder.py
@@ -10,12 +10,16 @@
import json
import os
import sqlite3
+import uuid
from collections.abc import Generator
from contextlib import contextmanager
from enum import StrEnum
from pathlib import Path
from typing import Any, Literal, cast

+from openai.pagination import AsyncPage
+from openai.types.chat import ChatCompletion, ChatCompletionChunk

from llama_stack.log import get_logger

logger = get_logger(__name__, category="testing")
@@ -248,6 +252,20 @@ async def _patched_inference_method(original_method, self, client_type, endpoint
        recording = _current_storage.find_recording(request_hash)
        if recording:
            response_body = recording["response"]["body"]
+           if (
+               isinstance(response_body, list)
+               and len(response_body) > 0
+               and isinstance(response_body[0], ChatCompletionChunk)
+           ):
+               # We can't replay chatcompletions with the same id and we store them in a sqlite database with a unique constraint on the id.

Contributor: Can you explain this situation in more detail? Does this happen because we have both ollama and vllm in the same DB, or for some other reason?

Contributor (author): No, we have some tests in the same test run using identical inference requests (both using vllm). When this happens, they use the same recorded request and get the same recorded chat-id.

e.g. in the two variants of

tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[openai_client-txt=vllm/meta-llama/Llama-3.2-1B-Instruct-inference:chat_completion:non_streaming_01] PASSED
tests/integration/inference/test_openai_completion.py::test_openai_chat_completion_non_streaming[client_with_models-txt=vllm/meta-llama/Llama-3.2-1B-Instruct-inference:chat_completion:non_streaming_01] FAILED                      

The second one fails because the vllm provider stored the first response in the DB with that ID; for the second, the server errors with something like:

INFO     2025-08-13 15:41:00,486 console_span_processor:62 telemetry:  14:41:00.412 [ERROR] Error executing endpoint                                  
         route='/v1/openai/v1/chat/completions' method='post': (sqlite3.IntegrityError) UNIQUE constraint failed: chat_completions.id                 
         [SQL: INSERT INTO chat_completions (id, created, model, choices, input_messages, access_attributes, owner_principal) VALUES (?, ?, ?, ?, ?,  
         ?, ?)]                                                                                                                                       
         [parameters: ('chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68', 1755095964, 'meta-llama/Llama-3.2-1B-Instruct', '[{"finish_reason": "stop",       
         "index": 0, "logprobs": null, "message": {"content": "Humans do not live on any planet. Humans live on Earth, which is the ... (223          
         characters truncated) ... "role": "assistant", "annotations": null, "audio": null, "function_call": null, "tool_calls": null,                
         "reasoning_content": null}, "stop_reason": null}]', '[{"role": "user", "content": "Which planet do humans live on?", "name": null}]', 'null',
         None)]                                                                                                                                       
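
The failure mode above can be reproduced in isolation with nothing but sqlite3 and a table that, like the inference store, has a unique primary key on the completion id (the table here is a simplified stand-in, not the real schema):

```python
# Two replays of the same recording try to persist a response with the same id.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chat_completions (id TEXT PRIMARY KEY, model TEXT)")

recorded_id = "chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68"  # id baked into the recording
row = (recorded_id, "meta-llama/Llama-3.2-1B-Instruct")

# First test to replay the recording: the insert succeeds.
conn.execute("INSERT INTO chat_completions (id, model) VALUES (?, ?)", row)

# Second test replaying the same recording fails with
# "UNIQUE constraint failed: chat_completions.id".
try:
    conn.execute("INSERT INTO chat_completions (id, model) VALUES (?, ?)", row)
except sqlite3.IntegrityError as e:
    print(e)
```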

Contributor: @derekhiggins a bit confused -- this code is being changed during replay time, not recording time. The duplicate error would have happened during recording time, right? How does this fix prevent that?

Contributor (author): During replay, on the first test the recorded data is retrieved with an ID and inserted into the chat_completions table here (I think):
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/core/routers/inference.py#L530

Then the second test comes along and tries to do the same with its response (which is the same recorded data) and fails, because the id is a primary key and should be unique:
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/utils/inference/inference_store.py#L37

Looks like we don't hit this problem with ollama because the id is (re)created by the provider:
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/remote/inference/ollama/ollama.py#L605

Contributor: I too have been seeing some weirdness with replay when it comes to chat completion ids, timestamps, etc. I wonder if something in the logic is slightly off here.

Contributor: I think we should simply make inference_store robust to collisions with an ON DUPLICATE IGNORE kind of clause.

Contributor (author): I'm concerned this would result in LLS silently ignoring problems with the upstream API; wouldn't it be better to refuse to deal with duplicate ids?

If doing this (i.e. ON DUPLICATE IGNORE), I don't see where it would be passed into the sqlstore; I guess it's a new param that would need to be added to the insert API?
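
For reference, this is roughly what duplicate-tolerant insert semantics look like at the raw SQLite level (simplified stand-in table again, not the project's sqlstore API; the sqlstore appears to go through SQLAlchemy, whose SQLite and Postgres dialects expose a similar on_conflict_do_nothing() construct):

```python
# Duplicate-tolerant inserts in SQLite; table and columns are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chat_completions (id TEXT PRIMARY KEY, model TEXT)")
row = ("chatcmpl-1fda46f3388646e9a3bb7b079f8a8b68", "meta-llama/Llama-3.2-1B-Instruct")

conn.execute("INSERT INTO chat_completions (id, model) VALUES (?, ?)", row)

# A second insert with the same id is silently skipped instead of raising IntegrityError.
conn.execute("INSERT OR IGNORE INTO chat_completions (id, model) VALUES (?, ?)", row)

# Equivalent upsert-style spelling (SQLite >= 3.24):
conn.execute("INSERT INTO chat_completions (id, model) VALUES (?, ?) ON CONFLICT(id) DO NOTHING", row)

assert conn.execute("SELECT COUNT(*) FROM chat_completions").fetchone()[0] == 1
```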

Contributor: Maybe Stack should generate an ID for our own purposes (not use the ID we get from inference providers)?

Contributor: Yeah that's a reasonable solution @ehhuang

Contributor (author): This seems like a good change. I've so far avoided any changes to the implementation of the provider to accommodate record/replay. Can I tackle this as a separate piece of work (update the provider and remove this workaround), which I'll take on ASAP? If so, I can open a new issue.

+               # So we generate a new id and replace the old one.
+               newid = uuid.uuid4().hex
+               response_body[0].id = "chatcmpl-" + newid
+           elif isinstance(response_body, ChatCompletion):
+               # We can't replay chatcompletions with the same id and we store them in a sqlite database with a unique constraint on the id.
+               # So we generate a new id and replace the old one.
+               newid = uuid.uuid4().hex
+               response_body.id = "chatcmpl-" + newid

            if recording["response"].get("is_streaming", False):

@@ -279,7 +297,8 @@ async def replay_stream():
        }

        # Determine if this is a streaming request based on request parameters
-       is_streaming = body.get("stream", False)
+       # or if the response is an AsyncPage (like models.list returns)
+       is_streaming = body.get("stream", False) or isinstance(response, AsyncPage)

        if is_streaming:
            # For streaming responses, we need to collect all chunks immediately before yielding
@@ -315,9 +334,11 @@ def patch_inference_clients():
    from openai.resources.chat.completions import AsyncCompletions as AsyncChatCompletions
    from openai.resources.completions import AsyncCompletions
    from openai.resources.embeddings import AsyncEmbeddings
+   from openai.resources.models import AsyncModels

    # Store original methods for both OpenAI and Ollama clients
    _original_methods = {
+       "models_list": AsyncModels.list,
        "chat_completions_create": AsyncChatCompletions.create,
        "completions_create": AsyncCompletions.create,
        "embeddings_create": AsyncEmbeddings.create,
@@ -329,7 +350,38 @@ def patch_inference_clients():
"ollama_list": OllamaAsyncClient.list,
}

    # Create patched methods for OpenAI client
+   # Special handling for models.list which needs to return something directly async-iterable
+   # Direct iteration: async for m in client.models.list()
+   # Await then iterate: res = await client.models.list(); async for m in res
+   def patched_models_list(self, *args, **kwargs):
+       class AsyncIterableModelsWrapper:
+           def __init__(self, original_method, client_self, args, kwargs):
+               self.original_method = original_method
+               self.client_self = client_self
+               self.args = args
+               self.kwargs = kwargs
+               self._result = None
+
+           def __aiter__(self):
+               return self._async_iter()
+
+           async def _async_iter(self):
+               # Get the result from the patched method
+               result = await _patched_inference_method(
+                   self.original_method, self.client_self, "openai", "/v1/models", *self.args, **self.kwargs
+               )
+               async for item in result:
+                   yield item
+
+           def __await__(self):
+               # When awaited, return self (since we're already async-iterable)
+               async def _return_self():
+                   return self
+
+               return _return_self().__await__()
+
+       return AsyncIterableModelsWrapper(_original_methods["models_list"], self, args, kwargs)
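
To make the dual calling convention concrete, here is a minimal standalone sketch of the same awaitable-and-async-iterable pattern, detached from the OpenAI client (names and data are illustrative):

```python
# A wrapper that supports both `async for x in wrapper` and `await wrapper`.
import asyncio


class AwaitableAsyncIterable:
    def __init__(self, items):
        self._items = items

    def __aiter__(self):
        # Each iteration gets a fresh async generator over the items.
        return self._gen()

    async def _gen(self):
        for item in self._items:
            yield item

    def __await__(self):
        # Awaiting the wrapper just returns the wrapper itself.
        async def _return_self():
            return self

        return _return_self().__await__()


async def main():
    # Pattern 1: direct iteration, like `async for m in client.models.list()`
    async for m in AwaitableAsyncIterable(["model-a", "model-b"]):
        print(m)

    # Pattern 2: await first, then iterate, like `res = await client.models.list()`
    res = await AwaitableAsyncIterable(["model-a", "model-b"])
    async for m in res:
        print(m)


asyncio.run(main())
```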

    async def patched_chat_completions_create(self, *args, **kwargs):
        return await _patched_inference_method(
            _original_methods["chat_completions_create"], self, "openai", "/v1/chat/completions", *args, **kwargs
@@ -346,6 +398,7 @@ async def patched_embeddings_create(self, *args, **kwargs):
        )

    # Apply OpenAI patches
+   AsyncModels.list = patched_models_list
    AsyncChatCompletions.create = patched_chat_completions_create
    AsyncCompletions.create = patched_completions_create
    AsyncEmbeddings.create = patched_embeddings_create
@@ -402,8 +455,10 @@ def unpatch_inference_clients():
    from openai.resources.chat.completions import AsyncCompletions as AsyncChatCompletions
    from openai.resources.completions import AsyncCompletions
    from openai.resources.embeddings import AsyncEmbeddings
+   from openai.resources.models import AsyncModels

    # Restore OpenAI client methods
+   AsyncModels.list = _original_methods["models_list"]
    AsyncChatCompletions.create = _original_methods["chat_completions_create"]
    AsyncCompletions.create = _original_methods["completions_create"]
    AsyncEmbeddings.create = _original_methods["embeddings_create"]
4 changes: 2 additions & 2 deletions scripts/integration-tests.sh
@@ -193,7 +193,7 @@ EXCLUDE_TESTS="builtin_tool or safety_with_image or code_interpreter or test_rag

# Additional exclusions for vllm provider
if [[ "$PROVIDER" == "vllm" ]]; then
-   EXCLUDE_TESTS="${EXCLUDE_TESTS} or test_inference_store_tool_calls"
+   EXCLUDE_TESTS="${EXCLUDE_TESTS} or test_inference_store_tool_calls or test_text_chat_completion_structured_output"
fi

PYTEST_PATTERN="not( $EXCLUDE_TESTS )"
@@ -240,7 +240,7 @@ TEST_FILES=""
for test_subdir in $(echo "$TEST_SUBDIRS" | tr ',' '\n'); do
    # Skip certain test types for vllm provider
    if [[ "$PROVIDER" == "vllm" ]]; then
-       if [[ "$test_subdir" == "safety" ]] || [[ "$test_subdir" == "post_training" ]] || [[ "$test_subdir" == "tool_runtime" ]]; then
+       if [[ "$test_subdir" == "safety" ]] || [[ "$test_subdir" == "post_training" ]] || [[ "$test_subdir" == "tool_runtime" ]] || [[ "$test_subdir" == "agents" ]]; then
            echo "Skipping $test_subdir for vllm provider"
            continue
        fi
Binary file modified tests/integration/recordings/index.sqlite