ci: Add vLLM support to integration testing infrastructure #3128
Open
derekhiggins wants to merge 4 commits into llamastack:main from derekhiggins:vllm-ci-2
Commits (4)
ab59175 test: generate unique chat completion IDs for replayed responses (derekhiggins)
84c6210 test: add models.list() recording/replay support (derekhiggins)
0f0e9ca ci: integrate vLLM inference tests with GitHub Actions workflows (derekhiggins)
c109f74 Recordings update from CI (vllm) (github-actions[bot])
Conversations
Can you explain this situation in more detail? Does this happen because we have both ollama and vllm in the same DB, or for some other reason?
No, we have some tests in the same test run that use identical inference requests (both using vllm). When this happens they use the same recorded request and get the same recorded chat-id, e.g. in two variants of the same test. The second one fails because the vllm provider already stored the first response in the DB with that ID, so for the second one the server errors with something like a duplicate primary-key error.
@derekhiggins a bit confused -- this code is being changed during replay time, not recording time. The duplicate error would have happened during recording time, right? How does this fix prevent that?
During replay, for the first test the recorded data is retrieved with an ID and inserted into the chat_completions table here (I think):
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/core/routers/inference.py#L530
Then the second test comes along and tries to do the same with its response (which is the same recorded data) and fails, because the response has the same ID as the previous test and the id is a primary key that should be unique:
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/utils/inference/inference_store.py#L37
It looks like we don't hit this problem with ollama because the id is (re)created by the provider:
https://github.com/meta-llama/llama-stack/blob/0cbd93c5cc44b790c5b08a2f827944c9ac3223d7/llama_stack/providers/remote/inference/ollama/ollama.py#L605
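A minimal sketch of the workaround this PR takes (commit ab59175, "generate unique chat completion IDs for replayed responses"), with hypothetical helper names; the real change lives in the test record/replay layer:

```python
# Hypothetical sketch, not the actual replay code: give each replayed chat
# completion a fresh ID so two tests replaying the same recording no longer
# collide on the chat_completions primary key.
import uuid


def unique_replay_id(recorded_id: str) -> str:
    """Derive a new, unique ID from the recorded one, keeping a recognizable prefix."""
    return f"{recorded_id}-{uuid.uuid4().hex[:8]}"


def patch_replayed_response(response: dict) -> dict:
    """Return a copy of a recorded (OpenAI-style) response body with a unique ID."""
    patched = dict(response)
    if "id" in patched:
        patched["id"] = unique_replay_id(patched["id"])
    return patched
```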
I too have been seeing some weirdness with replay when it comes to chat completion ids, timestamps, etc. I wonder if something in the logic is slightly off here.
I think we should simply make inference_store robust to collisions with an `ON DUPLICATE IGNORE` kind of clause.
I'm concerned this would result in LLS silently ignoring problems with the upstream API; wouldn't it be better to refuse to deal with duplicate IDs?
If doing this (i.e. `ON DUPLICATE IGNORE`), I don't see where it would be passed into the sqlstore; I guess it's a new param that would need to be added to the insert API?
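For illustration only (this is not the project's sqlstore API), the kind of "ignore duplicate keys on insert" clause being discussed looks like this with SQLAlchemy's SQLite dialect; wiring it in would presumably mean a new parameter on the sqlstore insert API, as noted above:

```python
# Hypothetical illustration of an "insert, ignore duplicate primary keys"
# statement using SQLAlchemy's SQLite dialect (PostgreSQL has an equivalent
# on_conflict_do_nothing()). Table and column names are made up for the example.
from sqlalchemy import Column, MetaData, String, Table, create_engine
from sqlalchemy.dialects.sqlite import insert

metadata = MetaData()
chat_completions = Table(
    "chat_completions",
    metadata,
    Column("id", String, primary_key=True),
    Column("response", String),
)

engine = create_engine("sqlite:///:memory:")
metadata.create_all(engine)

stmt = insert(chat_completions).values(id="chatcmpl-123", response="{}")
with engine.begin() as conn:
    # The first insert succeeds; the second is silently skipped instead of
    # raising an IntegrityError on the duplicate primary key.
    conn.execute(stmt.on_conflict_do_nothing(index_elements=["id"]))
    conn.execute(stmt.on_conflict_do_nothing(index_elements=["id"]))
```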
Maybe Stack should generate an ID for our own purposes (not use the ID we get from inference providers)?
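A rough sketch of that alternative, with an invented store interface purely for illustration: the stack mints its own key when persisting a chat completion and keeps the provider-supplied ID only as metadata.

```python
# Hypothetical sketch of the proposal above; `store.insert` and the column
# names are invented for illustration, not the real inference_store code.
import uuid


def stack_generated_id() -> str:
    """Mint a stack-owned chat completion ID, independent of the provider."""
    return f"chatcmpl-{uuid.uuid4().hex}"


def store_chat_completion(store, provider_response: dict) -> str:
    """Persist the response under a stack-generated primary key, keeping the
    provider's original ID only as metadata, so duplicate provider IDs can
    never violate the primary-key constraint."""
    row_id = stack_generated_id()
    store.insert(
        "chat_completions",
        {
            "id": row_id,
            "provider_completion_id": provider_response.get("id"),
            "response": provider_response,
        },
    )
    return row_id
```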
Yeah that's a reasonable solution @ehhuang
This seems like a good change. I've so far avoided any changes to the implementation of the provider to accommodate record/replay. Can I tackle this as a separate piece of work (update the provider and remove this workaround), which I'll take on ASAP? If so, I can open a new issue.