Difficulty understanding sequence length and context length #4621
Closed
NicolasDrapier asked in Q&A
Replies: 1 comment · 1 reply
-
EDIT: I think it's the interface (https://github.com/mckaywrigley/chatbot-ui) I'm using that isn't sending the tokens correctly. I'll have a look on their GitHub.
EDIT 2: Never mind, I did not use the slider.
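The fix above is on the client side: how much context is sent and how many tokens may be generated is decided by the request, not by the vLLM server. A minimal sketch that makes those request-side limits explicit by calling the OpenAI-compatible endpoint directly (the base URL, API key, and model name below are assumptions, not values from this thread):

```python
# Sketch only: talk to a locally running vLLM OpenAI-compatible server and set
# the request-side limits explicitly instead of relying on a UI slider.
from openai import OpenAI

# Assumed local server address and placeholder API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stand-in for a multi-thousand-token prompt.
long_prompt = "Summarize the following document. " + ("lorem ipsum " * 2000)

response = client.chat.completions.create(
    model="microsoft/Phi-3-mini-128k-instruct",  # assumed model name
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=512,  # how many tokens the server may generate for this request
)
print(response.choices[0].message.content)
```

If a UI silently caps `max_tokens` or truncates the conversation it sends, calling the endpoint directly like this is a quick way to confirm that the server itself handles long contexts.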
-
Hello,

Since this morning I've been trying to play with the Phi3-mini-128k model, which in theory should give a context length of about 128k tokens. vLLM picks the sequence length up correctly from the config.json, as shown below, with the parameter `max_seq_len=131072`. However, it turns out that the model responds with `<s>` when the sequence is 4k tokens, which is annoying. This token corresponds to the BOS_TOKEN. I've tried increasing the size of the sequences captured by the CUDA graphs and tried increasing `max-num-batched-tokens` to 131072, but nothing helps. I don't quite understand how to set my parameters to achieve this sequence length. I'm using the Docker image vllm-openai:v0.4.2, and here's my command:
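A minimal sketch of the engine parameters in question, expressed through vLLM's offline Python API rather than the Docker command line (the model name, memory setting, and prompt are assumptions, not taken from the original command):

```python
# Sketch only: mirrors the engine parameters discussed above using vLLM's
# offline LLM API; the concrete values are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",  # assumed checkpoint
    trust_remote_code=True,                      # Phi-3 configs may require this
    max_model_len=131072,                        # the 128k context read from config.json
    max_num_batched_tokens=131072,               # matches the value tried in the question
    gpu_memory_utilization=0.95,                 # assumed; a long context needs a large KV cache
)

sampling = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["A prompt several thousand tokens long..."], sampling)
print(outputs[0].outputs[0].text)
```

With the OpenAI server image, the same knobs are exposed as command-line flags such as `--max-model-len` and `--max-num-batched-tokens`.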
One response:
So what's the way to get long prompts with vLLM?