Difficulty understanding sequence length and context length #4621
Closed
NicolasDrapier asked in Q&A
Replies: 1 comment · 1 reply
-
EDIT: I think it's the interface (https://github.com/mckaywrigley/chatbot-ui) I'm using that isn't sending the tokens correctly. I'll have a look on their GitHub.
EDIT 2: Never mind, I did not use the slider.
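The fix above is on the client side: how much context is sent and how many tokens may be generated is decided by the request, not by the vLLM server. A minimal sketch that makes those request-side limits explicit by calling the OpenAI-compatible endpoint directly (the base URL, API key, and model name below are assumptions, not values from this thread):

```python
# Sketch only: talk to a locally running vLLM OpenAI-compatible server and set
# the request-side limits explicitly instead of relying on a UI slider.
from openai import OpenAI

# Assumed local server address and placeholder API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Stand-in for a multi-thousand-token prompt.
long_prompt = "Summarize the following document. " + ("lorem ipsum " * 2000)

response = client.chat.completions.create(
    model="microsoft/Phi-3-mini-128k-instruct",  # assumed model name
    messages=[{"role": "user", "content": long_prompt}],
    max_tokens=512,  # how many tokens the server may generate for this request
)
print(response.choices[0].message.content)
```

If a UI silently caps `max_tokens` or truncates the conversation it sends, calling the endpoint directly like this is a quick way to confirm that the server itself handles long contexts.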
-
Hello,

Since this morning I've been trying to play with the Phi3-mini-128k model, which in theory should give a context length of about 128k tokens. vLLM picks the sequence length up correctly from the config.json, as shown below, with the parameter `max_seq_len=131072`. However, it turns out that the model responds with `<s>` when the sequence is 4k tokens, which is annoying. This token corresponds to the BOS_TOKEN. I've tried increasing the size of the sequences captured by the CUDA graphs and tried increasing `max-num-batched-tokens` to 131072, but nothing helps. I don't quite understand how to set my parameters to achieve this sequence length. I'm using the Docker image vllm-openai:v0.4.2, and here's my command:
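A minimal sketch of the engine parameters in question, expressed through vLLM's offline Python API rather than the Docker command line (the model name, memory setting, and prompt are assumptions, not taken from the original command):

```python
# Sketch only: mirrors the engine parameters discussed above using vLLM's
# offline LLM API; the concrete values are assumptions for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",  # assumed checkpoint
    trust_remote_code=True,                      # Phi-3 configs may require this
    max_model_len=131072,                        # the 128k context read from config.json
    max_num_batched_tokens=131072,               # matches the value tried in the question
    gpu_memory_utilization=0.95,                 # assumed; a long context needs a large KV cache
)

sampling = SamplingParams(max_tokens=256, temperature=0.7)
outputs = llm.generate(["A prompt several thousand tokens long..."], sampling)
print(outputs[0].outputs[0].text)
```

With the OpenAI server image, the same knobs are exposed as command-line flags such as `--max-model-len` and `--max-num-batched-tokens`.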
One response:
So what's the way to get long prompts with vLLM?