
Commit beb2555

DarkLight1337 authored and hmellor committed
[Deprecation][2/N] Replace --task with --runner and --convert (vllm-project#21470)
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Harry Mellor <[email protected]>
Co-authored-by: Harry Mellor <[email protected]>
Signed-off-by: Diego-Castan <[email protected]>
1 parent c0858f1 commit beb2555

94 files changed: +1111 -1077 lines changed


docs/features/multimodal_inputs.md

Lines changed: 2 additions & 2 deletions

@@ -343,7 +343,7 @@ Here is a simple example using Phi-3.5-Vision.

 First, launch the OpenAI-compatible server:

 ```bash
-vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
+vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
     --trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
 ```

@@ -422,7 +422,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim

 First, launch the OpenAI-compatible server:

 ```bash
-vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192
+vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --runner generate --max-model-len 8192
 ```

 Then, you can use the OpenAI client as follows:
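
(For orientation, not part of this diff: a minimal client-side sketch of the kind of request a server launched with one of the commands above would handle, using the OpenAI-compatible Chat Completions API. The base URL, API key, and image URL below are placeholders.)

```python
# Sketch: query a vLLM server started with `--runner generate` via the
# OpenAI-compatible Chat Completions API. Base URL, API key, and image URL
# are placeholders, not values taken from this commit.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```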

docs/features/prompt_embeds.md

Lines changed: 1 addition & 1 deletion

@@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors.

 First, launch the OpenAI-compatible server:

 ```bash
-vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
+vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
     --max-model-len 4096 --enable-prompt-embeds
 ```
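
(Illustration only, not part of this hunk: a condensed sketch of the client workflow this page documents, assuming the server above is running with `--enable-prompt-embeds`. The local Hugging Face embedding step and the `prompt_embeds` extra-body field follow the surrounding documentation rather than this diff.)

```python
# Sketch: compute prompt embeddings locally, serialize them with torch.save,
# base64-encode, and send them to the completions endpoint via `extra_body`.
import base64
import io

import torch
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
hf_model = AutoModelForCausalLM.from_pretrained(model_name)

# Turn a chat prompt into input embeddings using the HF embedding layer.
token_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Please tell me about the capital of France."}],
    add_generation_prompt=True,
    return_tensors="pt",
)
prompt_embeds = hf_model.get_input_embeddings()(token_ids).squeeze(0)

# Serialize the tensor and base64-encode it for transport.
buffer = io.BytesIO()
torch.save(prompt_embeds, buffer)
encoded_embeds = base64.b64encode(buffer.getvalue()).decode("utf-8")

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model=model_name,
    prompt="",  # the OpenAI client requires a prompt; the embeddings carry the content
    max_tokens=32,
    extra_body={"prompt_embeds": encoded_embeds},
)
print(completion.choices[0].text)
```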

docs/models/generative_models.md

Lines changed: 10 additions & 3 deletions

@@ -2,12 +2,19 @@

 vLLM provides first-class support for generative models, which covers most of LLMs.

-In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
+In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
 Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
 which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.

-For generative models, the only supported `--task` option is `"generate"`.
-Usually, this is automatically inferred so you don't have to specify it.
+## Configuration
+
+### Model Runner (`--runner`)
+
+Run a model in generation mode via the option `--runner generate`.
+
+!!! tip
+    There is no need to set this option in the vast majority of cases as vLLM can automatically
+    detect the model runner to use via `--runner auto`.

 ## Offline Inference
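
(Illustration only: an offline sketch of the renamed option. That `runner="generate"` is accepted as an `LLM` keyword argument is an assumption mirroring the `runner="pooling"` examples in the pooling docs below; in practice it can normally be omitted because `--runner auto` detects generative models.)

```python
# Sketch: explicitly selecting the generation runner offline.
# Assumption: runner="generate" is accepted by LLM, mirroring the --runner CLI flag.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct", runner="generate")
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```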

docs/models/pooling_models.md

Lines changed: 49 additions & 28 deletions

@@ -1,28 +1,49 @@
 # Pooling Models

-vLLM also supports pooling models, including embedding, reranking and reward models.
+vLLM also supports pooling models, such as embedding, classification and reward models.

 In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
-These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
+These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
 before returning them.

 !!! note
     We currently support pooling models primarily as a matter of convenience.
     As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
     pooling models as they only work on the generation or decode stage, so performance may not improve as much.

-If the model doesn't implement this interface, you can set `--task` which tells vLLM
-to convert the model into a pooling model.
+## Configuration

-| `--task`   | Model type           | Supported pooling tasks       |
-|------------|----------------------|-------------------------------|
-| `embed`    | Embedding model      | `encode`, `embed`             |
-| `classify` | Classification model | `encode`, `classify`, `score` |
-| `reward`   | Reward model         | `encode`                      |
+### Model Runner

-## Pooling Tasks
+Run a model in pooling mode via the option `--runner pooling`.

-In vLLM, we define the following pooling tasks and corresponding APIs:
+!!! tip
+    There is no need to set this option in the vast majority of cases as vLLM can automatically
+    detect the model runner to use via `--runner auto`.
+
+### Model Conversion
+
+vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
+
+If `--runner pooling` has been set (manually or automatically) but the model does not implement the
+[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
+vLLM will attempt to automatically convert the model according to the architecture names
+shown in the table below.
+
+| Architecture                                     | `--convert` | Supported pooling tasks       |
+|--------------------------------------------------|-------------|-------------------------------|
+| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`  | `embed`     | `encode`, `embed`             |
+| `*For*Classification`, `*ClassificationModel`    | `classify`  | `encode`, `classify`, `score` |
+| `*ForRewardModeling`, `*RewardModel`             | `reward`    | `encode`                      |
+
+!!! tip
+    You can explicitly set `--convert <type>` to specify how to convert the model.
+
+### Pooling Tasks
+
+Each pooling model in vLLM supports one or more of these tasks according to
+[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
+enabling the corresponding APIs:

 | Task       | APIs               |
 |------------|--------------------|
@@ -31,32 +52,32 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
 | `classify` | `classify`         |
 | `score`    | `score`            |

-\*The `score` API falls back to `embed` task if the model does not support `score` task.
+\* The `score` API falls back to `embed` task if the model does not support `score` task.

-Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
+### Pooler Configuration

-By default, the pooler assigned to each task has the following attributes:
+#### Predefined models
+
+If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
+you can override some of its attributes via the `--override-pooler-config` option.
+
+#### Converted models
+
+If the model has been converted via `--convert` (see above),
+the pooler assigned to each task has the following attributes by default:

 | Task       | Pooling Type | Normalization | Softmax |
 |------------|--------------|---------------|---------|
 | `encode`   | `ALL`        | ❌            | ❌      |
 | `embed`    | `LAST`       | ✅︎            | ❌      |
 | `classify` | `LAST`       | ❌            | ✅︎      |

-These defaults may be overridden by the model's implementation in vLLM.
-
 When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
-we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`),
-which takes priority over the model's defaults.
+its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

 You can further customize this via the `--override-pooler-config` option,
 which takes priority over both the model's and Sentence Transformers's defaults.

-!!! note
-
-    The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
-    that is not based on [PoolerConfig][vllm.config.PoolerConfig].
-
 ## Offline Inference

 The [LLM][vllm.LLM] class provides various methods for offline inference.
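
(Illustration only, not part of this hunk: a sketch of the conversion path described above. It assumes `runner` and `convert` are exposed as `LLM` keyword arguments mirroring the new `--runner`/`--convert` CLI flags, and uses a decoder-only checkpoint purely as an example of a model to be adapted for embedding.)

```python
# Sketch (assumptions): `runner` and `convert` are accepted as LLM keyword
# arguments mirroring the --runner/--convert CLI options, and the checkpoint
# below is only an illustrative decoder-only model.
from vllm import LLM

# Force pooling mode and ask vLLM to adapt the causal LM as an embedding model.
llm = LLM(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    runner="pooling",
    convert="embed",
)
(output,) = llm.embed("Hello, my name is")
print(len(output.outputs.embedding))
```
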
@@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward mode
 ```python
 from vllm import LLM

-llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
+llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
 (output,) = llm.encode("Hello, my name is")

 data = output.outputs.data
@@ -85,7 +106,7 @@ It is primarily designed for embedding models.
 ```python
 from vllm import LLM

-llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
+llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
 (output,) = llm.embed("Hello, my name is")

 embeds = output.outputs.embedding
@@ -102,7 +123,7 @@ It is primarily designed for classification models.
 ```python
 from vllm import LLM

-llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
+llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
 (output,) = llm.classify("Hello, my name is")

 probs = output.outputs.probs
@@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u
 ```python
 from vllm import LLM

-llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
+llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
 (output,) = llm.score("What is the capital of France?",
                       "The capital of Brazil is Brasilia.")

@@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka
 from vllm import LLM, PoolingParams

 llm = LLM(model="jinaai/jina-embeddings-v3",
-          task="embed",
+          runner="pooling",
           trust_remote_code=True)
 outputs = llm.embed(["Follow the white rabbit."],
                     pooling_params=PoolingParams(dimensions=32))
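
(Illustration only: a served counterpart of this offline snippet, assuming the deployment exposes the OpenAI-compatible embeddings endpoint and honors the `dimensions` parameter for Matryoshka models. Base URL and API key are placeholders.)

```python
# Sketch: requesting truncated Matryoshka embeddings from a served model,
# assuming a server such as `vllm serve jinaai/jina-embeddings-v3 --trust-remote-code`
# is running and honors the OpenAI `dimensions` parameter.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",
    input=["Follow the white rabbit."],
    dimensions=32,
)
print(len(response.data[0].embedding))  # expected: 32
```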
