# Pooling Models

vLLM also supports pooling models, such as embedding, classification and reward models.

In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
before returning them.

!!! note
    We currently support pooling models primarily as a matter of convenience.
    As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
    pooling models as they only work on the generation or decode stage, so performance may not improve as much.

## Configuration

### Model Runner

Run a model in pooling mode via the option `--runner pooling`.

!!! tip
    There is no need to set this option in the vast majority of cases as vLLM can automatically
    detect the model runner to use via `--runner auto`.
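For example, when serving a model over the OpenAI-compatible API, the runner can be selected explicitly. This is a minimal sketch; the model name is illustrative, and the flag should be checked against your vLLM version:

```shell
# Force the pooling runner instead of relying on --runner auto detection.
vllm serve intfloat/e5-mistral-7b-instruct --runner pooling
```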

### Model Conversion

vLLM can adapt models for various pooling tasks via the option `--convert <type>`.

If `--runner pooling` has been set (manually or automatically) but the model does not implement the
[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
vLLM will attempt to automatically convert the model according to the architecture names
shown in the table below.

| Architecture | `--convert` | Supported pooling tasks |
|--------------------------------------------------|-------------|--------------------------------|
| `*ForTextEncoding`, `*EmbeddingModel`, `*Model`  | `embed`     | `encode`, `embed`              |
| `*For*Classification`, `*ClassificationModel`    | `classify`  | `encode`, `classify`, `score`  |
| `*ForRewardModeling`, `*RewardModel`             | `reward`    | `encode`                       |

!!! tip
    You can explicitly set `--convert <type>` to specify how to convert the model.
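The architecture-name matching in the table above can be illustrated with a small sketch. This is a hypothetical helper, not vLLM's actual resolution code; it assumes glob-style matching of the architecture string, with the generic `*Model` pattern checked last:

```python
from fnmatch import fnmatch

# Glob patterns from the conversion table, checked in order;
# the `embed` group goes last because `*Model` matches almost anything.
CONVERT_PATTERNS = [
    (("*For*Classification", "*ClassificationModel"), "classify"),
    (("*ForRewardModeling", "*RewardModel"), "reward"),
    (("*ForTextEncoding", "*EmbeddingModel", "*Model"), "embed"),
]


def guess_convert_type(architecture: str):
    """Return the `--convert` type implied by the architecture name, or None."""
    for patterns, convert_type in CONVERT_PATTERNS:
        if any(fnmatch(architecture, p) for p in patterns):
            return convert_type
    return None


print(guess_convert_type("BertForSequenceClassification"))  # classify
print(guess_convert_type("InternLM2ForRewardModel"))        # reward
print(guess_convert_type("MistralModel"))                   # embed
```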

### Pooling Tasks

Each pooling model in vLLM supports one or more of these tasks according to
[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
enabling the corresponding APIs:

| Task       | APIs               |
|------------|--------------------|
| `encode`   | `encode`           |
| `embed`    | `embed`, `score`\* |
| `classify` | `classify`         |
| `score`    | `score`            |

\* The `score` API falls back to `embed` task if the model does not support `score` task.
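The fallback can be sketched as follows (hypothetical helpers, not vLLM internals): a `score` request is served natively when the model supports the `score` task, and otherwise both texts are embedded and compared, e.g. by cosine similarity:

```python
import math


def resolve_score_task(supported_tasks: set) -> str:
    """Pick the pooling task used to serve a `score` request."""
    if "score" in supported_tasks:
        return "score"  # cross-encoder models score the pair natively
    if "embed" in supported_tasks:
        return "embed"  # fall back to comparing the two embeddings
    raise ValueError("model supports neither `score` nor `embed`")


def cosine_similarity(a, b) -> float:
    """Similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))
```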

### Pooler Configuration

#### Predefined models

If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
you can override some of its attributes via the `--override-pooler-config` option.

#### Converted models

If the model has been converted via `--convert` (see above),
the pooler assigned to each task has the following attributes by default:

| Task       | Pooling Type | Normalization | Softmax |
|------------|--------------|---------------|---------|
| `encode`   | `ALL`        | ❌            | ❌      |
| `embed`    | `LAST`       | ✅︎           | ❌      |
| `classify` | `LAST`       | ❌            | ✅︎     |
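The table above can be illustrated with a plain-Python sketch (a hypothetical helper, not vLLM's implementation): `ALL` returns every token's hidden state, `LAST` keeps the final token's hidden state, optionally followed by L2 normalization (for `embed`) or a softmax (for `classify`):

```python
import math


def pool(hidden_states, pooling_type, normalize=False, softmax=False):
    """Toy pooler mirroring the defaults table above."""
    if pooling_type == "ALL":
        return hidden_states          # one vector per input token
    if pooling_type != "LAST":
        raise ValueError(pooling_type)
    vec = hidden_states[-1]           # hidden state of the final token
    if normalize:                     # L2-normalize, as for `embed`
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
    if softmax:                       # logits -> probabilities, as for `classify`
        exps = [math.exp(x) for x in vec]
        total = sum(exps)
        vec = [e / total for e in exps]
    return vec


states = [[1.0, 2.0], [3.0, 4.0]]
print(pool(states, "ALL"))                    # both token states
print(pool(states, "LAST", normalize=True))   # unit-length embedding
print(pool(states, "LAST", softmax=True))     # class probabilities
```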

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

You can further customize this via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.
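As a sketch, a server launch that overrides the converted model's defaults might look like this; the model name is illustrative, and the JSON keys mirror [PoolerConfig][vllm.config.PoolerConfig] fields, so verify them against your vLLM version:

```shell
# Use mean pooling without L2 normalization instead of the defaults above.
vllm serve intfloat/e5-mistral-7b-instruct \
    --runner pooling \
    --override-pooler-config '{"pooling_type": "MEAN", "normalize": false}'
```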
## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.

### `LLM.encode`

The `encode` method returns the extracted hidden states directly, which is useful for reward models.

```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
(output,) = llm.encode("Hello, my name is")

data = output.outputs.data
```

### `LLM.embed`

The `embed` method is primarily designed for embedding models.

```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
```

### `LLM.classify`

The `classify` method is primarily designed for classification models.

```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
```

### `LLM.score`

The `score` method is designed for embedding models and cross encoder models. Embedding models use cosine similarity to score the pair of texts.

```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score("What is the capital of France?",
                      "The capital of Brazil is Brasilia.")
```

You can change the output dimensions of embedding models that support Matryoshka Embeddings via the `dimensions` parameter in [PoolingParams][vllm.PoolingParams]:

```python
from vllm import LLM, PoolingParams

llm = LLM(model="jinaai/jina-embeddings-v3",
          runner="pooling",
          trust_remote_code=True)
outputs = llm.embed(["Follow the white rabbit."],
                    pooling_params=PoolingParams(dimensions=32))
```