Merged

56 commits
- 719d650 [Frontend] Replace `--task` option with `--runner` and `--convert` (DarkLight1337, Jul 23, 2025)
- 9d48e9b Update (DarkLight1337, Jul 23, 2025)
- e0e1f24 Update (DarkLight1337, Jul 23, 2025)
- aff0874 Remove downstream usages of `model_config.task` (DarkLight1337, Jul 23, 2025)
- e94e121 Merge branch 'main' into rename-task (DarkLight1337, Jul 23, 2025)
- d7ec7ef Fixes (DarkLight1337, Jul 24, 2025)
- 628a9e7 Merge branch 'main' into rename-task (DarkLight1337, Jul 24, 2025)
- 2852937 Update docs (DarkLight1337, Jul 24, 2025)
- d6fe44b Simplify deprecation logic (DarkLight1337, Jul 24, 2025)
- 24ca7ba Fix default runner and conversion (DarkLight1337, Jul 24, 2025)
- 4d08cd1 Update (DarkLight1337, Jul 24, 2025)
- 7e17f2e Fix (DarkLight1337, Jul 24, 2025)
- fad33e9 Update docs (DarkLight1337, Jul 24, 2025)
- bbaa245 Fix admonitions (DarkLight1337, Jul 24, 2025)
- 7c1dc83 Merge branch 'main' into rename-task (DarkLight1337, Jul 24, 2025)
- 52c300b Improve docs (DarkLight1337, Jul 24, 2025)
- 1c6a84e Merge branch 'main' into rename-task (DarkLight1337, Jul 24, 2025)
- 3dfcdb5 Update tests (DarkLight1337, Jul 24, 2025)
- 4426f78 Update model resolution (DarkLight1337, Jul 24, 2025)
- b6d41bb Try fix (DarkLight1337, Jul 24, 2025)
- c764da3 More fixes and cleanup (DarkLight1337, Jul 24, 2025)
- f78fb3d Make the transformers test stricter (DarkLight1337, Jul 24, 2025)
- 0c1266c Reorder (DarkLight1337, Jul 24, 2025)
- 631dfef Remove local variable that's only used one time (hmellor, Jul 25, 2025)
- 2485ebe Simplify conditions in `_normalize_arch` (hmellor, Jul 25, 2025)
- 775aa05 Return `normalized_arch` in `inspect_model_cls` (hmellor, Jul 25, 2025)
- 9047c90 Fix verify_and_update_config (DarkLight1337, Jul 25, 2025)
- 30d18f2 Merge branch 'main' into rename-task (DarkLight1337, Jul 25, 2025)
- 784e723 Update comment (DarkLight1337, Jul 25, 2025)
- 330d0ba Handle `*Model` explicitly (DarkLight1337, Jul 25, 2025)
- b50ac67 Update (DarkLight1337, Jul 25, 2025)
- da0a183 Handle ST models (DarkLight1337, Jul 25, 2025)
- 2e66452 Update (DarkLight1337, Jul 25, 2025)
- 6e89575 Fix (DarkLight1337, Jul 25, 2025)
- 18f0a32 Remove task check (DarkLight1337, Jul 25, 2025)
- f397c91 Merge branch 'main' into rename-task (DarkLight1337, Jul 25, 2025)
- 20b8215 Update test (DarkLight1337, Jul 25, 2025)
- b718c8b Fix model resolution (DarkLight1337, Jul 25, 2025)
- 164b05b Try fix (DarkLight1337, Jul 26, 2025)
- fdcbda0 Always pass model config (DarkLight1337, Jul 26, 2025)
- b1c2118 Fixes (DarkLight1337, Jul 26, 2025)
- 1b7e56c Merge branch 'main' into rename-task (DarkLight1337, Jul 26, 2025)
- b96c651 Merge branch 'main' into rename-task (DarkLight1337, Jul 26, 2025)
- 7f438d2 Fix pre-commit (DarkLight1337, Jul 26, 2025)
- 397c0c7 Avoid checking imports (DarkLight1337, Jul 26, 2025)
- b3c2535 Fix (DarkLight1337, Jul 26, 2025)
- 54b93ba Try fix (DarkLight1337, Jul 27, 2025)
- 803b494 Merge branch 'main' into rename-task (DarkLight1337, Jul 27, 2025)
- 6c63bd0 Fix (DarkLight1337, Jul 27, 2025)
- 61d5160 Update (DarkLight1337, Jul 27, 2025)
- 1ecd6bd Skip roberta seq cls for V1 (DarkLight1337, Jul 27, 2025)
- 8925fac Fix transformers loading (DarkLight1337, Jul 27, 2025)
- cc999d3 Cleanup (DarkLight1337, Jul 27, 2025)
- 11377a4 Don't load HF in registry test (DarkLight1337, Jul 27, 2025)
- af6498a Fix model impl (DarkLight1337, Jul 27, 2025)
- 741be47 Fix remaining test (DarkLight1337, Jul 28, 2025)
4 changes: 2 additions & 2 deletions docs/features/multimodal_inputs.md

@@ -279,7 +279,7 @@ Here is a simple example using Phi-3.5-Vision.
First, launch the OpenAI-compatible server:

```bash
-vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
+vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
```

@@ -358,7 +358,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim
First, launch the OpenAI-compatible server:

```bash
-vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192
+vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --runner generate --max-model-len 8192
```

Then, you can use the OpenAI client as follows:
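
The actual client snippet sits in the collapsed part of this diff; as a rough sketch of what such a call looks like (the server URL, API key, and sample video URL are illustrative assumptions, not taken from the diff):

```python
from openai import OpenAI

# Sketch only: assumes a local vLLM OpenAI-compatible server and an
# illustrative video URL; the real snippet lives in the collapsed diff.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this video?"},
            {"type": "video_url",
             "video_url": {"url": "https://example.com/sample.mp4"}},
        ],
    }],
)
print(chat_response.choices[0].message.content)
```
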
2 changes: 1 addition & 1 deletion docs/features/prompt_embeds.md

@@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors.
First, launch the OpenAI-compatible server:

```bash
-vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
+vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
    --max-model-len 4096 --enable-prompt-embeds
```

Review comments on this snippet:

noooop (Contributor), Jul 24, 2025: e.g. I think it's unnecessary to set `--runner generate` in this case.

DarkLight1337 (Member, Author): Yes, this is unnecessary. I just followed the original code, which set `--task` even though that was unnecessary as well.

noooop (Contributor), Jul 24, 2025: How about deleting these unnecessary `--runner` flags as well?

DarkLight1337 (Member, Author): I'm not sure whether the original author put them there on purpose; maybe we can address this in a follow-up PR.

Contributor: ok
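
As background to the snippet under discussion: the surrounding doc states that prompt embeddings are passed in as base64-encoded torch tensors. A minimal client-side sketch (not part of the diff; the tensor shape and the request field name are illustrative assumptions):

```python
import base64
import io

import torch

# Sketch only: encode a prompt-embedding tensor for the server launched above.
# The shape and the `prompt_embeds` field name are illustrative assumptions.
embeds = torch.randn(16, 2048)

buffer = io.BytesIO()
torch.save(embeds, buffer)
encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")

# `encoded` would then be included in the request payload,
# e.g. {"model": "meta-llama/Llama-3.2-1B-Instruct", "prompt_embeds": encoded}.
```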

4 changes: 2 additions & 2 deletions docs/models/generative_models.md

@@ -6,8 +6,8 @@ In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.

-For generative models, the only supported `--task` option is `"generate"`.
-Usually, this is automatically inferred so you don't have to specify it.
+For model architectures that support both generation and pooling, you should set `--runner generate`
+to use the model as a generative model.
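
To make the new flag concrete, a minimal offline sketch (not part of this diff; the model name, and the assumption that the `--runner` CLI flag maps to a same-named `LLM` keyword argument, are illustrative):

```python
from vllm import LLM

# Sketch only: explicitly select the generative runner for an architecture
# that also supports pooling. The model name and the `runner` kwarg mirroring
# the CLI flag are assumptions.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct", runner="generate")

outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```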

## Offline Inference

31 changes: 17 additions & 14 deletions docs/models/pooling_models.md

@@ -3,22 +3,25 @@
vLLM also supports pooling models, including embedding, reranking and reward models.

In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
-These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
+These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
before returning them.

!!! note
We currently support pooling models primarily as a matter of convenience.
As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
pooling models as they only work on the generation or decode stage, so performance may not improve as much.

-If the model doesn't implement this interface, you can set `--task` which tells vLLM
+If the model doesn't implement this interface, you can set `--convert` which tells vLLM
to convert the model into a pooling model.

-| `--task`   | Model type           | Supported pooling tasks       |
-|------------|----------------------|-------------------------------|
-| `embed`    | Embedding model      | `encode`, `embed`             |
-| `classify` | Classification model | `encode`, `classify`, `score` |
-| `reward`   | Reward model         | `encode`                      |
+| `--convert` | Model type           | Supported pooling tasks       |
+|-------------|----------------------|-------------------------------|
+| `embed`     | Embedding model      | `encode`, `embed`             |
+| `classify`  | Classification model | `encode`, `classify`, `score` |
+| `reward`    | Reward model         | `encode`                      |

For model architectures that support both generation and pooling, you should set `--runner pooling`
to use the model as a pooling model.
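
To illustrate the conversion path described above, a minimal sketch (not part of this diff; it assumes the `--convert` CLI flag is mirrored by a `convert` keyword argument on `LLM`, and the model name is only an example decoder-only checkpoint):

```python
from vllm import LLM

# Sketch only: run a generative checkpoint as an embedding model by
# combining the pooling runner with an `embed` conversion. The `convert`
# kwarg mirroring the CLI flag is an assumption.
llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct",
          runner="pooling",
          convert="embed")

(output,) = llm.embed("Hello, my name is")
print(len(output.outputs.embedding))
```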

## Pooling Tasks

@@ -31,9 +34,9 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
| `classify` | `classify` |
| `score` | `score` |

-\*The `score` API falls back to `embed` task if the model does not support `score` task.
+\* The `score` API falls back to `embed` task if the model does not support `score` task.

-Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
+Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks].

By default, the pooler assigned to each task has the following attributes:

@@ -70,7 +73,7 @@ It returns the extracted hidden states directly, which is useful for reward mode
```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
(output,) = llm.encode("Hello, my name is")

data = output.outputs.data
@@ -85,7 +88,7 @@ It is primarily designed for embedding models.
```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
@@ -102,7 +105,7 @@ It is primarily designed for classification models.
```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
@@ -123,7 +126,7 @@ It is designed for embedding models and cross encoder models. Embedding models u
```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score("What is the capital of France?",
"The capital of Brazil is Brasilia.")

@@ -175,7 +178,7 @@ You can change the output dimensions of embedding models that support Matryoshka
from vllm import LLM, PoolingParams

llm = LLM(model="jinaai/jina-embeddings-v3",
task="embed",
runner="pooling",
trust_remote_code=True)
outputs = llm.embed(["Follow the white rabbit."],
pooling_params=PoolingParams(dimensions=32))