# [Deprecation][2/N] Replace --task with --runner and --convert #21470


**Merged** Jul 28, 2025 · 56 commits (changes from all commits shown below)

## Commits
- `719d650` [Frontend] Replace `--task` option with `--runner` and `--convert` (DarkLight1337, Jul 23, 2025)
- `9d48e9b` Update (DarkLight1337, Jul 23, 2025)
- `e0e1f24` Update (DarkLight1337, Jul 23, 2025)
- `aff0874` Remove downstream usages of `model_config.task` (DarkLight1337, Jul 23, 2025)
- `e94e121` Merge branch 'main' into rename-task (DarkLight1337, Jul 23, 2025)
- `d7ec7ef` Fixes (DarkLight1337, Jul 24, 2025)
- `628a9e7` Merge branch 'main' into rename-task (DarkLight1337, Jul 24, 2025)
- `2852937` Update docs (DarkLight1337, Jul 24, 2025)
- `d6fe44b` Simplify deprecation logic (DarkLight1337, Jul 24, 2025)
- `24ca7ba` Fix default runner and conversion (DarkLight1337, Jul 24, 2025)
- `4d08cd1` Update (DarkLight1337, Jul 24, 2025)
- `7e17f2e` Fix (DarkLight1337, Jul 24, 2025)
- `fad33e9` Update docs (DarkLight1337, Jul 24, 2025)
- `bbaa245` Fix admonitions (DarkLight1337, Jul 24, 2025)
- `7c1dc83` Merge branch 'main' into rename-task (DarkLight1337, Jul 24, 2025)
- `52c300b` Improve docs (DarkLight1337, Jul 24, 2025)
- `1c6a84e` Merge branch 'main' into rename-task (DarkLight1337, Jul 24, 2025)
- `3dfcdb5` Update tests (DarkLight1337, Jul 24, 2025)
- `4426f78` Update model resolution (DarkLight1337, Jul 24, 2025)
- `b6d41bb` Try fix (DarkLight1337, Jul 24, 2025)
- `c764da3` More fixes and cleanup (DarkLight1337, Jul 24, 2025)
- `f78fb3d` Make the transformers test stricter (DarkLight1337, Jul 24, 2025)
- `0c1266c` Reorder (DarkLight1337, Jul 24, 2025)
- `631dfef` Remove local variable that's only used one time (hmellor, Jul 25, 2025)
- `2485ebe` Simplify conditions in `_normalize_arch` (hmellor, Jul 25, 2025)
- `775aa05` Return `normalized_arch` in `inspect_model_cls` (hmellor, Jul 25, 2025)
- `9047c90` Fix verify_and_update_config (DarkLight1337, Jul 25, 2025)
- `30d18f2` Merge branch 'main' into rename-task (DarkLight1337, Jul 25, 2025)
- `784e723` Update comment (DarkLight1337, Jul 25, 2025)
- `330d0ba` Handle `*Model` explicitly (DarkLight1337, Jul 25, 2025)
- `b50ac67` Update (DarkLight1337, Jul 25, 2025)
- `da0a183` Handle ST models (DarkLight1337, Jul 25, 2025)
- `2e66452` Update (DarkLight1337, Jul 25, 2025)
- `6e89575` Fix (DarkLight1337, Jul 25, 2025)
- `18f0a32` Remove task check (DarkLight1337, Jul 25, 2025)
- `f397c91` Merge branch 'main' into rename-task (DarkLight1337, Jul 25, 2025)
- `20b8215` Update test (DarkLight1337, Jul 25, 2025)
- `b718c8b` Fix model resolution (DarkLight1337, Jul 25, 2025)
- `164b05b` Try fix (DarkLight1337, Jul 26, 2025)
- `fdcbda0` Always pass model config (DarkLight1337, Jul 26, 2025)
- `b1c2118` Fixes (DarkLight1337, Jul 26, 2025)
- `1b7e56c` Merge branch 'main' into rename-task (DarkLight1337, Jul 26, 2025)
- `b96c651` Merge branch 'main' into rename-task (DarkLight1337, Jul 26, 2025)
- `7f438d2` Fix pre-commit (DarkLight1337, Jul 26, 2025)
- `397c0c7` Avoid checking imports (DarkLight1337, Jul 26, 2025)
- `b3c2535` Fix (DarkLight1337, Jul 26, 2025)
- `54b93ba` Try fix (DarkLight1337, Jul 27, 2025)
- `803b494` Merge branch 'main' into rename-task (DarkLight1337, Jul 27, 2025)
- `6c63bd0` Fix (DarkLight1337, Jul 27, 2025)
- `61d5160` Update (DarkLight1337, Jul 27, 2025)
- `1ecd6bd` Skip roberta seq cls for V1 (DarkLight1337, Jul 27, 2025)
- `8925fac` Fix transformers loading (DarkLight1337, Jul 27, 2025)
- `cc999d3` Cleanup (DarkLight1337, Jul 27, 2025)
- `11377a4` Don't load HF in registry test (DarkLight1337, Jul 27, 2025)
- `af6498a` Fix model impl (DarkLight1337, Jul 27, 2025)
- `741be47` Fix remaining test (DarkLight1337, Jul 28, 2025)
## Files changed
### docs/features/multimodal_inputs.md (2 additions, 2 deletions)

@@ -343,7 +343,7 @@ Here is a simple example using Phi-3.5-Vision.
First, launch the OpenAI-compatible server:

```bash
-vllm serve microsoft/Phi-3.5-vision-instruct --task generate \
+vllm serve microsoft/Phi-3.5-vision-instruct --runner generate \
--trust-remote-code --max-model-len 4096 --limit-mm-per-prompt '{"image":2}'
```

@@ -422,7 +422,7 @@ Instead of `image_url`, you can pass a video file via `video_url`. Here is a sim
First, launch the OpenAI-compatible server:

```bash
-vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --task generate --max-model-len 8192
+vllm serve llava-hf/llava-onevision-qwen2-0.5b-ov-hf --runner generate --max-model-len 8192
```

Then, you can use the OpenAI client as follows:
### docs/features/prompt_embeds.md (1 addition, 1 deletion)

@@ -34,7 +34,7 @@ Prompt embeddings are passed in as base64 encoded torch tensors.
First, launch the OpenAI-compatible server:

```bash
-vllm serve meta-llama/Llama-3.2-1B-Instruct --task generate \
+vllm serve meta-llama/Llama-3.2-1B-Instruct --runner generate \
    --max-model-len 4096 --enable-prompt-embeds
```

> Review thread on the `--runner generate` line above:
>
> **noooop** (Contributor), Jul 24, 2025: e.g. I think it's unnecessary to set `--runner generate` in this case.
>
> **DarkLight1337** (Member, Author): Yes, this is unnecessary. I just followed the original code, which set `--task` even though it was unnecessary as well.
>
> **noooop** (Contributor), Jul 24, 2025: How about deleting these unnecessary `--runner` options as well?
>
> **DarkLight1337** (Member, Author): I'm not sure whether the original author put them there on purpose; maybe we can address this in a follow-up PR.
>
> **noooop** (Contributor): ok

### docs/models/generative_models.md (10 additions, 3 deletions)

@@ -2,12 +2,19 @@

vLLM provides first-class support for generative models, which covers most of LLMs.

In vLLM, generative models implement the [VllmModelForTextGeneration][vllm.model_executor.models.VllmModelForTextGeneration] interface.
Based on the final hidden states of the input, these models output log probabilities of the tokens to generate,
which are then passed through [Sampler][vllm.model_executor.layers.Sampler] to obtain the final text.

-For generative models, the only supported `--task` option is `"generate"`.
-Usually, this is automatically inferred so you don't have to specify it.
+## Configuration

+### Model Runner (`--runner`)

+Run a model in generation mode via the option `--runner generate`.

+!!! tip
+    There is no need to set this option in the vast majority of cases as vLLM can automatically
+    detect the model runner to use via `--runner auto`.

## Offline Inference

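To make the new option concrete, here is a minimal offline sketch of the same mode selection (the model name is illustrative, and per the tip above `--runner auto` would normally infer this without being told):

```python
from vllm import LLM

# Explicitly select the generation runner; runner="auto" (the default)
# resolves to the same thing for a text-generation checkpoint.
llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct", runner="generate")

outputs = llm.generate("Hello, my name is")
print(outputs[0].outputs[0].text)
```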
### docs/models/pooling_models.md (49 additions, 28 deletions)

@@ -1,28 +1,49 @@
# Pooling Models

-vLLM also supports pooling models, including embedding, reranking and reward models.
+vLLM also supports pooling models, such as embedding, classification and reward models.

In vLLM, pooling models implement the [VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface.
-These models use a [Pooler][vllm.model_executor.layers.Pooler] to extract the final hidden states of the input
+These models use a [Pooler][vllm.model_executor.layers.pooler.Pooler] to extract the final hidden states of the input
before returning them.

!!! note
    We currently support pooling models primarily as a matter of convenience.
    As shown in the [Compatibility Matrix](../features/compatibility_matrix.md), most vLLM features are not applicable to
    pooling models as they only work on the generation or decode stage, so performance may not improve as much.

-If the model doesn't implement this interface, you can set `--task` which tells vLLM
-to convert the model into a pooling model.
+## Configuration

-| `--task` | Model type | Supported pooling tasks |
-|------------|----------------------|-------------------------------|
-| `embed` | Embedding model | `encode`, `embed` |
-| `classify` | Classification model | `encode`, `classify`, `score` |
-| `reward` | Reward model | `encode` |
+### Model Runner

-## Pooling Tasks
+Run a model in pooling mode via the option `--runner pooling`.

-In vLLM, we define the following pooling tasks and corresponding APIs:
+!!! tip
+    There is no need to set this option in the vast majority of cases as vLLM can automatically
+    detect the model runner to use via `--runner auto`.

+### Model Conversion
+
+vLLM can adapt models for various pooling tasks via the option `--convert <type>`.
+
+If `--runner pooling` has been set (manually or automatically) but the model does not implement the
+[VllmModelForPooling][vllm.model_executor.models.VllmModelForPooling] interface,
+vLLM will attempt to automatically convert the model according to the architecture names
+shown in the table below.
+
+| Architecture | `--convert` | Supported pooling tasks |
+|-------------------------------------------------|-------------|-------------------------------|
+| `*ForTextEncoding`, `*EmbeddingModel`, `*Model` | `embed` | `encode`, `embed` |
+| `*For*Classification`, `*ClassificationModel` | `classify` | `encode`, `classify`, `score` |
+| `*ForRewardModeling`, `*RewardModel` | `reward` | `encode` |
+
+!!! tip
+    You can explicitly set `--convert <type>` to specify how to convert the model.

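As a sketch of how the runner and conversion options interact (the model name is an example; its `*ForSequenceClassification` architecture matches the `classify` row of the table above, and the `convert` keyword is assumed to mirror the `--convert` CLI option rather than being strictly required):

```python
from vllm import LLM

# The architecture name ends in "ForSequenceClassification", so vLLM would
# pick convert="classify" automatically; passing it makes the intent visible.
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach",
          runner="pooling",
          convert="classify")

(output,) = llm.classify("vLLM is wonderful!")
print(output.outputs.probs)
```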
+### Pooling Tasks
+
+Each pooling model in vLLM supports one or more of these tasks according to
+[Pooler.get_supported_tasks][vllm.model_executor.layers.pooler.Pooler.get_supported_tasks],
+enabling the corresponding APIs:

| Task | APIs |
|------------|--------------------|
@@ -31,32 +52,32 @@ In vLLM, we define the following pooling tasks and corresponding APIs:
| `classify` | `classify` |
| `score` | `score` |

-\*The `score` API falls back to `embed` task if the model does not support `score` task.
+\* The `score` API falls back to `embed` task if the model does not support `score` task.

-Each pooling model in vLLM supports one or more of these tasks according to [Pooler.get_supported_tasks][vllm.model_executor.layers.Pooler.get_supported_tasks].
+### Pooler Configuration

-By default, the pooler assigned to each task has the following attributes:
+#### Predefined models

+If the [Pooler][vllm.model_executor.layers.pooler.Pooler] defined by the model accepts `pooler_config`,
+you can override some of its attributes via the `--override-pooler-config` option.

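As a concrete sketch of such an override (the field names follow the attribute table below and are assumptions here, as is the model choice):

```python
from vllm import LLM
from vllm.config import PoolerConfig

# Ask for mean pooling without L2 normalization instead of the model's
# default pooler settings.
llm = LLM(model="intfloat/e5-mistral-7b-instruct",
          runner="pooling",
          override_pooler_config=PoolerConfig(pooling_type="MEAN",
                                              normalize=False))

(output,) = llm.embed("Hello, my name is")
print(len(output.outputs.embedding))
```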
+#### Converted models

+If the model has been converted via `--convert` (see above),
+the pooler assigned to each task has the following attributes by default:

| Task | Pooling Type | Normalization | Softmax |
|------------|----------------|---------------|---------|
| `encode` | `ALL` | ❌ | ❌ |
| `embed` | `LAST` | ✅︎ | ❌ |
| `classify` | `LAST` | ❌ | ✅︎ |

These defaults may be overridden by the model's implementation in vLLM.

When loading [Sentence Transformers](https://huggingface.co/sentence-transformers) models,
-we attempt to override the defaults based on its Sentence Transformers configuration file (`modules.json`),
-which takes priority over the model's defaults.
+its Sentence Transformers configuration file (`modules.json`) takes priority over the model's defaults.

You can further customize this via the `--override-pooler-config` option,
which takes priority over both the model's and Sentence Transformers's defaults.

!!! note
    The above configuration may be disregarded if the model's implementation in vLLM defines its own pooler
    that is not based on [PoolerConfig][vllm.config.PoolerConfig].

## Offline Inference

The [LLM][vllm.LLM] class provides various methods for offline inference.
@@ -70,7 +91,7 @@ It returns the extracted hidden states directly, which is useful for reward models.
```python
from vllm import LLM

llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", task="reward")
llm = LLM(model="Qwen/Qwen2.5-Math-RM-72B", runner="pooling")
(output,) = llm.encode("Hello, my name is")

data = output.outputs.data
```

@@ -85,7 +106,7 @@ It is primarily designed for embedding models.
```python
from vllm import LLM

llm = LLM(model="intfloat/e5-mistral-7b-instruct", task="embed")
llm = LLM(model="intfloat/e5-mistral-7b-instruct", runner="pooling")
(output,) = llm.embed("Hello, my name is")

embeds = output.outputs.embedding
```

@@ -102,7 +123,7 @@ It is primarily designed for classification models.
```python
from vllm import LLM

llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", task="classify")
llm = LLM(model="jason9693/Qwen2.5-1.5B-apeach", runner="pooling")
(output,) = llm.classify("Hello, my name is")

probs = output.outputs.probs
```

@@ -123,7 +144,7 @@ It is designed for embedding models and cross encoder models. Embedding models u
```python
from vllm import LLM

llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
llm = LLM(model="BAAI/bge-reranker-v2-m3", runner="pooling")
(output,) = llm.score("What is the capital of France?",
"The capital of Brazil is Brasilia.")

```

@@ -175,7 +196,7 @@ You can change the output dimensions of embedding models that support Matryoshka

```python
from vllm import LLM, PoolingParams

llm = LLM(model="jinaai/jina-embeddings-v3",
task="embed",
runner="pooling",
          trust_remote_code=True)
outputs = llm.embed(["Follow the white rabbit."],
                    pooling_params=PoolingParams(dimensions=32))
```