[Frontend] Add chunked processing to handle long inputs in embedding models #22280


Merged: 41 commits, Aug 13, 2025
Commits
9ebc61b
The latest update introduces new text embedding examples and service …
x22x22 Aug 5, 2025
cab8200
Fix the logic for merging multimodal processor parameters to ensure incoming arguments are merged correctly. Update the related files to use the new merging approach.
x22x22 Aug 5, 2025
57987aa
restore
x22x22 Aug 5, 2025
8e3ba72
Feature: Implement chunk processing and maximum embedding length conf…
x22x22 Aug 5, 2025
f24b546
restore
x22x22 Aug 5, 2025
b46791b
restore
x22x22 Aug 5, 2025
54c7930
Feature: Implementation of Chunk Processing for Embedding Requests of…
x22x22 Aug 5, 2025
1ad1ae3
revert: restore processor.py and registry.py to main branch state
x22x22 Aug 6, 2025
35e0aee
Refactor: Enhance the code structure and error handling logic for emb…
x22x22 Aug 6, 2025
483be3e
Refactor: Enhance the code structure and error handling logic for emb…
x22x22 Aug 6, 2025
d410c34
Refactor: Enhance the code structure and error handling logic for emb…
x22x22 Aug 6, 2025
8880316
Feature: Implementation of an automatic chunking mechanism for long t…
x22x22 Aug 6, 2025
6e62421
Feature: Implementation of an automatic chunking mechanism for long t…
x22x22 Aug 6, 2025
ae380ed
Refactoring inelegant code
x22x22 Aug 6, 2025
3ce8d47
Refactoring inelegant code
x22x22 Aug 7, 2025
54ad46e
Refactoring inelegant code
x22x22 Aug 7, 2025
503ab00
Refactoring inelegant code
x22x22 Aug 7, 2025
a48c7c4
Merge branch 'main' into feat/support-long-text-embedding
x22x22 Aug 11, 2025
8949c8f
Refactoring inelegant code
x22x22 Aug 11, 2025
d42419e
Refactoring inelegant code
x22x22 Aug 11, 2025
ac5b69a
Refactoring inelegant code
x22x22 Aug 11, 2025
b8fe266
Refactoring inelegant code
x22x22 Aug 11, 2025
e9a5d70
Refactoring inelegant code
x22x22 Aug 11, 2025
d0c1c9e
Refactoring inelegant code
x22x22 Aug 11, 2025
4de2c2b
Update vllm/entrypoints/openai/serving_embedding.py
x22x22 Aug 11, 2025
dc067f3
Refactoring inelegant code
x22x22 Aug 11, 2025
cf19859
Update vllm/entrypoints/openai/serving_embedding.py
x22x22 Aug 13, 2025
8fab603
Update vllm/entrypoints/openai/serving_embedding.py
x22x22 Aug 13, 2025
8c7d56b
Update vllm/entrypoints/openai/serving_embedding.py
x22x22 Aug 13, 2025
fa3b69f
Refactoring inelegant code
x22x22 Aug 13, 2025
6584107
Refactoring inelegant code
x22x22 Aug 13, 2025
f4d48ce
Refactoring inelegant code
x22x22 Aug 13, 2025
94a7576
Refactoring inelegant code
x22x22 Aug 13, 2025
3444141
Refactoring inelegant code
x22x22 Aug 13, 2025
ac02136
Refactoring inelegant code
x22x22 Aug 13, 2025
17c4317
Refactoring inelegant code
x22x22 Aug 13, 2025
8866b5d
Refactoring inelegant code
x22x22 Aug 13, 2025
b5230ed
Reduce diff
DarkLight1337 Aug 13, 2025
b362cbd
Simplify
DarkLight1337 Aug 13, 2025
15c462b
Merge branch 'main' into feat/support-long-text-embedding
DarkLight1337 Aug 13, 2025
d515efd
Refactoring inelegant code
x22x22 Aug 13, 2025
186 changes: 186 additions & 0 deletions examples/online_serving/openai_embedding_long_text/README.md
@@ -0,0 +1,186 @@
# Long Text Embedding with Chunked Processing

This directory contains examples for using vLLM's **chunked processing** feature to handle long-text embedding inputs that exceed the model's maximum context length.

## 🚀 Quick Start

### Start the Server

Use the provided script to start a vLLM server with chunked processing enabled:

```bash
# Basic usage (supports very long texts up to ~3M tokens)
./service.sh

# Custom configuration with different models
MODEL_NAME="jinaai/jina-embeddings-v3" \
MAX_EMBED_LEN=1048576 \
./service.sh

# For extremely long documents
MODEL_NAME="intfloat/multilingual-e5-large" \
MAX_EMBED_LEN=3072000 \
./service.sh
```

### Test Long Text Embedding

Run the comprehensive test client:

```bash
python client.py
```

## 📁 Files

| File | Description |
|------|-------------|
| `service.sh` | Server startup script with chunked processing enabled |
| `client.py` | Comprehensive test client for long text embedding |

## ⚙️ Configuration

### Server Configuration

The key parameters for chunked processing are in the `--override-pooler-config`:

```json
{
"pooling_type": "auto",
"normalize": true,
"enable_chunked_processing": true,
"max_embed_len": 3072000
}
```

!!! note
    `pooling_type` sets the model's own pooling strategy for processing within each chunk. The cross-chunk aggregation automatically uses the MEAN strategy when the input exceeds the model's native maximum length.

#### Chunked Processing Behavior

Chunked processing uses **MEAN aggregation** for cross-chunk combination when input exceeds the model's native maximum length:

| Component | Behavior | Description |
|-----------|----------|-------------|
| **Within chunks** | Model's native pooling | Uses the model's configured pooling strategy |
| **Cross-chunk aggregation** | Always MEAN | Weighted averaging based on chunk token counts |
| **Performance** | All chunks processed | Complete semantic coverage, at the cost of one inference call per chunk |
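
The aggregation step can be pictured with a short sketch. This is illustrative only and not vLLM internals; it assumes each chunk has already been pooled into a single embedding vector and that the chunk's token count is known:

```python
# Illustrative sketch of token-count-weighted MEAN aggregation across chunks.
# Names and signatures are hypothetical, not vLLM internals.
import numpy as np

def aggregate_chunks(chunk_embeddings: list[np.ndarray],
                     chunk_token_counts: list[int]) -> np.ndarray:
    weights = np.asarray(chunk_token_counts, dtype=np.float32)
    stacked = np.stack(chunk_embeddings)                  # (num_chunks, dim)
    pooled = (stacked * weights[:, None]).sum(axis=0) / weights.sum()
    return pooled / np.linalg.norm(pooled)                # matches "normalize": true
```

Longer chunks therefore contribute proportionally more to the final vector, and the output keeps the model's embedding dimensionality.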

### Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | `intfloat/multilingual-e5-large` | Embedding model to use (supports multiple models) |
| `PORT` | `31090` | Server port |
| `GPU_COUNT` | `1` | Number of GPUs to use |
| `MAX_EMBED_LEN` | `3072000` | Maximum embedding input length (supports very long documents) |
| `POOLING_TYPE` | `auto` | Model's native pooling type: `auto`, `MEAN`, `CLS`, `LAST` (only affects within-chunk pooling, not cross-chunk aggregation) |
| `API_KEY` | `EMPTY` | API key for authentication |

## 🔧 How It Works

1. **Enhanced Input Validation**: `max_embed_len` allows accepting inputs longer than `max_model_len` without environment variables
2. **Smart Chunking**: Text is split based on `max_position_embeddings` to maintain semantic integrity
3. **Unified Processing**: All chunks processed separately through the model using its configured pooling strategy
4. **MEAN Aggregation**: When the input exceeds the model's native length, results are combined using token-count-weighted averaging across all chunks
5. **Consistent Output**: Final embeddings maintain the same dimensionality as standard processing
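
In practice the chunking is transparent to callers: a long input is sent as an ordinary embeddings request against the OpenAI-compatible endpoint. Below is a minimal sketch using the `openai` Python client, assuming the server started by `service.sh` with the defaults listed above (port `31090`, API key `EMPTY`, `intfloat/multilingual-e5-large`):

```python
# Minimal client sketch; the server chunks long inputs transparently.
# Assumes the service.sh defaults from the table above (port, API key, model).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:31090/v1", api_key="EMPTY")

long_text = "This is a very long document. " * 20000  # well beyond 4096 tokens

response = client.embeddings.create(
    model="intfloat/multilingual-e5-large",
    input=long_text,
)
print(len(response.data[0].embedding))  # same dimensionality as for short inputs
```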

### Input Length Handling

- **Within max_embed_len**: Input is accepted and processed (up to 3M+ tokens)
- **Exceeds max_position_embeddings**: Chunked processing is automatically triggered
- **Exceeds max_embed_len**: Input is rejected with a clear error message
- **No environment variables required**: Works without `VLLM_ALLOW_LONG_MAX_MODEL_LEN`
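
A rough sketch of this decision logic, purely illustrative (the actual vLLM validation code differs), assuming a model with `max_position_embeddings` of 4096 as in the troubleshooting examples below:

```python
import math

# Illustrative decision logic only; not the actual vLLM validation code.
def plan_request(num_tokens: int,
                 max_position_embeddings: int = 4096,
                 max_embed_len: int = 3_072_000) -> str:
    if num_tokens > max_embed_len:
        return "reject: input exceeds max_embed_len"
    if num_tokens > max_position_embeddings:
        num_chunks = math.ceil(num_tokens / max_position_embeddings)
        return f"chunked processing with {num_chunks} chunks"
    return "standard single-pass processing"

print(plan_request(150_000))  # -> chunked processing with 37 chunks
```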

### Extreme Long Text Support

With `MAX_EMBED_LEN=3072000`, you can process:

- **Academic papers**: Full research papers with references
- **Legal documents**: Complete contracts and legal texts
- **Books**: Entire chapters or small books
- **Code repositories**: Large codebases and documentation

## 📊 Performance Characteristics

### Chunked Processing Performance

| Aspect | Behavior | Performance |
|--------|----------|-------------|
| **Chunk Processing** | All chunks processed with native pooling | Scales linearly with input length |
| **Cross-chunk Aggregation** | MEAN weighted averaging | Minimal overhead |
| **Memory Usage** | Proportional to number of chunks | Moderate, scalable |
| **Semantic Quality** | Complete text coverage | Optimal for long documents |

## 🧪 Test Cases

The test client demonstrates:

- ✅ **Short text**: Normal processing (baseline)
- ✅ **Medium text**: Single chunk processing
- ✅ **Long text**: Multi-chunk processing with aggregation
- ✅ **Very long text**: Processing spread across many chunks
- ✅ **Extreme long text**: Document-level processing (100K+ tokens)
- ✅ **Batch processing**: Mixed-length inputs in one request (see the sketch after this list)
- ✅ **Consistency**: Reproducible results across runs
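
For the batch case, a mixed-length list can be passed as the `input` field of a single request; only the long entries trigger chunked processing on the server. A small self-contained sketch (endpoint and model are the `service.sh` defaults):

```python
# Batch sketch: short and long inputs in one embeddings request.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:31090/v1", api_key="EMPTY")

texts = [
    "A short query.",
    "A long document. " * 20000,  # long enough to be chunked server-side
]
response = client.embeddings.create(
    model="intfloat/multilingual-e5-large",
    input=texts,
)
print([len(item.embedding) for item in response.data])  # one vector per input
```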

## 🐛 Troubleshooting

### Common Issues

1. **Chunked processing not enabled**:

```log
ValueError: This model's maximum position embeddings length is 4096 tokens...
```

**Solution**: Ensure `enable_chunked_processing: true` in pooler config

2. **Input exceeds max_embed_len**:

```log
ValueError: This model's maximum embedding input length is 3072000 tokens...
```

**Solution**: Increase `max_embed_len` in pooler config or reduce input length

3. **Memory errors**:

```log
RuntimeError: CUDA out of memory
```

**Solution**: Reduce chunk size by adjusting model's `max_position_embeddings` or use fewer GPUs

4. **Slow processing**:

    **Expected**: Long inputs take longer because each chunk requires a separate inference call

### Debug Information

Server logs show chunked processing activity:

```log
INFO: Input length 150000 exceeds max_position_embeddings 4096, will use chunked processing
INFO: Split input of 150000 tokens into 37 chunks (max_chunk_size: 4096)
```

## 🤝 Contributing

To extend chunked processing support to other embedding models:

1. Check model compatibility with the pooling architecture
2. Test with various text lengths
3. Validate embedding quality compared to single-chunk processing
4. Submit PR with test cases and documentation updates

## 🆕 Enhanced Features

### max_embed_len Parameter

The new `max_embed_len` parameter provides:

- **Simplified Configuration**: No need for `VLLM_ALLOW_LONG_MAX_MODEL_LEN` environment variable
- **Flexible Input Validation**: Accept inputs longer than `max_model_len` up to `max_embed_len`
- **Extreme Length Support**: Process documents with millions of tokens
- **Clear Error Messages**: Better feedback when inputs exceed limits
- **Backward Compatibility**: Existing configurations continue to work