[Doc] Visual Token Pruning #2861
Open
peterchen-intel wants to merge 13 commits into openvinotoolkit:master from peterchen-intel:doc/visual/token/pruning
+74 −0
13 commits:
* a3f6d32 [Doc] Visual Token Pruning
* 674f7ed Wording update per copolit suggestion
* 03fe7da Limit the model supported.
* ef02325 Update wording and conf code
* 664348e Update wording
* 655d2a6 Trigger build for pull_request
* 92c84b8 Add read permission
* fd3cf32 Update wording per review comments
* 6393915 Update per review comments
* 3141506 Remove duplicate step
* dbc7a04 Update per review comment
* 03b9aca Remove the changes to .github/workflows/deploy_gh_pages.yml
* cde7a6c Merge branch 'master' into doc/visual/token/pruning
72 changes: 72 additions & 0 deletions
site/docs/concepts/optimization-techniques/visual-token-pruning.md
---
sidebar_position: 5
---

# Visual Token Pruning (CDPruner)

## Overview
Visual Token Pruning is a context compression technique for Multimodal / Visual Language Models (VLMs) that aims to enhance inference efficiency without significant performance degradation by identifying and removing redundant or less informative tokens. A representative approach is CDPruner, introduced in the paper [Beyond Attention or Similarity: Maximizing Conditional Diversity for Token Pruning in MLLMs](https://arxiv.org/pdf/2506.10967). Its main goal is to lower inference latency and memory footprint while retaining the visual information most relevant to the user's query.

Unlike traditional attention-based or similarity-based pruning techniques, which can either retain redundant tokens or neglect instruction relevance, CDPruner focuses on maximizing the conditional diversity of the retained visual tokens. Pruned tokens are removed from further attention computations, shrinking the KV cache footprint, reducing time to first token (TTFT), and improving throughput. A relevance weighting factor controls the influence of instruction relevance during pruning, helping balance token reduction against the preservation of important visual details.

## Conceptual Model
CDPruner operates on the sequence of visual token embeddings produced by the vision encoder before they are passed to the language model. Instead of forwarding all tokens, it selects a subset based on conditional diversity, combining token similarity and instruction relevance.

### Token Partitioning
The visual tokens are conceptually divided into:
* Retained Tokens: A selected subset that provides diverse and instruction-relevant visual information.
* Pruned Tokens: Tokens excluded from further processing because they contribute redundant or low-relevance information.

High-level flow:
1. Encode the image, producing N visual tokens (embeddings).
2. Compute pairwise token similarity and per-token relevance scores.
3. Combine relevance and similarity into a conditional kernel. A greedy DPP-based MAP algorithm identifies the least important tokens to discard according to `pruning_ratio`, adjusting scores using `relevance_weight` to control the trade-off between diversity and relevance.
4. Build the reduced token set; subsequent generation attends only to the retained tokens.
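
Steps 2–3 above can be sketched as a greedy DPP MAP selection over a relevance-conditioned similarity kernel. The NumPy sketch below is illustrative only, not the OpenVINO GenAI implementation: the function name, the cosine-similarity choice, and the linear blending of relevance are all assumptions made for exposition (the fast greedy MAP update follows the style of Chen et al., 2018).

```python
import numpy as np

def select_visual_tokens(features, relevance, keep_k, relevance_weight=0.5):
    """Greedily pick keep_k diverse, instruction-relevant tokens.

    features:  (N, D) visual token embeddings from the vision encoder
    relevance: (N,) per-token instruction-relevance scores in [0, 1]
    Returns the sorted indices of the retained tokens.
    """
    # Cosine-similarity kernel between tokens (a PSD Gram matrix).
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    # Condition the kernel on relevance: L_ij = r_i * S_ij * r_j, where
    # relevance_weight blends instruction relevance against pure diversity.
    r = relevance_weight * relevance + (1.0 - relevance_weight)
    L = r[:, None] * sim * r[None, :]

    n = L.shape[0]
    cis = np.zeros((keep_k, n))   # incremental Cholesky-style factors
    di2 = np.diag(L).copy()       # current marginal gains d_i^2
    selected = []
    for t in range(keep_k):
        j = int(np.argmax(di2))   # token with the largest marginal gain
        selected.append(j)
        eis = (L[j] - cis[:t].T @ cis[:t, j]) / np.sqrt(di2[j])
        cis[t] = eis
        di2 -= eis ** 2
        di2[j] = -np.inf          # never reselect a chosen token
    return sorted(selected)
```

With `pruning_ratio` set to 70 and an image producing 576 tokens, `keep_k` would be roughly 173; everything outside the returned index set is dropped before the language model runs.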

Effect: Pruning less important visual tokens reduces memory usage and can speed up generation; extremely high pruning may degrade answer quality for complex visual queries.

## Configuration Interface
Visual Token Pruning is exposed through fields of `ov::genai::GenerationConfig`:

* `pruning_ratio` (integer, 0–99): Portion of visual tokens to prune, specified as an integer percentage. A value of 0 disables pruning. For example, `25` means prune 25% of the visual tokens (keep 75%). Out-of-range values (negative or >=100) are treated as 0 (disabled) to avoid eliminating the entire visual context.
* `relevance_weight` (float): Weighting factor applied when aggregating or scaling dominance scores. **Recommended range:** 0.0–1.0. A value of 0 disables relevance weighting (pruning is based solely on raw dominance scores), while higher values (up to 1.0) emphasize relevance, making pruning more conservative on borderline tokens. Values above 1.0 are allowed but may have diminishing or unpredictable effects; negative values are not recommended. Default in the sample is `0.5f`.
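
The range rule for `pruning_ratio` can be made concrete with a small sketch. The helper below is hypothetical (it is not part of the GenAI API), and the exact rounding of the retained-token count may differ in the actual implementation:

```python
def retained_token_count(n_visual_tokens: int, pruning_ratio: int) -> int:
    """Apply the documented range rule: out-of-range ratios disable pruning
    rather than eliminate the entire visual context. Rounding is illustrative."""
    ratio = pruning_ratio if 0 <= pruning_ratio < 100 else 0
    return n_visual_tokens - (n_visual_tokens * ratio) // 100
```

For example, `retained_token_count(576, 25)` keeps 432 tokens, while out-of-range ratios such as 120 or -5 keep all 576.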

### Sample Usage (Python Benchmark Script)
[samples/python/visual_language_chat/benchmark_vlm.py](https://github.com/openvinotoolkit/openvino.genai/tree/master/samples/python/visual_language_chat/benchmark_vlm.py) provides a convenient way to measure the performance impact of pruning.

Minimal example (prune 70% of visual tokens on GPU):
```bash
python benchmark_vlm.py \
  -m ./models/vlm \
  -i ./data/example.jpg \
  -p "What is on the image?" \
  -d GPU \
  --pruning_ratio 70 \
  --relevance_weight 0.6
```

Relevant configuration excerpt:
```python
config = ov_genai.GenerationConfig()
config.max_new_tokens = args.max_new_tokens
config.pruning_ratio = args.pruning_ratio
config.relevance_weight = args.relevance_weight
```

Pipeline creation and generation:
```python
pipe = ov_genai.VLMPipeline(models_path, device, scheduler_config=scheduler_config)
res = pipe.generate(prompt, images=images, generation_config=config)
```

The script prints performance metrics (time-to-first-token TTFT, throughput, per-stage durations). Compare runs with different `--pruning_ratio` values to quantify latency improvements and memory savings.
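
The memory side of those comparisons can also be estimated with back-of-the-envelope arithmetic. The sketch below assumes a plain fp16 per-layer K/V layout and hypothetical model dimensions (576 visual tokens, 32 layers, 8 KV heads, head dimension 128); real allocators and cache layouts differ:

```python
def visual_kv_cache_bytes(n_visual_tokens: int, n_layers: int,
                          n_kv_heads: int, head_dim: int,
                          bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache footprint of the visual tokens.
    The factor 2 covers the separate K and V tensors; fp16 by default."""
    return 2 * n_visual_tokens * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical 576-token image, 32 layers, 8 KV heads, head dim 128:
full = visual_kv_cache_bytes(576, 32, 8, 128)          # all tokens kept
pruned = visual_kv_cache_bytes(576 - 576 * 70 // 100,  # 70% pruned, 173 kept
                               32, 8, 128)
```

Under these assumptions, pruning 70% of the visual tokens cuts their KV-cache footprint from 72 MiB to roughly 21.6 MiB, about 70% less.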

## Performance & Benefits
* Reduced KV cache memory for visual tokens -> enables larger batch sizes or longer text generation within the same memory budget.
* Lower per-step attention computation involving image tokens -> improved latency.
* Helpful for edge or GPU memory-constrained deployments (e.g., running a VLM on an integrated GPU with limited VRAM).

## Current Limitations
* The current implementation assumes a standard image encoder output; exotic hierarchical or sparse encoders might require adjusted scoring strategies.
* Pruning is applied only after the initial image encoding; pruned tokens are not dynamically re-introduced later.
* Score computation details are internal; no per-token debug API is exposed yet.
* The current implementation supports Qwen-VL models only; support for other models will be added in a subsequent release.