Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models
Xuyang Liu1,2*, Yiyu Wang1*, Junpeng Ma3, Linfeng Zhang1✉
1 EPIC Lab, Shanghai Jiao Tong University, 2Sichuan University, 3Fudan University
The first token compression framework for VideoLLMs featuring dynamic frame budget allocation.
- **2026.02.21** Our STC has been accepted by CVPR 2026! Code is available!
- **2026.01.22** We integrated three representative baselines (FastV, VisionZip, and HoliTom) into our codebase (see the `qwen` branch), with support for Qwen3-VL.
- **2026.01.08** Added support for Qwen2.5-Omni and Qwen3-Omni in the `omni` branch, with evaluation results. To date, VidCom2 has been fully adapted to the Qwen-VL, Qwen-Omni, and LLaVA model series.
- **2025.12.30** Added support for Qwen2.5-VL and Qwen3-VL in the `qwen` branch, with evaluation results.
- **2025.12.02** We release our latest work STC, the first plug-and-play inference acceleration framework for streaming video understanding! Code is available!
- **2025.08.21** Our VidCom2 has been accepted to the EMNLP 2025 main conference!
- **2025.05.30** We are excited to release the VidCom2 implementation for Qwen2-VL!
- **2025.05.21** We release VidCom2, a plug-and-play inference acceleration method for VideoLLMs. Code is available!
- Model Adaptability: Compatible with most VideoLLMs (e.g., LLaVA, Qwen-VL, Qwen-Omni series).
- Operator Compatibility: Works seamlessly with efficient operators like Flash Attention 2.
- Strong Performance: Uses only 25% of tokens while retaining 99.6% of LLaVA-OV's performance.
- High Efficiency: Cuts LLaVA-OV generation time by 70.8% and overall latency by 43.0%.
TLDR: We present VidCom2, a plug-and-play framework that dynamically compresses video tokens based on frame uniqueness, achieving state-of-the-art efficiency and performance across various VideoLLMs and benchmarks.
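To give a flavor of the idea, here is a toy sketch of frame-adaptive budget allocation: score each frame by how far it deviates from the average frame feature, then split a global token budget across frames in proportion to that score. The function name and the scoring rule are illustrative assumptions for exposition, not VidCom2's actual algorithm:

```python
import numpy as np

def allocate_frame_budgets(frame_feats: np.ndarray, r_ratio: float = 0.25) -> np.ndarray:
    """Toy sketch: give more tokens to frames that differ most from the video average.

    frame_feats: (num_frames, tokens_per_frame, dim) visual features.
    Returns an integer token budget per frame summing to r_ratio of all tokens.
    """
    num_frames, tokens_per_frame, _ = frame_feats.shape
    total_budget = int(r_ratio * num_frames * tokens_per_frame)

    # Frame "uniqueness": distance of each frame's mean feature from the video mean.
    frame_means = frame_feats.mean(axis=1)                 # (num_frames, dim)
    video_mean = frame_means.mean(axis=0, keepdims=True)   # (1, dim)
    uniqueness = np.linalg.norm(frame_means - video_mean, axis=1)

    # Distribute the budget proportionally to uniqueness (uniform fallback if all equal).
    if uniqueness.sum() == 0:
        weights = np.full(num_frames, 1.0 / num_frames)
    else:
        weights = uniqueness / uniqueness.sum()
    budgets = np.minimum(np.floor(weights * total_budget).astype(int), tokens_per_frame)

    # Hand any leftover tokens (from flooring/capping) to the most unique frames first.
    leftover = total_budget - budgets.sum()
    order = np.argsort(-uniqueness)
    i = 0
    while leftover > 0:
        f = order[i % num_frames]
        if budgets[f] < tokens_per_frame:
            budgets[f] += 1
            leftover -= 1
        i += 1
    return budgets
```

Under this sketch, a static clip (all frames near the mean) degenerates toward a uniform split, while clips with a few distinctive frames concentrate the budget there.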
The core implementation of our code is in `token_compressor/vidcom2/vidcom2.py`.
| Model | Path |
|---|---|
| LLaVA-OneVision | token_compressor/vidcom2/models/llava.py |
| LLaVA-Video | token_compressor/vidcom2/models/llava.py |
| Qwen2-VL | token_compressor/vidcom2/models/qwen2_vl.py |
| Qwen2.5-VL | token_compressor/vidcom2/models/qwen2_5_vl.py |
| Qwen3-VL | token_compressor/vidcom2/models/qwen3_vl.py |
| Qwen2.5-Omni | token_compressor/vidcom2/models/qwen2_5_omni.py |
| Qwen3-Omni | token_compressor/vidcom2/models/qwen3_omni.py |
Minimal Integration Snippets
These changes are implemented in the lmms-eval model wrappers:
LLaVA-OneVision (`lmms-eval/lmms_eval/models/llava_onevision.py`)

```python
import os
import types

from token_compressor.vidcom2.models.llava import cus_prepare_inputs_labels_for_multimodal

if os.getenv("COMPRESSOR") == "vidcom2":
    self.model.prepare_inputs_labels_for_multimodal = types.MethodType(
        cus_prepare_inputs_labels_for_multimodal, self.model
    )
    eval_logger.info("[VidCom2] Successfully integrated VidCom2 with LLaVA-OneVision-7B.")
```

Qwen3-VL (`lmms-eval/lmms_eval/models/qwen3_vl.py`)
```python
import os
import types

from token_compressor.vidcom2.models.qwen3_vl import Qwen3VLModel_forward

if os.getenv("COMPRESSOR") == "vidcom2":
    self._model.model.forward = types.MethodType(Qwen3VLModel_forward, self._model.model)
    eval_logger.success("[VidCom2] Successfully integrated VidCom2 with Qwen3-VL.")
```

Env knobs
- Enable compression: `COMPRESSOR=vidcom2`
- Set retention ratio: `R_RATIO=0.25` (default: 0.25)
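A model wrapper only needs to read these two environment variables. A minimal sketch of that pattern (the helper name and parsing are illustrative, not the repository's actual code):

```python
import os

def read_vidcom2_knobs() -> tuple[bool, float]:
    """Hypothetical helper: read the two env knobs described above."""
    enabled = os.getenv("COMPRESSOR") == "vidcom2"   # compression on/off switch
    r_ratio = float(os.getenv("R_RATIO", "0.25"))    # fraction of video tokens to keep
    return enabled, r_ratio
```

Because both knobs come from the environment, they can be prepended to any launch command without touching the wrapper's argument parsing.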
- Clone this repository:

```shell
git clone https://github.com/xuyang-liu16/VidCom2.git
cd VidCom2
```

- Environment setup and preparation:
```shell
conda create -n VidCom2 python=3.10 -y
conda activate VidCom2
pip install --upgrade pip  # Enable PEP 660 support.
pip install -e ".[train]"
```

- Install lmms-eval:
If you want to measure latency and GPU memory, please use the custom installation:

```shell
cd lmms-eval
pip install -e .
```

Alternatively, you can use the official installation:

```shell
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git
```

We use the lmms-eval toolkit for model evaluation.
Branch Note: The main branch only supports LLaVA-series inference. To run Qwen models, please switch to the `qwen` branch.
Configuration Notes:
- VidCom2 Compression: Enable it by prepending `COMPRESSOR=vidcom2` to the command.
- Retention Ratio: Set it by prepending `R_RATIO` to the command; the default retention ratio is 0.25.
- Flash Attention: While optional, we strongly recommend enabling Flash Attention 2 to replicate the efficiency results reported in our paper.
Below are the evaluation scripts for supported models:
To evaluate LLaVA-OneVision-7B with VidCom2, you can use:
```shell
COMPRESSOR=vidcom2 R_RATIO=0.25 accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_onevision \
  --model_args pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen,attn_implementation=flash_attention_2 \
  --tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_onevision \
  --output_path ./logs/
```
To evaluate LLaVA-Video-7B with VidCom2, you can use:
```shell
COMPRESSOR=vidcom2 R_RATIO=0.25 accelerate launch --num_processes=8 \
  -m lmms_eval \
  --model llava_vid \
  --model_args pretrained=lmms-lab/LLaVA-Video-7B-Qwen2,conv_template=qwen_1_5,max_frames_num=64,mm_spatial_pool_mode=average,attn_implementation=flash_attention_2 \
  --tasks videomme,mlvu_dev,longvideobench_val_v,mvbench \
  --batch_size 1 \
  --log_samples \
  --log_samples_suffix llava_vid \
  --output_path ./logs/
```
Example format for LLaVA-OV-7B with VidCom2 (`R_RATIO=0.25`) on 8×H100 GPUs:
| Metric | Value |
|---|---|
| LLM time (s) | 96.264 |
| Total time (s) | 560.816 |
| Peak memory (MB) | 19057.5 |
If our findings help your research, please consider citing our paper:
```bibtex
@article{liu2025vidcom2,
  title={Video Compression Commander: Plug-and-Play Inference Acceleration for Video Large Language Models},
  author={Liu, Xuyang and Wang, Yiyu and Ma, Junpeng and Zhang, Linfeng},
  journal={arXiv preprint arXiv:2505.14454},
  year={2025}
}
```

We extend our gratitude to the open-source efforts of LLaVA-OneVision and Qwen2-VL.
For any questions about our paper or code, please email liuxuyang@stu.scu.edu.cn or ustywan8@ljmu.ac.uk.

