What
Let's talk about how to evaluate the quantized VL model.
First, running full benchmark evaluations after every change is prohibitively expensive when quantizing Qwen-VL. At the same time, relying on a single lightweight metric (e.g. perplexity) is insufficient for vision-language models (VLMs), where degradation often occurs in the vision encoder or multimodal fusion layers.
This issue proposes a two-stage evaluation strategy:
- Fast evaluation for rapid iteration and regression detection
- Final evaluation for quality validation and reportable results
Each stage has a clearly defined purpose and should not be used as a replacement for the other.
Design Principles
These design principles are a good starting point for deciding on the overall process.
- Separate iteration speed from evaluation rigor
  - This separation lets us iterate quickly during quantization development while still guaranteeing reliable, reportable metrics at the final stage.
- Use high-sensitivity, low-cost signals early
- Use official, comparable benchmarks only when needed
- Ensure FP16 vs quantized comparisons are fair and reproducible
Stage 1: Fast Evaluation (Iteration / Sanity Check)
- Quickly detect severe or obvious regressions caused by quantization
- Enable frequent experimentation with minimal cost
- Answer the question: “Did this quantization change break the model?”
Scope
- Small datasets
- Heuristic or relative metrics
- FP16 vs quantized comparison only
Methods
1. Text-only Perplexity (Optional)
- Run on a small text corpus if supported
- Used as a smoke test for language decoder degradation
- Known limitation: does not capture vision or multimodal failures
Example

from transformers import AutoProcessor, AutoModelForVision2Seq, AutoTokenizer

from tico.quantization.wrapq.utils.metrics import perplexity

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"

# Processor with a bounded image resolution; fall back to a standalone tokenizer
# if the processor does not bundle one.
processor = AutoProcessor.from_pretrained(MODEL_ID, min_pixels=256*256, max_pixels=384*384)
tokenizer = getattr(processor, "tokenizer", None)
if tokenizer is None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    device_map="cpu",
)
model.eval()

texts = [
    "The quick brown fox jumps over the lazy dog.",
    "In a distant future, humans and machines coexist in uneasy peace.",
    "Explain the difference between supervised and unsupervised learning in simple terms.",
]

# Smoke test: language-only perplexity on a single short prompt.
enc = tokenizer(texts[0], return_tensors="pt")
input_ids = enc["input_ids"].to(model.device)
ppl = perplexity(model, input_ids, model.device)
print("Text-only PPL:", ppl)

2. Mini VQA Evaluation
- Use a small subset (200–1,000 samples) from:
  - VQAv2
  - TextVQA
  - (Optionally) DocVQA or OK-VQA
- Fixed prompt forcing the model to output only the final answer
- Metric:
  - normalized exact match (or simple soft match); see the scoring sketch after this list
- Highly sensitive to:
  - vision encoder errors
  - OCR degradation
  - vision-language alignment issues
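A minimal scoring sketch for the normalized exact-match metric, assuming each sample carries an image, a question, and a list of reference answers; `mini_vqa_exact_match`, `generate_answer`, and the normalization rules are illustrative assumptions, not an existing API in this repo.

import re
import string

def normalize_answer(text: str) -> str:
    # VQA-style normalization: lowercase, strip punctuation and articles, collapse whitespace.
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def mini_vqa_exact_match(samples, generate_answer):
    # samples: list of dicts with "image", "question", "answers" (reference strings).
    # generate_answer: callable that runs the (quantized) model and returns an answer string.
    hits = 0
    for s in samples:
        pred = normalize_answer(generate_answer(s["image"], s["question"]))
        refs = {normalize_answer(a) for a in s["answers"]}
        hits += int(pred in refs)
    return hits / max(len(samples), 1)

Running the same scorer on the FP16 and quantized checkpoints over the identical subset gives a directly comparable relative drop.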
3. FP16 vs Quantized Output Agreement
- Evaluate on a fixed set of multimodal prompts
- Metrics:
  - next-token top-1 agreement
  - logit similarity (cosine or KL divergence)
- Does not require ground-truth labels
- Very sensitive to subtle regressions; see the comparison sketch after this list
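A minimal comparison sketch, assuming both checkpoints are standard transformers models that return `.logits` for the same pre-processed inputs; `output_agreement` and its arguments are illustrative names.

import torch
import torch.nn.functional as F

@torch.no_grad()
def output_agreement(fp16_model, quant_model, batches):
    # batches: list of processed multimodal inputs (dicts of tensors) for the fixed prompt set.
    top1_hits, total, kl_sum = 0, 0, 0.0
    for inputs in batches:
        ref = fp16_model(**inputs).logits[:, -1, :].float()  # FP16 next-token logits
        q = quant_model(**inputs).logits[:, -1, :].float()   # quantized next-token logits
        top1_hits += (ref.argmax(-1) == q.argmax(-1)).sum().item()
        total += ref.shape[0]
        # KL(FP16 || quantized) over the next-token distribution
        kl_sum += F.kl_div(
            F.log_softmax(q, dim=-1),
            F.log_softmax(ref, dim=-1),
            reduction="batchmean",
            log_target=True,
        ).item()
    return {
        "top1_agreement": top1_hits / max(total, 1),
        "mean_next_token_kl": kl_sum / max(len(batches), 1),
    }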
4. Basic Performance Metrics (Optional)
- Prefill latency (with image input)
- Decode throughput (tokens/sec)
- Peak memory usage
- Relative speedup vs FP16 (see the measurement sketch below)
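A rough measurement sketch using standard PyTorch timing and memory counters; it assumes a single pre-processed image+prompt input already on the model's device, and peak memory is only reported when CUDA is available.

import time
import torch

@torch.no_grad()
def basic_perf(model, inputs, max_new_tokens=64):
    if torch.cuda.is_available():
        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()

    # Prefill latency: one forward pass over the full multimodal prompt.
    t0 = time.perf_counter()
    model(**inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    prefill_s = time.perf_counter() - t0

    # Decode throughput: greedy generation of up to max_new_tokens tokens
    # (the timed region includes a second prefill; fine for a rough relative comparison).
    t0 = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    decode_s = time.perf_counter() - t0
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]

    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else float("nan")
    return {"prefill_s": prefill_s, "decode_tok_per_s": new_tokens / decode_s, "peak_mem_gb": peak_gb}

Collecting the same numbers for the FP16 baseline in the same run makes the relative speedup straightforward to report.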
Stage 2: Final Evaluation (Validation / Reporting)
- Produce trustworthy, reportable metrics
- Confirm that quantization preserves end-to-end multimodal quality
- Answer the question: “Is this model ready for release or comparison?”
Scope
- Official datasets and splits
- Standardized evaluation protocols
- Absolute metrics suitable for documentation and comparison
Methods
1. Multimodal Benchmarks (Primary)
- Datasets:
  - VQAv2
  - TextVQA
  - DocVQA (or equivalent)
  - OK-VQA (optional)
- Protocol:
  - official splits
  - fixed prompting
  - deterministic decoding (temperature = 0); a prompting/decoding sketch follows this list
- Metrics:
  - official accuracy (preferred)
  - relative drop vs FP16 baseline
  - category-level breakdowns when available
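A sketch of the fixed-prompt, deterministic-decoding protocol for a single VQA sample, reusing the processor and model from the Stage 1 example; the exact chat-template/content layout should follow the Qwen-VL model card, and the prompt wording here is an assumption.

from PIL import Image

def answer_vqa(model, processor, image_path, question, max_new_tokens=16):
    # Fixed prompt that forces a short, final-answer-only response.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": f"{question}\nAnswer with a single word or phrase."},
        ],
    }]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

    # Deterministic (greedy) decoding, equivalent to temperature = 0.
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    generated = out[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(generated, skip_special_tokens=True)[0].strip()

The returned string can be fed directly into the official VQA accuracy script or the normalized exact-match scorer from Stage 1.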
2. Language-only Evaluation (Secondary)
- Use LM Eval Harness (lm-eval) for standardized comparison (see the sketch after this list)
- Suggested tasks:
  - MMLU
  - HellaSwag
  - ARC-Challenge
- Few-shot: 0 or low
- Prefer log-likelihood mode over generation
- Goal: ensure language decoder quality is preserved
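A minimal sketch using the harness's Python entry point; it assumes the quantized checkpoint can be loaded through the harness's hf model wrapper, the checkpoint path, dtype, and batch size are placeholders, and argument names should be double-checked against the installed lm-eval version.

import lm_eval

# Log-likelihood based scoring of the (quantized) checkpoint on language-only tasks.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/path/to/quantized-checkpoint,dtype=float16",
    tasks=["mmlu", "hellaswag", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])

Running the same evaluation against the FP16 checkpoint provides the baseline for the relative-drop comparison.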