Evaluation Strategy for Quantizing Qwen-VL #448

@mhs4670go

Description

What

Let's talk about how to evaluate the quantized VL model.

Running full benchmark evaluations after every change is prohibitively expensive when quantizing Qwen-VL. At the same time, relying on a single lightweight metric (e.g. perplexity) is insufficient for vision-language models (VLMs), where degradation often occurs in the vision encoder or multimodal fusion layers.

This issue proposes a two-stage evaluation strategy:

  1. Fast evaluation for rapid iteration and regression detection
  2. Final evaluation for quality validation and reportable results

Each stage has a clearly defined purpose and should not be used as a replacement for the other.

Design Principles

These design principles are a good starting point for deciding on the overall process.

  • Separate iteration speed from evaluation rigor
    • This lets us iterate quickly during quantization development while still guaranteeing reliable, reportable metrics at the final stage.
  • Use high-sensitivity, low-cost signals early
  • Use official, comparable benchmarks only when needed
  • Ensure FP16 vs quantized comparisons are fair and reproducible

Stage 1: Fast Evaluation (Iteration / Sanity Check)

  • Quickly detect severe or obvious regressions caused by quantization
  • Enable frequent experimentation with minimal cost
  • Answer the question: “Did this quantization change break the model?”

Scope

  • Small datasets
  • Heuristic or relative metrics
  • FP16 vs quantized comparison only

Methods

1. Text-only Perplexity (Optional)

  • Run on a small text corpus if supported
  • Used as a smoke test for language decoder degradation
  • Known limitation: does not capture vision or multimodal failures
Example
from transformers import AutoProcessor, AutoModelForVision2Seq, AutoTokenizer

from tico.quantization.wrapq.utils.metrics import perplexity

MODEL_ID = "Qwen/Qwen3-VL-4B-Instruct"

# Cap image resolution to keep preprocessing cheap; only the tokenizer is needed for this text-only test.
processor = AutoProcessor.from_pretrained(MODEL_ID, min_pixels=256*256, max_pixels=384*384)
tokenizer = getattr(processor, "tokenizer", None)
if tokenizer is None:
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    device_map="cpu",
)
model.eval()

# Small, fixed text corpus used as a smoke test for the language decoder.
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "In a distant future, humans and machines coexist in uneasy peace.",
    "Explain the difference between supervised and unsupervised learning in simple terms.",
]

# Tokenize each snippet and report its perplexity.
for text in texts:
    enc = tokenizer(text, return_tensors="pt")
    input_ids = enc["input_ids"].to(model.device)
    ppl = perplexity(model, input_ids, model.device)
    print("Text-only PPL:", ppl)

2. Mini VQA Evaluation

  • Use a small subset (200–1,000 samples) from:
    • VQAv2
    • TextVQA
    • (Optionally) DocVQA or OK-VQA
  • Fixed prompt forcing the model to output only the final answer
  • Metric:
    • normalized exact match (or a simple soft match); see the sketch after this list
  • Highly sensitive to:
    • vision encoder errors
    • OCR degradation
    • vision-language alignment issues
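
A minimal sketch of the normalized exact match metric, assuming predictions and references are short answer strings. The normalization (lowercasing, stripping punctuation and articles) is a simplified take on the official VQA answer processing, and generate_answer in the usage comment is a hypothetical helper that applies the fixed answer-only prompt.

import re
import string

def normalize_answer(ans: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace (simplified VQA-style)."""
    ans = ans.lower().strip()
    ans = ans.translate(str.maketrans("", "", string.punctuation))
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)
    return re.sub(r"\s+", " ", ans).strip()

def mini_vqa_accuracy(predictions, references) -> float:
    """Normalized exact match over a small fixed subset (e.g. 200-1,000 samples)."""
    hits = sum(normalize_answer(p) == normalize_answer(r) for p, r in zip(predictions, references))
    return hits / max(len(references), 1)

# Usage sketch: `samples` holds dicts with "image", "question", "answer";
# `generate_answer(...)` is a hypothetical helper applying the fixed answer-only prompt.
# preds = [generate_answer(model, processor, s["image"], s["question"]) for s in samples]
# print("mini-VQA accuracy:", mini_vqa_accuracy(preds, [s["answer"] for s in samples]))

A soft-match variant could instead accept a prediction that contains (or is contained in) the normalized ground-truth answer.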

3. FP16 vs Quantized Output Agreement

  • Evaluate on a fixed set of multimodal prompts
  • Metrics:
    • next-token top-1 agreement
    • logit similarity (cosine or KL divergence); see the sketch after this list
  • Does not require ground-truth labels
  • Very sensitive to subtle regressions
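
A minimal sketch of the agreement metrics, assuming the FP16 reference and the quantized model accept the same preprocessed multimodal batch and return standard logits. The function and variable names (agreement_metrics, batch) are illustrative, not an existing API.

import torch
import torch.nn.functional as F

@torch.no_grad()
def agreement_metrics(fp16_model, quant_model, batch):
    """Compare next-token predictions of the FP16 and quantized models on one multimodal batch."""
    ref_logits = fp16_model(**batch).logits[:, -1, :]  # next-token logits, FP16 reference
    q_logits = quant_model(**batch).logits[:, -1, :]   # next-token logits, quantized model

    # Top-1 agreement: fraction of prompts where both models pick the same next token.
    top1 = (ref_logits.argmax(-1) == q_logits.argmax(-1)).float().mean().item()

    # Cosine similarity between the raw logit vectors.
    cos = F.cosine_similarity(ref_logits, q_logits, dim=-1).mean().item()

    # KL(fp16 || quantized) over the next-token distributions.
    kl = F.kl_div(
        F.log_softmax(q_logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    ).item()

    return {"top1_agreement": top1, "logit_cosine": cos, "kl_divergence": kl}

In practice the metrics would be averaged over the fixed prompt set; comparing logits at every position (not just the last) gives an even finer-grained signal at slightly higher cost.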

4. Basic Performance Metrics (Optional)

  • Prefill latency (with image input)
  • Decode throughput (tokens/sec)
  • Peak memory usage
  • Relative speedup vs FP16
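
A rough sketch of how these numbers could be collected for a Hugging Face-style model on GPU; the exact hooks depend on the target runtime, so treat the names here (measure_generation, inputs) as illustrative only.

import time
import torch

@torch.no_grad()
def measure_generation(model, inputs, max_new_tokens=64):
    """Rough prefill latency, decode throughput, and peak memory for one multimodal prompt."""
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()

    # Prefill: one forward pass over the full prompt (image + text tokens).
    t0 = time.perf_counter()
    model(**inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    prefill_s = time.perf_counter() - t0

    # Decode: greedy generation; throughput approximated as new tokens / total time.
    t0 = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    total_s = time.perf_counter() - t0
    new_tokens = out.shape[-1] - inputs["input_ids"].shape[-1]

    peak_gb = torch.cuda.max_memory_allocated() / 1e9 if torch.cuda.is_available() else float("nan")
    return {
        "prefill_latency_s": prefill_s,
        "decode_tokens_per_s": new_tokens / total_s,
        "peak_memory_gb": peak_gb,
    }

Note that this folds the prefill into the decode timing; a strict separation would time the incremental decode loop inside the target runtime. The relative speedup is simply the ratio of the FP16 and quantized numbers.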

Stage 2: Final Evaluation (Validation / Reporting)

  • Produce trustworthy, reportable metrics
  • Confirm that quantization preserves end-to-end multimodal quality
  • Answer the question: “Is this model ready for release or comparison?”

Scope

  • Official datasets and splits
  • Standardized evaluation protocols
  • Absolute metrics suitable for documentation and comparison

Methods

1. Multimodal Benchmarks (Primary)

  • Datasets:
    • VQAv2
    • TextVQA
    • DocVQA (or equivalent)
    • OK-VQA (optional)
  • Protocol:
    • official splits
    • fixed prompting
    • deterministic decoding (temperature = 0); see the sketch after this list
  • Metrics:
    • official accuracy (preferred)
    • relative drop vs FP16 baseline
    • category-level breakdowns when available
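
A hedged sketch of the deterministic decoding setup, using the processor's chat template as in the Qwen-VL model cards; the exact prompt wording and dataset loading would follow each benchmark's official protocol, and image may be a PIL image or URL depending on the processor version.

import torch

@torch.no_grad()
def answer_one(model, processor, image, question, max_new_tokens=32):
    """Deterministic (greedy, temperature-free) single-answer generation for benchmark runs."""
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                # Fixed prompt that forces an answer-only response.
                {"type": "text", "text": f"{question}\nAnswer with a single word or phrase."},
            ],
        }
    ]
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    new_tokens = out[:, inputs["input_ids"].shape[-1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()

The relative drop is then just (fp16_accuracy - quantized_accuracy) / fp16_accuracy computed over the official split, reported per benchmark and, when available, per category.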

2. Language-only Evaluation (Secondary)

  • Use LM Eval Harness (lm-eval) for standardized comparison, as sketched below
  • Suggested tasks:
    • MMLU
    • HellaSwag
    • ARC-Challenge
  • Few-shot: 0 or low
  • Prefer log-likelihood mode over generation
  • Goal: ensure language decoder quality is preserved
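
A sketch of the language-only check through lm-eval's Python entry point (lm_eval.simple_evaluate). Whether the quantized Qwen-VL decoder can be loaded via the stock "hf" adapter or needs a thin custom wrapper is an open question; the snippet is mainly meant to pin down the tasks, few-shot setting, and log-likelihood mode.

import lm_eval

# MMLU / HellaSwag / ARC-Challenge are evaluated via log-likelihood, so no generation is involved.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen3-VL-4B-Instruct,dtype=float16",  # or the quantized checkpoint path
    tasks=["mmlu", "hellaswag", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])

The FP16 and quantized runs should use the same harness version and task configs so the scores are directly comparable.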
