feat: vlm bench warehouse tasks #668

MagdalenaKotynia · 2025-08-12T08:38:36Z

Purpose

Extend the VLM benchmark with images from the warehouse simulation and with additional task types.

Proposed Changes

Added Multiple Choice tasks and Quantity tasks
Added more tasks with images from warehouse simulation
Added generation of results summaries:
- Per task for all repeats within a model (tasks_summary.csv)
- Per model for all repeats and all tasks (model_summary.csv)
- For all models (benchmark_summary.csv)
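The three summary levels listed above can be sketched roughly as follows. This is only an illustration of the aggregation idea, not the code in this PR: the record fields (`task_id`, `score`, `time`) and the helper names `summarize_model` and `summarize_benchmark` are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def summarize_model(runs):
    """Summarize one model's runs per task and overall.

    `runs` is a non-empty list of dicts, one per (task, repeat), with the
    hypothetical keys "task_id", "score" (0.0 to 1.0) and "time" (seconds).
    """
    # Group all repeats of the same task together by task id.
    by_task = defaultdict(list)
    for run in runs:
        by_task[run["task_id"]].append(run)

    # Per task for all repeats within a model (tasks_summary.csv level).
    tasks_summary = [
        {
            "task_id": task_id,
            "count": len(rows),
            "success_rate": mean(r["score"] for r in rows),
            "avg_time": mean(r["time"] for r in rows),
        }
        for task_id, rows in sorted(by_task.items())
    ]

    # Per model for all repeats and all tasks (model_summary.csv level).
    model_summary = {
        "total_tasks": len(runs),
        "success_rate": mean(r["score"] for r in runs),
        "avg_time": mean(r["time"] for r in runs),
    }
    return tasks_summary, model_summary


def summarize_benchmark(runs_by_model):
    """Collect one model-level row per model (benchmark_summary.csv level)."""
    rows = []
    for model_name, runs in sorted(runs_by_model.items()):
        _, model_summary = summarize_model(runs)
        rows.append({"model_name": model_name, **model_summary})
    return rows
```

Grouping by task id rather than by position keeps repeats of the same task together even if tasks are reordered between runs.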

Testing

If you want to use Langfuse tracing, first run
export LANGFUSE_MAX_EVENT_SIZE_BYTES=20000000
because some tasks take up more than 1 MB as a single tracing item in Langfuse.

To test a single model:

cd rai
source setup_shell.sh
python src/rai_bench/rai_bench/examples/vlm_benchmark.py --model-name gemma3:4b --vendor ollama

To test many models:

from rai_bench import (
    VLMBenchmarkConfig,
    test_models,
)

if __name__ == "__main__":
    # Define models you want to benchmark
    model_names = ["gpt-4o", "gpt-4o-mini", "gemma3:4b", "gemma3:12b", "llava:7b", "llava:13b", "minicpm-v", "llama3.2-vision:11b", "llava-llama3:8b", "qwen2.5vl:3b", "qwen2.5vl:7b", "moondream:1.8b", "granite3.2-vision", "bakllava:7b", "llava-phi3:3.8b"]
    vendors = ["openai", "openai", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama", "ollama"]

    vlm_bench_conf = VLMBenchmarkConfig(repeats=3)

    out_dir = "src/rai_bench/rai_bench/experiments"
    test_models(
        model_names=model_names,
        vendors=vendors,
        benchmark_configs=[vlm_bench_conf],
        out_dir=out_dir,
    )
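Because `model_names` and `vendors` are parallel lists, an entry added to one but not the other would silently misalign them. One possible way to guard against that is to declare the pairs together and split them just before the call. The helper `split_pairs` below is hypothetical and not part of `rai_bench`; only a few of the models from the script above are shown.

```python
# Each entry keeps a model next to its vendor, so the two lists cannot drift apart.
MODELS = [
    ("gpt-4o", "openai"),
    ("gpt-4o-mini", "openai"),
    ("gemma3:4b", "ollama"),
    ("qwen2.5vl:7b", "ollama"),
]


def split_pairs(pairs):
    """Return (model_names, vendors) as two parallel lists of equal length."""
    if not pairs:
        return [], []
    names, vendors = zip(*pairs)
    return list(names), list(vendors)


model_names, vendors = split_pairs(MODELS)
# These two lists can then be passed to test_models(model_names=..., vendors=...)
# exactly as in the script above.
```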

Results

Results were collected from VLM models available through ollama that are smaller than 14b, with gpt-4o as a reference. Results are below (count is the number of tries per task; bakllava:7b has fewer total_tasks because it gets stuck on some tasks. I ran it 3 times and the issue occurred every time.)

merged_results_summary.csv

Results summary:

qwen2.5vl:7b has the best average success rate, but also a high average latency (avg_time)
qwen2.5vl:3b has, in my opinion, the best trade-off between latency and success rate
minicpm-v:8b and gemma3:4b also offer a fairly good trade-off
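One way to make the latency versus success-rate trade-off less subjective is to keep only the models that no other model beats on both metrics (a Pareto front). The sketch below assumes rows with `success_rate` and `avg_time` keys similar to the summary columns; the helper `pareto_front` is hypothetical, and the values are placeholders for illustration only, not benchmark results.

```python
def pareto_front(rows):
    """Return the rows that are not dominated by any other row.

    A row dominates another when its success rate is at least as high and its
    avg_time at least as low, with at least one of the two strictly better.
    """
    front = []
    for a in rows:
        dominated = any(
            b["success_rate"] >= a["success_rate"]
            and b["avg_time"] <= a["avg_time"]
            and (
                b["success_rate"] > a["success_rate"]
                or b["avg_time"] < a["avg_time"]
            )
            for b in rows
        )
        if not dominated:
            front.append(a)
    return front


# Placeholder values for illustration only, NOT results from this benchmark.
example_rows = [
    {"model": "model_a", "success_rate": 0.8, "avg_time": 10.0},
    {"model": "model_b", "success_rate": 0.7, "avg_time": 3.0},
    {"model": "model_c", "success_rate": 0.6, "avg_time": 5.0},
]
front_models = [row["model"] for row in pareto_front(example_rows)]
```

In this placeholder data, `model_c` drops out because `model_b` is both more accurate and faster, while `model_a` and `model_b` remain as two different trade-off points.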


MagdalenaKotynia marked this pull request as ready for review August 12, 2025 08:41

MagdalenaKotynia added 15 commits August 18, 2025 15:51

feat: interfaces for quantity and multiple choice tasks

e3fdd47

feat: created new tasks with warehouse simulation images

143e7c3

feat: added vlm bench config to test many models

e5652c0

chore: moved old image files from vlm benchmark to lfs

2b2e360

refactor: extracted common logic for vlm tasks inputs and answers to base ImageReasoningTaskInput and ImageReasoningAnswer classes

d566994

docs: added info about increasing event size for Langfuse tracing of vlm benchmark

02b459b

feat: added merging results summaries across all repeats and all models

466ec59

fix: fixed typing after refactor extracting common logic for vlm tasks inputs and answers

1adc3d4

refactor: reorganized tasks order by images

f791e85

chore: removed unused comment

fe1b5a8

refactor: created separate structure for storing summary for the model across all tasks and repeats

90d163b

fix: fixed validation of parsed llm output

01179b6

refactor: added task id, refactor of storing task_input variables

30c6b6a

fix: aggregate tasks repeats results by the task id

084b484

refactor: renamed csv storing model summary

8e3c01c

MagdalenaKotynia force-pushed the mk/feat/vlm-bench-warehouse-tasks branch from a002700 to 8e3c01c Compare August 18, 2025 13:51

feat: added std success rate and std time for model across all repeats

5417762

MagdalenaKotynia requested a review from maciejmajek August 18, 2025 14:11
