EgoBlind — VLM Evaluation on the EgoBlind Benchmark

This repository contains our evaluation of vision-language models on the EgoBlind benchmark, a dataset of egocentric (first-person) video questions designed to assess how well models can assist visually impaired individuals.

Dataset

test_half_release.csv contains 1,283 questions across six categories:

| Category | Count |
| --- | ---: |
| Information reading | 573 |
| Safety warnings | 305 |
| Navigation | 196 |
| Other resources | 99 |
| Tool use | 70 |
| Social communication | 37 |

Each question is paired with a video name, a timestamp, and up to four reference answers.

Pipeline

1. Prediction (run_openai.py)

Extracts a single frame from each video at the annotated timestamp and sends it along with the question to an OpenAI vision model.

python run_openai.py \
  --model gpt-4o \
  --output gpt4o.predictions.jsonl

Output: a JSONL file with one {"question_id", "pred"} object per line.

2. Evaluation (eval.py)

Uses GPT-4o-mini to judge whether each prediction meaningfully matches any of the reference answers, producing a yes/no verdict and a score (0–5).

python eval.py \
  --pred_path gpt4o.predictions.jsonl \
  --test_path test_half_release.csv

Output:

  • result_<model>.json — per-question eval results
  • metrics_<model>.json — per-category and overall accuracy / average score
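The aggregation behind metrics_<model>.json can be sketched as follows; the per-question keys ("category", "verdict", "score") are assumptions about the result schema, not confirmed field names from eval.py:

```python
from collections import defaultdict

def aggregate(results):
    """Compute per-category and overall accuracy / average score.

    `results` is a list of dicts with keys "category", "verdict"
    ("yes"/"no"), and "score" (0-5) -- key names are assumptions.
    """
    buckets = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r)
    buckets["overall"] = list(results)

    metrics = {}
    for cat, rows in buckets.items():
        n = len(rows)
        metrics[cat] = {
            # accuracy: fraction of "yes" verdicts from the judge model
            "accuracy": sum(r["verdict"] == "yes" for r in rows) / n,
            "avg_score": sum(r["score"] for r in rows) / n,
        }
    return metrics
```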

3. Report (build_report.py)

Extracts frames into frames/, then generates a static HTML report (report.html) with image previews, questions, reference answers, side-by-side model predictions, and evaluation verdicts.

python build_report.py                  # extract frames + build HTML
python build_report.py --skip-frames    # rebuild HTML only

Open report.html in a browser to explore the results.

Models evaluated

| Model | Accuracy | Avg Score (0–5) |
| --- | ---: | ---: |
| gpt-4o | 49.8 % | 2.82 |
| gpt-5.2-chat-latest | 54.6 % | 3.01 |

See report.html for the full per-question breakdown.

Repository structure

test_half_release.csv               # questions + reference answers
run_openai.py                       # prediction script
eval.py                             # evaluation script
build_report.py                     # frame extraction + HTML report
gpt4o.predictions.jsonl             # GPT-4o predictions
gpt-5_2-chat-latest.predictions.jsonl  # GPT-5.2 predictions
result_gpt4o.json                   # GPT-4o eval results
result_gpt-5_2-chat-latest.json     # GPT-5.2 eval results
metrics_gpt4o.json                  # GPT-4o aggregate metrics
metrics_gpt-5_2-chat-latest.json    # GPT-5.2 aggregate metrics
frames/                             # extracted video frames (JPEG)
report.html                         # static comparison report
merged_splits/                      # source videos (not committed; see the original EgoBlind repo)
