This repository contains our evaluation of vision-language models on the EgoBlind benchmark, a dataset of egocentric (first-person) video questions designed to assess how well models can assist visually impaired individuals.
`test_half_release.csv` contains 1,280 questions across six categories:
| Category | Count |
|---|---|
| Information reading | 573 |
| Safety warnings | 305 |
| Navigation | 196 |
| Other resources | 99 |
| Tool use | 70 |
| Social communication | 37 |
Each question is paired with a video name, a timestamp, and up to four reference answers.
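The CSV can be loaded and tallied per category with the standard library. This is a minimal sketch on an inline stand-in for the file; the real column names in `test_half_release.csv` may differ, so check the header before relying on them:

```python
import csv
import io
from collections import Counter

# Tiny stand-in for test_half_release.csv; column names are illustrative.
sample = """question_id,category,video,timestamp,question
q1,Navigation,vid_001.mp4,12.5,Which way is the exit?
q2,Safety warnings,vid_002.mp4,3.0,Is it safe to cross?
q3,Navigation,vid_003.mp4,40.2,Where is the door?
"""

rows = list(csv.DictReader(io.StringIO(sample)))
counts = Counter(r["category"] for r in rows)
print(counts["Navigation"])  # 2
```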
`run_openai.py` extracts a single frame from each video at the annotated timestamp and sends it, along with the question, to an OpenAI vision model.
```
python run_openai.py \
  --model gpt-4o \
  --output gpt4o.predictions.jsonl
```

Output: one JSONL file with `{"question_id", "pred"}` per line.
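The request the script sends for each question looks roughly like the sketch below, which builds a Chat Completions payload with the extracted frame inlined as a base64 data URL. The frame itself could be pulled with something like `ffmpeg -ss <timestamp> -i <video> -frames:v 1 frame.jpg`; the function name and prompt wording here are illustrative, not the script's actual code:

```python
import base64


def build_vision_request(question: str, jpeg_bytes: bytes,
                         model: str = "gpt-4o") -> dict:
    """Build a Chat Completions payload with an inline base64 JPEG.

    Illustrative sketch of what run_openai.py presumably sends; the real
    script may structure its prompt differently.
    """
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }


req = build_vision_request("What does the sign say?", b"\xff\xd8fake-jpeg")
print(req["model"])
```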
`eval.py` uses GPT-4o-mini to judge whether each prediction meaningfully matches any of the reference answers, producing a yes/no verdict and a score (0–5).
```
python eval.py \
  --pred_path gpt4o.predictions.jsonl \
  --test_path test_half_release.csv
```

Output:

- `result_<model>.json` — per-question eval results
- `metrics_<model>.json` — per-category and overall accuracy / average score
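An LLM judge reply typically needs to be parsed back into a verdict and a score. A minimal sketch, assuming the judge is asked to reply in JSON (the prompt template and parsing here are hypothetical, not taken from `eval.py`):

```python
import json
import re

# Hypothetical judge prompt; eval.py's actual wording may differ.
JUDGE_PROMPT = """You are grading an answer for a visually impaired user.
Question: {question}
Reference answers: {refs}
Prediction: {pred}
Reply as JSON: {{"yes_no": "yes" or "no", "score": 0-5}}"""


def parse_verdict(reply: str) -> tuple[bool, int]:
    """Extract the judge's JSON object, tolerating surrounding prose."""
    m = re.search(r"\{.*\}", reply, re.DOTALL)
    obj = json.loads(m.group(0))
    return obj["yes_no"].lower() == "yes", int(obj["score"])


prompt = JUDGE_PROMPT.format(question="Which way?", refs="left", pred="go left")
ok, score = parse_verdict('Sure! {"yes_no": "yes", "score": 4}')
print(ok, score)  # True 4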
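An LLM judge reply typically needs to be parsed back into a verdict and a score. A minimal sketch, assuming the judge is asked to reply in JSON (the prompt template and parsing here are hypothetical, not taken from `eval.py`):

```python
import json
import re

# Hypothetical judge prompt; eval.py's actual wording may differ.
JUDGE_PROMPT = """You are grading an answer for a visually impaired user.
Question: {question}
Reference answers: {refs}
Prediction: {pred}
Reply as JSON: {{"yes_no": "yes" or "no", "score": 0-5}}"""


def parse_verdict(reply: str) -> tuple[bool, int]:
    """Extract the judge's JSON object, tolerating surrounding prose."""
    m = re.search(r"\{.*\}", reply, re.DOTALL)
    obj = json.loads(m.group(0))
    return obj["yes_no"].lower() == "yes", int(obj["score"])


prompt = JUDGE_PROMPT.format(question="Which way?", refs="left", pred="go left")
ok, score = parse_verdict('Sure! {"yes_no": "yes", "score": 4}')
print(ok, score)  # True 4
```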
`build_report.py` extracts frames into `frames/`, then generates a static HTML report (`report.html`) showing image previews, questions, reference answers, and model predictions side by side, together with the evaluation verdicts.
```
python build_report.py               # extract frames + build HTML
python build_report.py --skip-frames # rebuild HTML only
```

Open `report.html` in a browser to explore the results.
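The report body is essentially one table row per question. A minimal sketch of how such a row could be rendered (the function name and column layout are illustrative, not `build_report.py`'s actual code):

```python
import html


def render_row(frame_path: str, question: str,
               refs: list[str], preds: list[str]) -> str:
    """Render one comparison row: frame preview, question, references,
    then one cell per model prediction. Layout is illustrative."""
    pred_cells = "".join(f"<td>{html.escape(p)}</td>" for p in preds)
    return (
        f"<tr><td><img src='{html.escape(frame_path)}' width='240'></td>"
        f"<td>{html.escape(question)}</td>"
        f"<td>{html.escape('; '.join(refs))}</td>"
        f"{pred_cells}</tr>"
    )


row = render_row("frames/q1.jpg", "Which way?", ["left"],
                 ["go left", "turn left"])
print(row.count("<td>"))  # 5
```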
| Model | Accuracy | Avg Score (0–5) |
|---|---|---|
| gpt-4o | 49.8 % | 2.82 |
| gpt-5.2-chat-latest | 54.6 % | 3.01 |
See report.html for the full per-question breakdown.
```
test_half_release.csv                   # questions + reference answers
run_openai.py                           # prediction script
eval.py                                 # evaluation script
build_report.py                         # frame extraction + HTML report
gpt4o.predictions.jsonl                 # GPT-4o predictions
gpt-5_2-chat-latest.predictions.jsonl   # GPT-5.2 predictions
result_gpt4o.json                       # GPT-4o eval results
result_gpt-5_2-chat-latest.json         # GPT-5.2 eval results
metrics_gpt4o.json                      # GPT-4o aggregate metrics
metrics_gpt-5_2-chat-latest.json        # GPT-5.2 aggregate metrics
frames/                                 # extracted video frames (JPEG)
report.html                             # static comparison report
merged_splits/                          # source videos (not committed, see original repo for source)
```