This repository contains our evaluation of vision-language models on the EgoBlind benchmark, a dataset of egocentric (first-person) video questions designed to assess how well models can assist visually impaired individuals.
`test_half_release.csv` contains 1,280 questions across six categories:
| Category | Count |
|---|---|
| Information reading | 573 |
| Safety warnings | 305 |
| Navigation | 196 |
| Other resources | 99 |
| Tool use | 70 |
| Social communication | 37 |
Each question is paired with a video name, a timestamp, and up to four reference answers.
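The CSV can be loaded and tallied per category with the standard library. This is a minimal sketch on an inline stand-in for the file; the real column names in `test_half_release.csv` may differ, so check the header before relying on them:

```python
import csv
import io
from collections import Counter

# Tiny stand-in for test_half_release.csv; column names are illustrative.
sample = """question_id,category,video,timestamp,question
q1,Navigation,vid_001.mp4,12.5,Which way is the exit?
q2,Safety warnings,vid_002.mp4,3.0,Is it safe to cross?
q3,Navigation,vid_003.mp4,40.2,Where is the door?
"""

rows = list(csv.DictReader(io.StringIO(sample)))
counts = Counter(r["category"] for r in rows)
print(counts["Navigation"])  # 2
```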
`run_openai.py` extracts a single frame from each video at the annotated timestamp and sends it, along with the question, to an OpenAI vision model.
```
python run_openai.py \
  --model gpt-4o \
  --output gpt4o.predictions.jsonl
```

Output: one JSONL file with `{"question_id", "pred"}` per line.
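The request the script sends for each question looks roughly like the sketch below, which builds a Chat Completions payload with the extracted frame inlined as a base64 data URL. The frame itself could be pulled with something like `ffmpeg -ss <timestamp> -i <video> -frames:v 1 frame.jpg`; the function name and prompt wording here are illustrative, not the script's actual code:

```python
import base64


def build_vision_request(question: str, jpeg_bytes: bytes,
                         model: str = "gpt-4o") -> dict:
    """Build a Chat Completions payload with an inline base64 JPEG.

    Illustrative sketch of what run_openai.py presumably sends; the real
    script may structure its prompt differently.
    """
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }


req = build_vision_request("What does the sign say?", b"\xff\xd8fake-jpeg")
print(req["model"])
```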
`eval.py` uses GPT-4o-mini to judge whether each prediction meaningfully matches any of the reference answers, producing a yes/no verdict and a score (0–5).
```
python eval.py \
  --pred_path gpt4o.predictions.jsonl \
  --test_path test_half_release.csv
```

Output:

- `result_<model>.json` — per-question eval results
- `metrics_<model>.json` — per-category and overall accuracy / average score
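An LLM judge reply typically needs to be parsed back into a verdict and a score. A minimal sketch, assuming the judge is asked to reply in JSON (the prompt template and parsing here are hypothetical, not taken from `eval.py`):

```python
import json
import re

# Hypothetical judge prompt; eval.py's actual wording may differ.
JUDGE_PROMPT = """You are grading an answer for a visually impaired user.
Question: {question}
Reference answers: {refs}
Prediction: {pred}
Reply as JSON: {{"yes_no": "yes" or "no", "score": 0-5}}"""


def parse_verdict(reply: str) -> tuple[bool, int]:
    """Extract the judge's JSON object, tolerating surrounding prose."""
    m = re.search(r"\{.*\}", reply, re.DOTALL)
    obj = json.loads(m.group(0))
    return obj["yes_no"].lower() == "yes", int(obj["score"])


prompt = JUDGE_PROMPT.format(question="Which way?", refs="left", pred="go left")
ok, score = parse_verdict('Sure! {"yes_no": "yes", "score": 4}')
print(ok, score)  # True 4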
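An LLM judge reply typically needs to be parsed back into a verdict and a score. A minimal sketch, assuming the judge is asked to reply in JSON (the prompt template and parsing here are hypothetical, not taken from `eval.py`):

```python
import json
import re

# Hypothetical judge prompt; eval.py's actual wording may differ.
JUDGE_PROMPT = """You are grading an answer for a visually impaired user.
Question: {question}
Reference answers: {refs}
Prediction: {pred}
Reply as JSON: {{"yes_no": "yes" or "no", "score": 0-5}}"""


def parse_verdict(reply: str) -> tuple[bool, int]:
    """Extract the judge's JSON object, tolerating surrounding prose."""
    m = re.search(r"\{.*\}", reply, re.DOTALL)
    obj = json.loads(m.group(0))
    return obj["yes_no"].lower() == "yes", int(obj["score"])


prompt = JUDGE_PROMPT.format(question="Which way?", refs="left", pred="go left")
ok, score = parse_verdict('Sure! {"yes_no": "yes", "score": 4}')
print(ok, score)  # True 4
```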
`build_report.py` extracts frames into `frames/`, then generates a static HTML report (`report.html`) showing image previews, questions, reference answers, and model predictions side by side, together with the evaluation verdicts.
```
python build_report.py               # extract frames + build HTML
python build_report.py --skip-frames # rebuild HTML only
```

Open `report.html` in a browser to explore the results.
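The report body is essentially one table row per question. A minimal sketch of how such a row could be rendered (the function name and column layout are illustrative, not `build_report.py`'s actual code):

```python
import html


def render_row(frame_path: str, question: str,
               refs: list[str], preds: list[str]) -> str:
    """Render one comparison row: frame preview, question, references,
    then one cell per model prediction. Layout is illustrative."""
    pred_cells = "".join(f"<td>{html.escape(p)}</td>" for p in preds)
    return (
        f"<tr><td><img src='{html.escape(frame_path)}' width='240'></td>"
        f"<td>{html.escape(question)}</td>"
        f"<td>{html.escape('; '.join(refs))}</td>"
        f"{pred_cells}</tr>"
    )


row = render_row("frames/q1.jpg", "Which way?", ["left"],
                 ["go left", "turn left"])
print(row.count("<td>"))  # 5
```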
| Model | Accuracy | Avg Score (0–5) |
|---|---|---|
| gpt-4o | 49.8 % | 2.82 |
| gpt-5.2-chat-latest | 54.6 % | 3.01 |
See report.html for the full per-question breakdown.
```
test_half_release.csv                   # questions + reference answers
run_openai.py                           # prediction script
eval.py                                 # evaluation script
build_report.py                         # frame extraction + HTML report
gpt4o.predictions.jsonl                 # GPT-4o predictions
gpt-5_2-chat-latest.predictions.jsonl   # GPT-5.2 predictions
result_gpt4o.json                       # GPT-4o eval results
result_gpt-5_2-chat-latest.json         # GPT-5.2 eval results
metrics_gpt4o.json                      # GPT-4o aggregate metrics
metrics_gpt-5_2-chat-latest.json        # GPT-5.2 aggregate metrics
frames/                                 # extracted video frames (JPEG)
report.html                             # static comparison report
merged_splits/                          # source videos (not committed, see original repo for source)
```