
feat: add DeepEval quality gate for question quality #44

Open

clawdinhobot wants to merge 4 commits into main from clawdinho/issue-37-deepeval-evals

Conversation

@clawdinhobot
Collaborator

Summary

Add DeepEval GEval quality gate that runs LLM-as-judge evaluations against question quality criteria from CONTEXT.md. CI blocks PRs touching data/samples.json if evals fail.

Changes

  • evals/test_question_quality.py — 7 GEval metrics (STEM_LEAK, ROLE_COHERENCE, SINGLE_CORRECT, IMAGE_DEPENDENCY, IMAGE_CLARITY, BJJ_CORRECTNESS, FORMAT_COMPLIANCE)
    • Custom OpenRouterModel wrapper: google/gemma-4-31b-it via OPENROUTER_API_KEY
    • Evaluates first 3 questions from data/samples.json per criterion
  • .github/workflows/quality-gate.yml — triggers on PRs touching data/samples.json, runs deepeval test run evals/
  • pyproject.toml — added deepeval>=2.6.0 to [dependency-groups] evals
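The custom wrapper described above ultimately talks to OpenRouter's OpenAI-compatible chat completions endpoint. As a minimal, stdlib-only sketch of that call shape (not the PR's actual `DeepEvalBaseLLM` subclass; `build_judge_request` and `call_judge` are hypothetical names introduced here):

```python
import json
import os
import urllib.request

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"


def build_judge_request(model: str, prompt: str) -> dict:
    """Payload for OpenRouter's OpenAI-compatible chat completions API."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic judging
    }


def call_judge(prompt: str, model: str = "google/gemma-4-31b-it") -> str:
    """Send one LLM-as-judge prompt; requires OPENROUTER_API_KEY to be set."""
    payload = build_judge_request(model, prompt)
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The CI gate failing fast when `OPENROUTER_API_KEY` is unset falls out naturally here: the `os.environ[...]` lookup raises `KeyError` before any request is made.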

Acceptance criteria

  • uv sync --group evals installs deepeval
  • evals/test_question_quality.py has GEval metrics for all 7 criteria
  • Judge model: openrouter/google/gemma-4-31b-it via OPENROUTER_API_KEY
  • deepeval test run evals/ runs against questions from data/samples.json
  • CI fails if OPENROUTER_API_KEY is not set
  • .github/workflows/quality-gate.yml triggers on PRs touching data/samples.json
  • Existing 28 tests still pass (uv run pytest -x)

Closes #37

@clawdinhobot
Collaborator Author

Fixed the D104 ruff error (the generate package stub was missing a docstring).

Note: the validate CI job fails because ty check has 2 pre-existing errors on main (src/bjj_vqa/task.py:30,40). This branch introduces no new ty errors; it has the same 2 as main.

@clawdinhobot
Collaborator Author

Opened PR #46 to fix the pre-existing ty errors on main. Once merged, validate CI should pass on this PR too.

Owner


I want you to call it evals.yml.

Comment thread .github/workflows/quality-gate.yml Outdated
enable-cache: true

- name: Set up Python
run: uv python install 3.13
Owner


I don't think this is necessary. Check whether there is a Python version file in this project; you can use it in the uv setup step.

Comment thread evals/test_question_quality.py Outdated
# ---------------------------------------------------------------------------


class OpenRouterModel(DeepEvalBaseLLM):
Owner


Doesn't DeepEval integrate directly with OpenRouter? OpenRouter uses the OpenAI API schema, so it feels to me that it could be as simple as passing the OpenRouter API key as the OpenAI key with a custom base URL. Do your research.

Collaborator Author


You're right — DeepEval has built-in OpenRouterModel in deepeval.models. Switched from the custom wrapper to using it directly. It takes model, api_key, and base_url params. OpenRouter's OpenAI-compatible endpoint means it works out of the box.
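Based on this thread's description of the built-in wrapper, a hedged sketch of the switch; `judge_kwargs` and `make_judge` are hypothetical helper names introduced here, and the `base_url` value assumes OpenRouter's documented API root:

```python
import os

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"  # assumed API root


def judge_kwargs(model: str, api_key: str) -> dict:
    """Arguments for DeepEval's built-in OpenRouter wrapper, per this thread."""
    return {"model": model, "api_key": api_key, "base_url": OPENROUTER_BASE_URL}


def make_judge(model: str = "google/gemma-4-31b-it"):
    # Deferred import so the sketch degrades gracefully without deepeval installed.
    from deepeval.models import OpenRouterModel

    return OpenRouterModel(**judge_kwargs(model, os.environ["OPENROUTER_API_KEY"]))
```

The returned model can then be passed as the `model` argument when constructing each GEval metric.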

Comment thread evals/test_question_quality.py Outdated
)


def _make_metric(criterion_name: str) -> GEval:
Owner


You are overly complicating things, you could use pytest fixtures here.

Comment thread evals/test_question_quality.py Outdated
Owner


Call it test_questions.py.

Also, you should be using pytest parametrize to apply all tests to all questions.

Comment thread pyproject.toml
"T201", # print-used - scripts output to stdout
"PLR2004", # magic value comparison - acceptable for simple scripts
]
"evals/**/*.py" = [
Owner


I really dislike that you tend to add ignores so often; try not to ignore them.

Comment thread src/bjj_vqa/task.py Outdated
input_content.append(ContentText(text=f"Image {letter}:"))
if images:
input_content.append(ContentImage(image=str(data_dir / path)))
input_content.append(ContentImage(image=str(data_dir) + "/" + str(path)))
Owner


This fix is being dealt with in another PR, please undo this and let's leave this to the other PR.

Comment thread src/bjj_vqa/task.py
return Sample(
id=record["id"],
input=[ChatMessageUser(content=input_content)],
input=[ChatMessageUser(content=input_content)], # ty: ignore[invalid-argument-type]
Owner


This fix is being dealt with in another PR, please undo this and let's leave this to the other PR.

@clawdinhobot
Collaborator Author

Addressed all review comments:

1. Renamed to evals.yml

  • .github/workflows/quality-gate.yml → .github/workflows/evals.yml

2. Python version from file

  • Now uses python-version-file: ".python-version" in the setup-uv step, so uv python install reads the version from .python-version

3. DeepEval + OpenRouter

  • Switched from custom OpenRouterModel wrapper to built-in deepeval.models.OpenRouterModel. It accepts model, api_key, and base_url params. OpenRouter provides an OpenAI-compatible endpoint, so this works natively with DeepEval's built-in support.

4. Pytest fixtures

  • Replaced module-level _questions global with @pytest.fixture(scope="module") for questions, sample_questions, and judge.

5. Renamed to test_questions.py + parametrize

  • test_question_quality.py → test_questions.py
  • Single parametrized test test_question_quality with @pytest.mark.parametrize("criterion", list(CRITERIA)) evaluates all 7 criteria × 3 questions = 21 test cases in one function.

6. Removed ruff ignores

  • Reduced evals per-file-ignores from 11 rules to just 2 (INP001, E501).

7. Undid task.py fix
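The fixture-plus-parametrize shape described in points 4 and 5 can be sketched as follows. This is an illustrative reconstruction, not the PR's exact code: the PR parametrizes over the criterion only, whereas this sketch parametrizes over the criterion/question product, which yields the same 21 generated cases:

```python
import itertools

import pytest

CRITERIA = (
    "STEM_LEAK", "ROLE_COHERENCE", "SINGLE_CORRECT", "IMAGE_DEPENDENCY",
    "IMAGE_CLARITY", "BJJ_CORRECTNESS", "FORMAT_COMPLIANCE",
)
# 7 criteria x 3 questions = 21 generated test cases.
CASES = list(itertools.product(CRITERIA, range(3)))


@pytest.mark.parametrize(("criterion", "question_idx"), CASES)
def test_question_quality(criterion: str, question_idx: int) -> None:
    # In the real module this would build the GEval metric for `criterion`
    # and judge sample_questions[question_idx] with the OpenRouter model.
    ...
```

Each case then shows up individually in the deepeval/pytest report, e.g. `test_question_quality[STEM_LEAK-0]`.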

@matheusccouto
Owner

I merged #46, please rebase.

clawdinhobot and others added 3 commits May 3, 2026 19:51
Create evals/test_question_quality.py with 7 GEval metrics:
STEM_LEAK, ROLE_COHERENCE, SINGLE_CORRECT, IMAGE_DEPENDENCY,
IMAGE_CLARITY, BJJ_CORRECTNESS, FORMAT_COMPLIANCE.

- Custom OpenRouterModel wrapper uses openrouter.ai API
  (google/gemma-4-31b-it via OPENROUTER_API_KEY)
- Evaluates first 3 questions from data/samples.json per criterion
- deepeval test run evals/ runs the quality gate
- CI workflow fails if OPENROUTER_API_KEY is not set
- Existing 28 pytest tests still pass
- evals/ excluded from pytest via testpaths config

Closes #37
@clawdinhobot force-pushed the clawdinho/issue-37-deepeval-evals branch from 2211d20 to 93ad0d5 on May 3, 2026 at 22:52
@clawdinhobot
Collaborator Author

Reverted the task.py changes — those fixes belong in the separate PR as you mentioned.



Development

Successfully merging this pull request may close these issues.

DeepEval evals/ framework + CI quality gate
