feat: add DeepEval quality gate for question quality#44
feat: add DeepEval quality gate for question quality#44clawdinhobot wants to merge 4 commits intomainfrom
Conversation
|
Fixed the D104 ruff error (generate package stub missing docstring). \n\nNote: validate CI fails because |
|
Opened PR #46 to fix the pre-existing ty errors on main. Once merged, validate CI should pass on this PR too. |
There was a problem hiding this comment.
I want you to call us evals.yml
| enable-cache: true | ||
|
|
||
| - name: Set up Python | ||
| run: uv python install 3.13 |
There was a problem hiding this comment.
I don't think this is necessary, check there is a python version file in this project, you can use this in the uv setup step.
| # --------------------------------------------------------------------------- | ||
|
|
||
|
|
||
| class OpenRouterModel(DeepEvalBaseLLM): |
There was a problem hiding this comment.
Doesn't DeepEval integrate directly with OpenRouter? OpenRouter uses a OpenAI API Schemas, so it feels to me that it could be passing the OpenRouter API KEY as OpenAI and a custom URL, do your research.
There was a problem hiding this comment.
You're right — DeepEval has built-in OpenRouterModel in deepeval.models. Switched from the custom wrapper to using it directly. It takes model, api_key, and base_url params. OpenRouter's OpenAI-compatible endpoint means it works out of the box.
| ) | ||
|
|
||
|
|
||
| def _make_metric(criterion_name: str) -> GEval: |
There was a problem hiding this comment.
You are overly complicating things, you could use pytest fixtures here.
There was a problem hiding this comment.
Call it test_questions.py.
Also, you should be using pytest parametrize to apply all tests to ask questions.
| "T201", # print-used - scripts output to stdout | ||
| "PLR2004",# magic value comparison - acceptable for simple scripts | ||
| ] | ||
| "evals/**/*.py" = [ |
There was a problem hiding this comment.
I really dislike that you tend to ignore a lot of times, try to not ignore them
| input_content.append(ContentText(text=f"Image {letter}:")) | ||
| if images: | ||
| input_content.append(ContentImage(image=str(data_dir / path))) | ||
| input_content.append(ContentImage(image=str(data_dir) + "/" + str(path))) |
There was a problem hiding this comment.
This fix is being dealt with in another PR, please undo this and let's leave this to the other PR.
| return Sample( | ||
| id=record["id"], | ||
| input=[ChatMessageUser(content=input_content)], | ||
| input=[ChatMessageUser(content=input_content)], # ty: ignore[invalid-argument-type] |
There was a problem hiding this comment.
This fix is being dealt with in another PR, please undo this and let's leave this to the other PR.
|
Addressed all review comments: 1. Renamed to evals.yml ✅
2. Python version from file ✅
3. DeepEval + OpenRouter ✅
4. Pytest fixtures ✅
5. Renamed to test_questions.py + parametrize ✅
6. Removed ruff ignores ✅
7. Undid task.py fix ✅
|
|
I merged #46, please rebase. |
Create evals/test_question_quality.py with 7 GEval metrics: STEM_LEAK, ROLE_COHERENCE, SINGLE_CORRECT, IMAGE_DEPENDENCY, IMAGE_CLARITY, BJJ_CORRECTNESS, FORMAT_COMPLIANCE. - Custom OpenRouterModel wrapper uses openrouter.ai API (google/gemma-4-31b-it via OPENROUTER_API_KEY) - Evaluates first 3 questions from data/samples.json per criterion - deepeval test run evals/ runs the quality gate - CI workflow fails if OPENROUTER_API_KEY is not set - Existing 28 pytest tests still pass - evals/ excluded from pytest via testpaths config Closes #37
…content union" This reverts commit be86e6b.
2211d20 to
93ad0d5
Compare
|
Reverted the task.py changes — those fixes belong in the separate PR as you mentioned. |
Summary
Add DeepEval GEval quality gate that runs LLM-as-judge evaluations against question quality criteria from
CONTEXT.md. CI blocks PRs touchingdata/samples.jsonif evals fail.Changes
evals/test_question_quality.py— 7 GEval metrics (STEM_LEAK, ROLE_COHERENCE, SINGLE_CORRECT, IMAGE_DEPENDENCY, IMAGE_CLARITY, BJJ_CORRECTNESS, FORMAT_COMPLIANCE)OpenRouterModelwrapper:google/gemma-4-31b-itviaOPENROUTER_API_KEYdata/samples.jsonper criterion.github/workflows/quality-gate.yml— triggers on PRs touchingdata/samples.json, runsdeepeval test run evals/pyproject.toml— addeddeepeval>=2.6.0to[dependency-groups] evalsAcceptance criteria
uv sync --group evalsinstalls deepevalevals/test_question_quality.pyhas GEval metrics for all 7 criteriaopenrouter/google/gemma-4-31b-itviaOPENROUTER_API_KEYdeepeval test run evals/runs against questions fromdata/samples.jsonOPENROUTER_API_KEYis not set.github/workflows/quality-gate.ymltriggers on PRs touchingdata/samples.jsonuv run pytest -x)Closes #37