Feat/own sft model#38
Merged
Merged
Conversation
Add a scoped `eval:gemma` target that runs the MDMA author prompt against Gemma 4 (via OpenRouter). Model selection is comment-toggleable in promptfooconfig.gemma.yaml (26B-a4b active, 31B ready to swap in). Outputs write to the scoped evals/gemma/ directory — kept out of the root evals/results*.json gitignore so generated MDMA can be reused downstream. Baseline (gemma-4-26b-a4b-it): 28/28 cases pass the validator suite. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ation, not published) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The gemma eval suite and dataset generator are kept local (gitignored) and not published, so remove the eval:gemma* and dataset:* scripts that point at gemma/ paths. own-model and base eval scripts are unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the 5 root-level internal planning/serving docs (endpoint URLs, Modal auth scheme, budgets, troubleshooting) and scrub references to them from evals configs, prompts, and own-model README. Kept locally, not published. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…or prompt The mdma-il model reads an MDMA-IL DSL intent, so its system prompt must describe the DSL grammar; the previous variant had none. Promote the eval harness's DSL-aware authoring prompt (grammar + worked examples) to the prompt-pack variant as the single source of truth, and repoint the author/custom/conversation eval suites to import it via getAuthorPromptVariant. Delete the now-duplicated evals/own-model/authoring-system-prompt.mjs and refresh the stale 'thin prompt' wording in the README/configs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
NtTestAlert
approved these changes
Jun 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What does this PR do?
Adds first-class support for Mobile Reality's own self-hosted MDMA-IL model
(fine-tuned Gemma-4 + MDMA-IL LoRA, now public on
Hugging Face)
alongside the third-party models, plus the prompts, evals, and validator/parser
hardening that make its DSL→MDMA output reliable.
Highlights:
mobile-reality/mdma-ilauthor + agent prompts andgoogle/gemmaauthor + fixer prompts, wired into the author registry. Themobile-reality/mdma-ilauthor variant is DSL-aware (DSL input grammar +authoring rules + worked examples) and is the single source of truth (it was
previously duplicated in the eval harness).
<thinking …>,<think>, etc.) as invalid MDMA under a dedicatedhtml-tagsrule id, so aconsumer can
excludeit without silencing real YAML errors. (Our DSL modeloccasionally leaks these.)
values starting with a YAML indicator char (
unit: %,range: > 40 mg/dL)and
colon-spacevalues (label: Example: Revenue).evals/own-model/suite (private package) thatgates the hosted DSL model on the held-out scenarios, plus its assertions.
provider wiring in the agent client.
Hugging Face links.
Closes #
Type of Change
Packages Affected
@mobile-reality/mdma-spec@mobile-reality/mdma-parser@mobile-reality/mdma-runtime@mobile-reality/mdma-attachables-core@mobile-reality/mdma-renderer-react@mobile-reality/mdma-prompt-pack@mobile-reality/mdma-validator,@mobile-reality/mdma-cli,demo,evals(private)Checklist
pnpm formatandpnpm lintpass).pnpm test).pnpm typecheck).pnpm changeset) if this change affects published packages.sensitive: truewhere appropriate.How to Test
pnpm install && pnpm buildgetAuthorPromptVariant('mobile-reality/mdma-il').promptreturns the DSL-aware prompt (DSL grammar + worked examples).
<thinking>…</thinking>fails with anhtml-tagserror; excludinghtml-tagssuppresses only that error.unit: %andlabel: Example: Revenueparse tostrings instead of throwing / nesting.
OWN_MODEL_*inevals/.env, thenpnpm --filter @mobile-reality/mdma-evals eval:own-model.README.mdand verify the speed-comparison section renders thetwo GIFs side by side and the Hugging Face links resolve.
Screenshots / Examples
See the README Speed comparison section (
assets/gpt-5.5.gifvs.assets/own-model.gif).