Implement case analysis script #844

Merged: 15 commits from case-analysis into main on Aug 12, 2025

Conversation

@cbush (Collaborator) commented Jul 24, 2025

This adds the assessCases script, which analyzes prompt/expected answer pairs using two methodologies:

  1. Answer relevance: given the expected answer, generate N prompts that could elicit that answer, then compare their embeddings with the embedding of the original prompt (see the sketch below).
  2. LLM as judge: use an LLM to score the prompt/answer pair on a variety of metrics and to generate recommendations for improvement.
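
A minimal sketch of the relevance comparison in (1), assuming a generic embed() helper and simple averaging over the N generated prompts (both are assumptions; the script's actual embedding client and aggregation may differ):

```ts
// Hypothetical sketch: embed the original prompt and the N generated prompts,
// then compare them with cosine similarity. `embed` is an assumed helper.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
}

async function scoreAnswerRelevance(
  originalPrompt: string,
  generatedPrompts: string[],
  embed: (text: string) => Promise<number[]>
): Promise<number> {
  const original = await embed(originalPrompt);
  const similarities = await Promise.all(
    generatedPrompts.map(async (p) => cosineSimilarity(original, await embed(p)))
  );
  // Average similarity across the generated prompts (aggregation choice is an assumption).
  return similarities.reduce((sum, s) => sum + s, 0) / similarities.length;
}
```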

@cbush marked this pull request as ready for review July 31, 2025 00:19
@cbush changed the title from "Case analysis WIP" to "Implement case analysis script" Jul 31, 2025
@@ -0,0 +1,49 @@
import { OpenAI } from "mongodb-rag-core/openai";
Collaborator:

i recommend using the AI SDK's generateText() function for this. you won't need to make your own constructor; you can import it from mongodb-rag-core/aiSdk. there are lots of usage examples throughout the project.
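
roughly what that looks like (a sketch; assumes mongodb-rag-core/aiSdk re-exports the AI SDK's generateText and LanguageModel, and that the model instance comes from the project's existing configuration):

```ts
import { generateText } from "mongodb-rag-core/aiSdk";
import type { LanguageModel } from "mongodb-rag-core/aiSdk";

// Sketch only: the model instance and prompt plumbing would come from the
// project's existing configuration, not from this snippet.
async function generateWithAiSdk(model: LanguageModel, prompt: string): Promise<string> {
  const { text } = await generateText({ model, prompt });
  return text;
}
```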

Comment on lines 24 to 27
{
role: "system",
content: systemPrompt ?? "",
},
Collaborator:

if a system prompt is not provided, then you shouldn't include an empty system message.
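
a sketch of one way to do that (names and the message type are illustrative):

```ts
type ChatMessage = { role: "system" | "user"; content: string };

// Sketch: only include a system message when a system prompt was provided.
function buildMessages(userPrompt: string, systemPrompt?: string): ChatMessage[] {
  return [
    ...(systemPrompt ? [{ role: "system" as const, content: systemPrompt }] : []),
    { role: "user" as const, content: userPrompt },
  ];
}
```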

Given the expected answer, generate a number of possible prompts that could
elicit that expected answer.
*/
export const generatePromptsFromExpectedAnswer = async ({
Collaborator:

would be good to have lite evals for this.

not sure what metric you'd use to grade outputs, but it'd at least be good to have a few test cases that you can run this over if you change the prompt.

const shortName = makeShortName(prompt);
console.log(`Rating '${shortName}' with LLM...`);

const [response] = await generate({
Collaborator:

rather than raw text generation and telling the model to output JSON, models support structured output generation, which'd be perfect for this use case.

i recommend using the AI SDK's generateObject() function. it takes a schema from a zod type and handles output parsing.
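
a sketch under those assumptions (the schema fields and model instance are illustrative, not the PR's actual ones):

```ts
import { generateObject } from "mongodb-rag-core/aiSdk";
import type { LanguageModel } from "mongodb-rag-core/aiSdk";
import { z } from "zod";

// Illustrative schema; the real one would mirror LlmAsJudgment.
const Rating = z.object({
  clarity: z.number().int().min(1).max(5),
  fit: z.number().int().min(1).max(5),
  guidance: z.string().optional(),
});

// Sketch: structured output instead of parsing JSON out of raw text.
async function rateCase(model: LanguageModel, prompt: string): Promise<z.infer<typeof Rating>> {
  const { object } = await generateObject({ model, schema: Rating, prompt });
  return object;
}
```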

PROMPT: ${prompt}
---
EXPECTED ANSWER: ${expected}
`,
Collaborator:

Suggested change:
- `,
+ `.trim(),

};
};

export const rateWithLlm = async ({
Collaborator:

this LLM call really should have eval cases.

for input data + expectations, i recommend just running a few prompt/expectedResponse pairs through the system that fall into a few different buckets (good, bad for different reasons, terrible, etc.), then seeing what the model outputs and adjusting until the output meets your expectations.
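
a minimal .eval.ts sketch along those lines; the project name, cases, scorer, and rateWithLlm's signature/import path are all placeholders:

```ts
import { Eval } from "braintrust";
import { rateWithLlm } from "./assessCases"; // import path is assumed

// Placeholder cases spanning a few quality buckets.
Eval("case-analysis-rateWithLlm", {
  data: () => [
    {
      input: {
        prompt: "How do I create a compound index in MongoDB?",
        expected: "Use db.collection.createIndex({ a: 1, b: 1 }) ...",
      },
      metadata: { bucket: "good" },
    },
    {
      input: { prompt: "Tell me about stuff.", expected: "MongoDB is a database." },
      metadata: { bucket: "bad-vague-prompt" },
    },
    {
      input: { prompt: "When was MongoDB founded?", expected: "2007." },
      metadata: { bucket: "low-impact" },
    },
  ],
  task: async ({ prompt, expected }) => rateWithLlm({ prompt, expected }),
  scores: [
    // Placeholder scorer: just checks that a rating object came back at all.
    ({ output }) => ({ name: "returnsRating", score: output ? 1 : 0 }),
  ],
});
```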

Collaborator:

also, what model are you planning on using for it?

i recommend using a beefy reasoning model like o3 or gemini-2.5/claude 4 sonnet with high thinking budget.

you may also want to include a "thoughts" field in the json output, and instruct the model to provide reasoning for its scores. this lets you see what the model is 'thinking', and can help you debug the output more easily.
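
e.g. a field like this in the output schema (a sketch; field names are illustrative):

```ts
import { z } from "zod";

// Sketch: a "thoughts" field placed before the scores so the model explains
// its reasoning first.
const RatingWithThoughts = z.object({
  thoughts: z.string().describe("Brief reasoning behind the scores below."),
  // ...score fields...
});
```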

Comment on lines 188 to 189
multiple dimensions. Return your evaluation as a JSON object with numeric percentage scores
from 0 (poor) to 1 (excellent) up to 3 decimal places. Return only a JSON object (NOT IN MARKDOWN) with the following keys:
Collaborator:

Return your evaluation as a JSON object with numeric percentage scores
from 0 (poor) to 1 (excellent) up to 3 decimal places.

having the LLM generate a number from 0 to 1 seems like a suboptimal prompting strategy. how do you differentiate between .5 and .543? instead i recommend using Likert scale scores, 1-5, and providing a brief overview of what each score means in the prompt.
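
for example, rubric wording along these lines in the prompt (illustrative, not prescribed):

```ts
// Illustrative rubric text for the judge prompt; exact anchors are up to the author.
const likertRubric = `
Score each dimension on a 1-5 scale:
  1 = very poor: the criterion is not met at all
  2 = poor: major issues
  3 = acceptable: meets the bar, with notable gaps
  4 = good: minor issues only
  5 = excellent: fully meets the criterion
`.trim();
```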

- fit: how well the expected answer actually matches the prompt.
- assumption: how much domain-specific knowledge is required to effectively answer the prompt.
- impact: the business impact/relevance of the question and answer. Good examples: competitor questions, technical questions. Bad examples: when was MongoDB founded?
- guidance (string, optional): TERSELY and DIRECTLY detail the issue; suggest how to improve. Only include this if the above scores are low.
Collaborator:

updated to include a bit more direction on what the guidance should include, and when to include it.

Suggested change
- guidance (string, optional): TERSELY and DIRECTLY detail the issue; suggest how to improve. Only include this if the above scores are low.
- guidance (string, optional): TERSELY and DIRECTLY detail the issue; suggest how to improve the prompt and/or response. Only include this if ANY of the above scores are less than 2.

Comment on lines +28 to +32
const openAiClient = new AzureOpenAI({
apiKey: OPENAI_API_KEY,
endpoint: OPENAI_ENDPOINT,
apiVersion: OPENAI_API_VERSION,
});
Collaborator:

where is this script to be run? if just from a laptop, i recommend using the Braintrust proxy. it offers more model options and has caching to make reruns much faster and cheaper.

example usage here: https://github.com/mongodb/chatbot/blob/main/packages/benchmarks/src/discovery/config.ts#L34-L40
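
a sketch of that setup; the base URL is the documented Braintrust proxy endpoint, but the client wiring and env var name here are assumptions (see the linked config for the project's actual setup):

```ts
import { OpenAI } from "mongodb-rag-core/openai";

// Sketch: point the OpenAI client at the Braintrust proxy instead of Azure directly.
const openAiClient = new OpenAI({
  baseURL: "https://api.braintrust.dev/v1/proxy",
  apiKey: process.env.BRAINTRUST_API_KEY, // assumed env var
});
```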

@nlarew (Collaborator) commented Jul 31, 2025:

Ideally we'll run directly from a new page on the Mercury dashboard as well as in other places as needed

- clarity: how well formulated and clear the prompt is.
- fit: how well the expected answer actually matches the prompt.
- assumption: how much domain-specific knowledge is required to effectively answer the prompt.
- impact: the business impact/relevance of the question and answer. Good examples: competitor questions, technical questions. Bad examples: when was MongoDB founded?
Collaborator:

this feels particularly hard to grade given the prompt. consider how you can provide more context on what business impact means (i don't really know offhand, but claude 4 probably has some good ideas).

@mongodben (Collaborator) commented Jul 31, 2025:

i recommend breaking this file into a few separate ones to follow the general patterns of the repo. we typically have a separate file for each LLM-calling function and colocate a .eval.ts file next to it.

so for this file, break into:

  • generatePromptsFromExpectedAnswer.ts for generatePromptsFromExpectedAnswer()
    • plus generatePromptsFromExpectedAnswer.eval.ts (see below for more notes on this)
  • generateRating.ts for rateWithLlm() (which i recommend renaming to generateRating for consistency)
    • plus generateRating.eval.ts (see below for more notes on this)
  • assessRelevance.ts for assessRelevance()
    • i think it's reasonable to kitchen-sink other stuff in here, like scoreVariants()
    • this file should probably have a lite unit test suite too

Comment on lines 36 to 46
export const LlmAsJudgment = z
.object({
reasonableness: z.number(),
clarity: z.number(),
specificity: z.number(),
fit: z.number(),
assumption: z.number(),
impact: z.number(),
guidance: z.string(),
})
.partial();
Collaborator:

i recommend colocating this with the prompt, since if you're using structured outputs (as recommended below), this will become part of the prompt. you can also include more strict typing like .min() and .max(), plus .describe() for the model to see
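
a sketch of what the colocated schema could look like (field set, bounds, and descriptions are illustrative):

```ts
import { z } from "zod";

// Sketch: the judgment schema colocated with the judge prompt, with bounds and
// descriptions the model can see via structured outputs.
export const LlmAsJudgment = z.object({
  clarity: z
    .number()
    .int()
    .min(1)
    .max(5)
    .describe("How well formulated and clear the prompt is (1 = very poor, 5 = excellent)."),
  fit: z
    .number()
    .int()
    .min(1)
    .max(5)
    .describe("How well the expected answer matches the prompt (1 = very poor, 5 = excellent)."),
  guidance: z
    .string()
    .optional()
    .describe("Terse, direct suggestion for improving the prompt and/or response; include only when a score is low."),
  // ...remaining dimensions follow the same pattern...
});
```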

Comment on lines 15 to 30
export const RelevanceMetrics = z.object({
// normalized square magnitude difference (lower = closer = better)
norm_sq_mag_diff: z.number(),
// cosine similarity (are vectors pointing the same way?) [-1, 1]
cos_similarity: z.number(),
});

export type RelevanceMetrics = z.infer<typeof RelevanceMetrics>;

export const ScoredPromptAndEmbeddings = PromptAndEmbeddings.and(
z.object({
relevance:
// embedding model name -> score
z.record(z.string(), RelevanceMetrics),
})
);
Collaborator:

i don't think these zod types are actually used. instead can you just use typescript types/interfaces?
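
i.e. something like this, keeping the original field names (the PromptAndEmbeddings import path is hypothetical):

```ts
// Sketch: plain TypeScript equivalents if runtime validation isn't needed.
import type { PromptAndEmbeddings } from "./assessRelevance"; // assumed path

export interface RelevanceMetrics {
  /** normalized square magnitude difference (lower = closer = better) */
  norm_sq_mag_diff: number;
  /** cosine similarity (are vectors pointing the same way?) [-1, 1] */
  cos_similarity: number;
}

export type ScoredPromptAndEmbeddings = PromptAndEmbeddings & {
  /** embedding model name -> metrics */
  relevance: Record<string, RelevanceMetrics>;
};
```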

Collaborator (Author):

You will regret this

"createQualityTestsYaml-aug-2023": "npm run build && node ./build/createAug2023QualityTestsYaml.js",
"createQualityTestsYaml-sept-2023": "npm run build && node ./build/createSept2023QualityTestsYaml.js",
"scrubMessages": "npm run build && node ./build/scrubMessages.js",
Collaborator (Author):

alphabetized

@mongodben (Collaborator) left a comment:

i think this is a very strong approach. some notes throughout about prompt context engineering and evals, plus some code organization thoughts

@cbush requested a review from mongodben August 5, 2025 15:06
@nlarew merged commit 671a645 into main Aug 12, 2025
1 check passed
@nlarew deleted the case-analysis branch August 12, 2025 16:29