Implement case analysis script #844

Merged: 15 commits from case-analysis into main on Aug 12, 2025

Conversation

@cbush (Collaborator) commented Jul 24, 2025

This adds the assessCases script, which analyzes prompt/expected answer pairs using two methodologies:

  1. Answer relevance: given the expected answer, generate N prompts that could elicit that answer, then compare their embeddings with the embedding of the original prompt (see the sketch below).
  2. LLM as judge: use an LLM to score the prompt/answer pair on a variety of metrics and to generate recommendations for improvement.
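
A minimal sketch of the relevance comparison in (1), assuming a generic embed() helper and simple averaging over the N generated prompts (both are assumptions; the script's actual embedding client and aggregation may differ):

```ts
// Hypothetical sketch: embed the original prompt and the N generated prompts,
// then compare them with cosine similarity. `embed` is an assumed helper.
function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, ai, i) => sum + ai * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, ai) => sum + ai * ai, 0));
  const magB = Math.sqrt(b.reduce((sum, bi) => sum + bi * bi, 0));
  return dot / (magA * magB);
}

async function scoreAnswerRelevance(
  originalPrompt: string,
  generatedPrompts: string[],
  embed: (text: string) => Promise<number[]>
): Promise<number> {
  const original = await embed(originalPrompt);
  const similarities = await Promise.all(
    generatedPrompts.map(async (p) => cosineSimilarity(original, await embed(p)))
  );
  // Average similarity across the generated prompts (aggregation choice is an assumption).
  return similarities.reduce((sum, s) => sum + s, 0) / similarities.length;
}
```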

@cbush marked this pull request as ready for review July 31, 2025 00:19
@cbush changed the title from "Case analysis WIP" to "Implement case analysis script" Jul 31, 2025
@@ -0,0 +1,49 @@
import { OpenAI } from "mongodb-rag-core/openai";
Collaborator:

i recommend using the AI SDK's generateText() function for this. you won't need to make your own constructor; you can import it from mongodb-rag-core/aiSdk. there are lots of usage examples throughout the project.
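
roughly what that looks like (a sketch; assumes mongodb-rag-core/aiSdk re-exports the AI SDK's generateText and LanguageModel, and that the model instance comes from the project's existing configuration):

```ts
import { generateText } from "mongodb-rag-core/aiSdk";
import type { LanguageModel } from "mongodb-rag-core/aiSdk";

// Sketch only: the model instance and prompt plumbing would come from the
// project's existing configuration, not from this snippet.
async function generateWithAiSdk(model: LanguageModel, prompt: string): Promise<string> {
  const { text } = await generateText({ model, prompt });
  return text;
}
```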

Comment on lines 24 to 27
{
role: "system",
content: systemPrompt ?? "",
},
Collaborator:

if a system prompt is not provided, then you shouldn't include an empty system message.
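
a sketch of one way to do that (names and the message type are illustrative):

```ts
type ChatMessage = { role: "system" | "user"; content: string };

// Sketch: only include a system message when a system prompt was provided.
function buildMessages(userPrompt: string, systemPrompt?: string): ChatMessage[] {
  return [
    ...(systemPrompt ? [{ role: "system" as const, content: systemPrompt }] : []),
    { role: "user" as const, content: userPrompt },
  ];
}
```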

Given the expected answer, generate a number of possible prompts that could
elicit that expected answer.
*/
export const generatePromptsFromExpectedAnswer = async ({
Collaborator:

would be good to have lite evals for this.

not sure what metric you'd use to grade outputs, but it'd at least be good to have a few test cases that you can run this over if you change the prompt.

const shortName = makeShortName(prompt);
console.log(`Rating '${shortName}' with LLM...`);

const [response] = await generate({
Collaborator:

rather than raw text generation and telling the model to output JSON, models support structured output generation, which'd be perfect for this use case.

i recommend using the AI SDK's generateObject() function. it takes a schema from a zod type and handles output parsing.
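
a sketch under those assumptions (the schema fields and model instance are illustrative, not the PR's actual ones):

```ts
import { generateObject } from "mongodb-rag-core/aiSdk";
import type { LanguageModel } from "mongodb-rag-core/aiSdk";
import { z } from "zod";

// Illustrative schema; the real one would mirror LlmAsJudgment.
const Rating = z.object({
  clarity: z.number().int().min(1).max(5),
  fit: z.number().int().min(1).max(5),
  guidance: z.string().optional(),
});

// Sketch: structured output instead of parsing JSON out of raw text.
async function rateCase(model: LanguageModel, prompt: string): Promise<z.infer<typeof Rating>> {
  const { object } = await generateObject({ model, schema: Rating, prompt });
  return object;
}
```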

PROMPT: ${prompt}
---
EXPECTED ANSWER: ${expected}
`,
Collaborator:

Suggested change:
- `,
+ `.trim(),

};
};

export const rateWithLlm = async ({
Collaborator:

this LLM call really should have eval cases.

for input data + expectations, i recommend just running a few prompt/expectedResponse pairs through the system that fall into a few different buckets (good, bad for different reasons, terrible, etc.), then seeing what the model outputs and adjusting until the output meets your expectations.
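
a minimal .eval.ts sketch along those lines; the project name, cases, scorer, and rateWithLlm's signature/import path are all placeholders:

```ts
import { Eval } from "braintrust";
import { rateWithLlm } from "./assessCases"; // import path is assumed

// Placeholder cases spanning a few quality buckets.
Eval("case-analysis-rateWithLlm", {
  data: () => [
    {
      input: {
        prompt: "How do I create a compound index in MongoDB?",
        expected: "Use db.collection.createIndex({ a: 1, b: 1 }) ...",
      },
      metadata: { bucket: "good" },
    },
    {
      input: { prompt: "Tell me about stuff.", expected: "MongoDB is a database." },
      metadata: { bucket: "bad-vague-prompt" },
    },
    {
      input: { prompt: "When was MongoDB founded?", expected: "2007." },
      metadata: { bucket: "low-impact" },
    },
  ],
  task: async ({ prompt, expected }) => rateWithLlm({ prompt, expected }),
  scores: [
    // Placeholder scorer: just checks that a rating object came back at all.
    ({ output }) => ({ name: "returnsRating", score: output ? 1 : 0 }),
  ],
});
```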

Collaborator:

also, what model are you planning on using for it?

i recommend using a beefy reasoning model like o3 or gemini-2.5/claude 4 sonnet with high thinking budget.

you may also want to include a "thoughts" field in the json output, and instruct the model to provide reasoning for its scores. this lets you see what the model is 'thinking', and can help you debug the output more easily.
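
e.g. a field like this in the output schema (a sketch; field names are illustrative):

```ts
import { z } from "zod";

// Sketch: a "thoughts" field placed before the scores so the model explains
// its reasoning first.
const RatingWithThoughts = z.object({
  thoughts: z.string().describe("Brief reasoning behind the scores below."),
  // ...score fields...
});
```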

Comment on lines 188 to 189
multiple dimensions. Return your evaluation as a JSON object with numeric percentage scores
from 0 (poor) to 1 (excellent) up to 3 decimal places. Return only a JSON object (NOT IN MARKDOWN) with the following keys:
Collaborator:

Return your evaluation as a JSON object with numeric percentage scores
from 0 (poor) to 1 (excellent) up to 3 decimal places.

having the LLM generate a number from 0 to 1 seems like a suboptimal prompting strategy. how do you differentiate between .5 and .543? instead i recommend using Likert scale scores, 1-5, and providing a brief overview of what each score means in the prompt.
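
for example, rubric wording along these lines in the prompt (illustrative, not prescribed):

```ts
// Illustrative rubric text for the judge prompt; exact anchors are up to the author.
const likertRubric = `
Score each dimension on a 1-5 scale:
  1 = very poor: the criterion is not met at all
  2 = poor: major issues
  3 = acceptable: meets the bar, with notable gaps
  4 = good: minor issues only
  5 = excellent: fully meets the criterion
`.trim();
```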

- fit: how well the expected answer actually matches the prompt.
- assumption: how much domain-specific knowledge is required to effectively answer the prompt.
- impact: the business impact/relevance of the question and answer. Good examples: competitor questions, technical questions. Bad examples: when was MongoDB founded?
- guidance (string, optional): TERSELY and DIRECTLY detail the issue; suggest how to improve. Only include this if the above scores are low.
Collaborator:

updated to include a bit more direction on what the guidance should include, and when to include it.

Suggested change
- guidance (string, optional): TERSELY and DIRECTLY detail the issue; suggest how to improve. Only include this if the above scores are low.
- guidance (string, optional): TERSELY and DIRECTLY detail the issue; suggest how to improve the prompt and/or response. Only include this if ANY of the above scores are less than 2.

Comment on lines +28 to +32
const openAiClient = new AzureOpenAI({
apiKey: OPENAI_API_KEY,
endpoint: OPENAI_ENDPOINT,
apiVersion: OPENAI_API_VERSION,
});
Collaborator:

where is this script to be run? if just from a laptop, i recommend using the Braintrust proxy. it offers more model options and has caching to make reruns much faster and cheaper.

example usage here: https://github.com/mongodb/chatbot/blob/main/packages/benchmarks/src/discovery/config.ts#L34-L40
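
a sketch of that setup; the base URL is the documented Braintrust proxy endpoint, but the client wiring and env var name here are assumptions (see the linked config for the project's actual setup):

```ts
import { OpenAI } from "mongodb-rag-core/openai";

// Sketch: point the OpenAI client at the Braintrust proxy instead of Azure directly.
const openAiClient = new OpenAI({
  baseURL: "https://api.braintrust.dev/v1/proxy",
  apiKey: process.env.BRAINTRUST_API_KEY, // assumed env var
});
```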

@nlarew (Collaborator) commented Jul 31, 2025:

Ideally we'll run directly from a new page on the Mercury dashboard as well as in other places as needed

- clarity: how well formulated and clear the prompt is.
- fit: how well the expected answer actually matches the prompt.
- assumption: how much domain-specific knowledge is required to effectively answer the prompt.
- impact: the business impact/relevance of the question and answer. Good examples: competitor questions, technical questions. Bad examples: when was MongoDB founded?
Collaborator:

this feels particularly hard to grade given the prompt. consider how you can provide more context on what business impact means (i don't really know offhand, but claude 4 probably has some good ideas).

@mongodben (Collaborator) commented Jul 31, 2025:

i recommend breaking this file into a few separate ones to follow the general patterns of the repo. we typically have a separate file for each LLM-calling function and colocate a .eval.ts file next to it.

so for this file, break into:

  • generatePromptsFromExpectedAnswer.ts for generatePromptsFromExpectedAnswer()
    • plus generatePromptsFromExpectedAnswer.eval.ts (see below for more notes on this)
  • generateRating.ts for rateWithLlm() (which i recommend renaming to generateRating for consistency)
    • plus generateRating.eval.ts (see below for more notes on this)
  • assessRelevance.ts for assessRelevance()
    • i think it's reasonable to kitchen-sink other stuff in here, like scoreVariants()
    • this file should probably have a lite unit test suite too

Comment on lines 36 to 46
export const LlmAsJudgment = z
.object({
reasonableness: z.number(),
clarity: z.number(),
specificity: z.number(),
fit: z.number(),
assumption: z.number(),
impact: z.number(),
guidance: z.string(),
})
.partial();
Collaborator:

i recommend colocating this with the prompt, since if you're using structured outputs (as recommended below), this will become part of the prompt. you can also include more strict typing like .min() and .max(), plus .describe() for the model to see
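
a sketch of what the colocated schema could look like (field set, bounds, and descriptions are illustrative):

```ts
import { z } from "zod";

// Sketch: the judgment schema colocated with the judge prompt, with bounds and
// descriptions the model can see via structured outputs.
export const LlmAsJudgment = z.object({
  clarity: z
    .number()
    .int()
    .min(1)
    .max(5)
    .describe("How well formulated and clear the prompt is (1 = very poor, 5 = excellent)."),
  fit: z
    .number()
    .int()
    .min(1)
    .max(5)
    .describe("How well the expected answer matches the prompt (1 = very poor, 5 = excellent)."),
  guidance: z
    .string()
    .optional()
    .describe("Terse, direct suggestion for improving the prompt and/or response; include only when a score is low."),
  // ...remaining dimensions follow the same pattern...
});
```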

Comment on lines 15 to 30
export const RelevanceMetrics = z.object({
// normalized square magnitude difference (lower = closer = better)
norm_sq_mag_diff: z.number(),
// cosine similarity (are vectors pointing the same way?) [-1, 1]
cos_similarity: z.number(),
});

export type RelevanceMetrics = z.infer<typeof RelevanceMetrics>;

export const ScoredPromptAndEmbeddings = PromptAndEmbeddings.and(
z.object({
relevance:
// embedding model name -> score
z.record(z.string(), RelevanceMetrics),
})
);
Collaborator:

i don't think these zod types are actually used. instead can you just use typescript types/interfaces?
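
i.e. something like this, keeping the original field names (the PromptAndEmbeddings import path is hypothetical):

```ts
// Sketch: plain TypeScript equivalents if runtime validation isn't needed.
import type { PromptAndEmbeddings } from "./assessRelevance"; // assumed path

export interface RelevanceMetrics {
  /** normalized square magnitude difference (lower = closer = better) */
  norm_sq_mag_diff: number;
  /** cosine similarity (are vectors pointing the same way?) [-1, 1] */
  cos_similarity: number;
}

export type ScoredPromptAndEmbeddings = PromptAndEmbeddings & {
  /** embedding model name -> metrics */
  relevance: Record<string, RelevanceMetrics>;
};
```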

Collaborator (Author):

You will regret this

"createQualityTestsYaml-aug-2023": "npm run build && node ./build/createAug2023QualityTestsYaml.js",
"createQualityTestsYaml-sept-2023": "npm run build && node ./build/createSept2023QualityTestsYaml.js",
"scrubMessages": "npm run build && node ./build/scrubMessages.js",
Collaborator (Author):

alphabetized

@mongodben (Collaborator) left a comment:

i think this is a very strong approach. some notes throughout about prompt context engineering and evals, plus some code organization thoughts

@cbush requested a review from mongodben August 5, 2025 15:06
@nlarew merged commit 671a645 into main Aug 12, 2025
1 check passed
@nlarew deleted the case-analysis branch August 12, 2025 16:29