Implement case analysis script #844
Conversation
@@ -0,0 +1,49 @@
import { OpenAI } from "mongodb-rag-core/openai";
i recommend using the AI SDK's generateText() function for this. won't need to make your own constructor; can import from mongodb-rag-core/aiSdk. there are lots of examples throughout the project for usage.
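a minimal sketch of that shape (the model id is a placeholder, and the options are built with a pure helper here; in the real AI SDK the `model` field is a LanguageModel instance, not a string):

```typescript
// Illustrative: the options you'd hand to the AI SDK's generateText()
// (imported from mongodb-rag-core/aiSdk). The actual call is roughly:
//   const { text } = await generateText(options);
interface TextGenOptions {
  model: string; // placeholder; the AI SDK takes a LanguageModel instance here
  system?: string;
  prompt: string;
}

function makeTextGenOptions(prompt: string, system?: string): TextGenOptions {
  // Only include a system field when a system prompt was actually provided.
  return { model: "gpt-4o", prompt, ...(system ? { system } : {}) };
}
```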
{
  role: "system",
  content: systemPrompt ?? "",
},
if the system prompt is not provided, then you shouldn't include an empty system message.
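one way to sketch that (a pure helper; names are illustrative):

```typescript
// Build the chat messages, omitting the system message entirely when no
// system prompt is provided, instead of sending an empty one.
type ChatMessage = { role: "system" | "user"; content: string };

function buildMessages(userPrompt: string, systemPrompt?: string): ChatMessage[] {
  const messages: ChatMessage[] = [];
  if (systemPrompt) {
    messages.push({ role: "system", content: systemPrompt });
  }
  messages.push({ role: "user", content: userPrompt });
  return messages;
}
```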
packages/scripts/src/assessCases.ts
Outdated
Given the expected answer, generate a number of possible prompts that could
elicit that expected answer.
*/
export const generatePromptsFromExpectedAnswer = async ({
would be good to have lite evals for this.
not sure what metric you'd use to grade outputs, but at least good to have a few test cases that you can run this over for when you change the prompt.
packages/scripts/src/assessCases.ts
Outdated
const shortName = makeShortName(prompt);
console.log(`Rating '${shortName}' with LLM...`);

const [response] = await generate({
rather than raw text gen and telling it to do JSON, models support structured output generation, which'd be perfect for this use case.
i recommend using the AI SDK's generateObject() function. it takes a schema from a zod type and handles output parsing.
packages/scripts/src/assessCases.ts
Outdated
PROMPT: ${prompt}
---
EXPECTED ANSWER: ${expected}
`,
- `,
+ `.trim(),
packages/scripts/src/assessCases.ts
Outdated
};
};

export const rateWithLlm = async ({
this LLM call really should have eval cases.
for input data + expectations, i recommend just running a few prompt/expectedResponse pairs through the system that fall into a few different buckets (good, bad for different reasons, terrible, etc.), then seeing what the model outputs and adjusting until the output meets your expectations.
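a sketch of what a tiny bucketed eval set could look like (cases and bucket names are illustrative, not from the PR):

```typescript
// A handful of prompt/expected pairs, bucketed by why they're good or bad,
// to run through the rating function whenever the prompt changes.
type Bucket = "good" | "bad-fit" | "bad-clarity" | "terrible";

interface EvalCase {
  prompt: string;
  expected: string;
  bucket: Bucket;
}

const evalCases: EvalCase[] = [
  {
    prompt: "How do I create a compound index in MongoDB?",
    expected: "Use db.collection.createIndex({ a: 1, b: 1 }) ...",
    bucket: "good",
  },
  {
    // expected answer doesn't actually answer the prompt
    prompt: "Tell me about indexes",
    expected: "MongoDB was founded in 2007.",
    bucket: "bad-fit",
  },
  {
    // too vague to rate well on any dimension
    prompt: "stuff?",
    expected: "...",
    bucket: "terrible",
  },
];
```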
also, what model are you planning on using for it?
i recommend a beefy reasoning model like o3 or gemini-2.5/claude 4 sonnet with a high thinking budget.
you may also want to include a "thoughts" field in the JSON output and instruct the model to provide reasoning for its scores. this lets you see what the model is 'thinking' and can help you debug the output more easily.
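illustrative shape for that (field names are placeholders; putting "thoughts" first matters because the model generates fields in order, so it reasons before committing to scores):

```typescript
// Rating output with a leading "thoughts" field for the model's reasoning.
interface RatingWithThoughts {
  thoughts: string; // model's reasoning, generated before the scores
  fit: number;
  clarity: number;
  guidance?: string;
}

const example: RatingWithThoughts = {
  thoughts: "Prompt is clear but the expected answer assumes internal docs.",
  fit: 3,
  clarity: 4,
};
```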
packages/scripts/src/assessCases.ts
Outdated
multiple dimensions. Return your evaluation as a JSON object with numeric percentage scores
from 0 (poor) to 1 (excellent) up to 3 decimal places. Return only a JSON object (NOT IN MARKDOWN) with the following keys:
Return your evaluation as a JSON object with numeric percentage scores
from 0 (poor) to 1 (excellent) up to 3 decimal places.
having the LLM generate a number from 0-1 seems like a suboptimal prompting strategy. how do you differentiate between .5 vs .543? instead i recommend using likert scale scores, 1-5, and providing a brief overview of what each score means in the prompt.
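a sketch of the likert approach (rubric wording is illustrative), with a helper to normalize back to [0, 1] if downstream code still wants a fraction:

```typescript
// 1-5 likert rubric with an explicit meaning per score, suitable for
// inlining into the rating prompt.
const LIKERT_RUBRIC: Record<number, string> = {
  1: "unusable; fails the criterion entirely",
  2: "poor; major issues",
  3: "acceptable; noticeable issues",
  4: "good; minor issues",
  5: "excellent; no meaningful issues",
};

// Map a 1-5 likert score onto [0, 1]: 1 -> 0, 3 -> 0.5, 5 -> 1.
function likertToFraction(score: number): number {
  if (!Number.isInteger(score) || score < 1 || score > 5) {
    throw new Error(`likert score out of range: ${score}`);
  }
  return (score - 1) / 4;
}
```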
packages/scripts/src/assessCases.ts
Outdated
- fit: how well the expected answer actually matches the prompt.
- assumption: how much domain-specific knowledge is required to effectively answer the prompt.
- impact: the business impact/relevance of the question and answer. Good examples: competitor questions, technical questions. Bad examples: when was MongoDB founded?
- guidance (string, optional): TERSELY and DIRECTLY detail the issue; suggest how to improve. Only include this if the above scores are low.
updated to include a bit more direction on what the guidance should include, and when to include it.
- - guidance (string, optional): TERSELY and DIRECTLY detail the issue; suggest how to improve. Only include this if the above scores are low.
+ - guidance (string, optional): TERSELY and DIRECTLY detail the issue; suggest how to improve the prompt and/or response. Only include this if ANY of the above scores are less than 2.
const openAiClient = new AzureOpenAI({
  apiKey: OPENAI_API_KEY,
  endpoint: OPENAI_ENDPOINT,
  apiVersion: OPENAI_API_VERSION,
});
where is this script to be run? if just from a laptop, i recommend using the Braintrust proxy. it has more model options and caching to make reruns much faster and cheaper.
example usage here: https://github.com/mongodb/chatbot/blob/main/packages/benchmarks/src/discovery/config.ts#L34-L40
Ideally we'll run directly from a new page on the Mercury dashboard, as well as in other places as needed.
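the Braintrust proxy suggestion above boils down to swapping the client's base URL; a sketch (the baseURL is from Braintrust's docs, the key placeholder and client wiring are assumptions, see the linked benchmarks config for the repo's actual pattern):

```typescript
// Point an OpenAI-compatible client at the Braintrust proxy for local runs.
const braintrustProxyConfig = {
  baseURL: "https://api.braintrust.dev/v1/proxy",
  apiKey: "<BRAINTRUST_API_KEY>", // read from the environment in real use
};
// usage would be roughly: new OpenAI(braintrustProxyConfig)
```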
packages/scripts/src/assessCases.ts
Outdated
- clarity: how well formulated and clear the prompt is.
- fit: how well the expected answer actually matches the prompt.
- assumption: how much domain-specific knowledge is required to effectively answer the prompt.
- impact: the business impact/relevance of the question and answer. Good examples: competitor questions, technical questions. Bad examples: when was MongoDB founded?
this feels particularly hard to grade given the prompt. consider how you can provide more context on what business impact means (i don't really know offhand, but claude 4 probably has some good ideas)
packages/scripts/src/assessCases.ts
Outdated
i recommend breaking this file into a few separate ones to follow the general patterns of the repo. we typically have a separate file for each LLM-calling function and colocate a .eval.ts file next to it.
so for this file, break into:
- generatePromptsFromExpectedAnswer.ts for generatePromptsFromExpectedAnswer()
  - plus generatePromptsFromExpectedAnswer.eval.ts (see below for more notes on this)
- generateRating.ts for rateWithLlm() (which i rec renaming to generateRating for consistency)
  - plus generateRating.eval.ts (see below for more notes on this)
- assessRelevance.ts for assessRelevance()
  - think reasonable to kitchen sink other stuff in here, like scoreVariants()
  - this file should probably have a lite unit test suite too
packages/scripts/src/Case.ts
Outdated
export const LlmAsJudgment = z
  .object({
    reasonableness: z.number(),
    clarity: z.number(),
    specificity: z.number(),
    fit: z.number(),
    assumption: z.number(),
    impact: z.number(),
    guidance: z.string(),
  })
  .partial();
i recommend colocating this with the prompt, since if you're using structured outputs (as recommended below), this will become part of the prompt. you can also include more strict typing like .min() and .max(), plus .describe() for the model to see
packages/scripts/src/Case.ts
Outdated
export const RelevanceMetrics = z.object({
  // normalized square magnitude difference (lower = closer = better)
  norm_sq_mag_diff: z.number(),
  // cosine similarity (are vectors pointing the same way?) [-1, 1]
  cos_similarity: z.number(),
});

export type RelevanceMetrics = z.infer<typeof RelevanceMetrics>;

export const ScoredPromptAndEmbeddings = PromptAndEmbeddings.and(
  z.object({
    relevance:
      // embedding model name -> score
      z.record(z.string(), RelevanceMetrics),
  })
);
i don't think these zod types are actually used. instead, can you just use typescript types/interfaces?
You will regret this
"createQualityTestsYaml-aug-2023": "npm run build && node ./build/createAug2023QualityTestsYaml.js",
"createQualityTestsYaml-sept-2023": "npm run build && node ./build/createSept2023QualityTestsYaml.js",
"scrubMessages": "npm run build && node ./build/scrubMessages.js",
alphabetized
i think this is a very strong approach. some notes throughout about prompt context engineering and evals, plus some code organization thoughts
This adds the assessCases script that analyzes the prompt/expected answer pairs using two methodologies: