Replies: 2 comments
Great question — this is a known challenge in LLM evaluation called scoring instability, and it comes from the non-deterministic nature of LLM-as-judge pipelines.
This is scoring instability from non-deterministic LLM judges. Two fixes that worked for us:
- Set temperature=0 on your evaluation model. If the temperature isn't fixed, you'll get different scores on every run.
- Check your custom template for overlapping criteria. Legal queries often have multiple valid interpretations, and the judge scores differently depending on which lens it applies first.
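To illustrate the temperature point with a toy judge (this is a stand-in for an LLM call, not DeepEval's API; the scores and noise model are made up):

```python
import random

def toy_judge(output: str, temperature: float, rng: random.Random) -> float:
    """Stand-in for an LLM judge call: a fixed base score plus
    temperature-scaled sampling noise."""
    base = 0.8  # pretend this is the "true" relevancy score
    noise = rng.uniform(-0.2, 0.2) * temperature
    return round(max(0.0, min(1.0, base + noise)), 3)

rng = random.Random(42)

# With temperature > 0, repeated runs of the same output disagree.
hot = {toy_judge("same output", temperature=1.0, rng=rng) for _ in range(10)}

# With temperature = 0, every run returns the identical score.
cold = {toy_judge("same output", temperature=0.0, rng=rng) for _ in range(10)}

print(len(hot) > 1, cold)  # the hot set has many scores, the cold set one
```

The same principle applies to the real judge: zero out sampling temperature on the evaluation model and the only remaining variance comes from the template itself.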
Seeking Advice: Improving Scoring Consistency with Custom AnswerRelevancyTemplate
Context
I'm evaluating a legal assistant AI system using DeepEval with a custom AnswerRelevancyTemplate. The system handles various query types, including legal guidance, factual lookups, directory searches, and policy references.
Challenge
I'm experiencing inconsistent scoring when evaluating the same LLM outputs against a golden dataset with expected responses. The same actual output can receive different relevancy scores across multiple evaluation runs, which makes the results difficult to trust or compare.
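To make the inconsistency concrete, I've been re-running each case several times and looking at the per-case score spread. A rough sketch of that check (the case names, scores, and threshold below are illustrative, not real data):

```python
from statistics import mean, stdev

# Scores collected for the same (input, actual_output) pair across
# repeated evaluation runs -- illustrative numbers only.
runs = {
    "legal_guidance_case_1": [0.92, 0.61, 0.88, 0.90, 0.64],
    "factual_lookup_case_2": [0.97, 0.96, 0.97, 0.98, 0.97],
}

UNSTABLE_STDEV = 0.05  # arbitrary cutoff for flagging a case as unstable

for case, scores in runs.items():
    spread = stdev(scores)
    flag = "UNSTABLE" if spread > UNSTABLE_STDEV else "stable"
    print(f"{case}: mean={mean(scores):.2f} stdev={spread:.2f} [{flag}]")
```

The legal-guidance style cases are the ones that keep getting flagged, which is what points me at the template rather than the dataset.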
Current Implementation
My custom template extends AnswerRelevancyTemplate with two main methods. The template includes detailed examples for different response types (legal guidance, factual lookups, firm finder results, policy references, etc.).
Specific Questions
LLM Variability: What strategies have you found most effective for reducing LLM variability in custom templates?
Statement Granularity: I'm seeing inconsistency in how statements are evaluated between runs. Sometimes the evaluation passes and scores high for a case, but a subsequent run finds issues with the same case it had scored highly before. This seems to be related to the non-deterministic nature of using an LLM as a judge.
Golden Dataset Evaluation: When comparing against expected responses, what comparison approach should I take?
Evaluation Infrastructure: Are there DeepEval configuration options or evaluation patterns that help with consistency? (e.g., running multiple evaluations and averaging, specific model configurations, etc.)
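On the "running multiple evaluations and averaging" option: is something like a median-of-N pattern reasonable? My thinking is that a single outlier judge run can drag a mean across a pass/fail threshold, while the median mostly ignores it. A sketch, where `run_metric` is a hypothetical stand-in for one evaluation call (in practice it would wrap a DeepEval metric measurement):

```python
from statistics import median

def aggregate_score(run_metric, n_runs: int = 5) -> float:
    """Call the judge n_runs times and report the median score.

    run_metric: zero-argument callable returning one run's score;
    a placeholder for whatever actually invokes the LLM judge.
    """
    return median(run_metric() for _ in range(n_runs))

# Demo with a canned sequence of noisy scores: one outlier run (0.30)
# would ruin a plain average but barely moves the median.
scores = iter([0.88, 0.30, 0.91, 0.86, 0.90])
result = aggregate_score(lambda: next(scores))
print(result)  # -> 0.88
```

Is this a sensible pattern, or does DeepEval have a built-in way to do this that I'm missing?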
What I've Tried
Looking For
Has anyone dealt with similar consistency challenges in custom evaluation templates, especially in specialized domains? Would love to hear about approaches that have worked well.
Here's my template:
Note: Examples in my template have been anonymized but maintain the structural complexity of real legal assistant responses.