Inconsistent scoring using a custom AnswerRelevancyTemplate #1674

MRLab12 · 2025-06-09T14:48:34Z

MRLab12
Jun 9, 2025

Seeking Advice: Improving Scoring Consistency with Custom AnswerRelevancyTemplate

Context

I'm working on evaluating a legal assistant AI system using DeepEval with a custom AnswerRelevancyTemplate. The system handles various types of queries including legal guidance, factual lookups, directory searches, and policy references.

Challenge

I'm experiencing inconsistent scoring when evaluating the same LLM outputs against a golden dataset with expected responses. The same actual output can receive different relevancy scores across multiple evaluation runs, which makes it difficult to:

Establish reliable benchmarks
Track model improvements over time
Have confidence in A/B testing results

Current Implementation

My custom template extends AnswerRelevancyTemplate with two main methods:

Statement Generation: Breaks down system responses into factual/legal statements (handles various response types like legal guidance, directory lookups, policy references)
Verdict Assessment: Evaluates each statement's relevance using 'yes'/'no'/'idk' classifications

The template includes detailed examples for different response types (legal guidance, factual lookups, firm finder results, policy references, etc.).

Specific Questions

LLM Variability: What strategies have you found most effective for reducing LLM variability in custom templates?
Statement Granularity: I'm seeing inconsistency in how statements are evaluated between runs. Sometimes the evaluation will pass and score high for certain cases, but in a subsequent run it will find issues with the case it had given a high score in the previous run. This seems to be related to the non deterministic nature of of using LLM as a judge.
Golden Dataset Evaluation: When comparing against expected responses, should I be:
- Pre-processing both actual and expected outputs through the same statement generation?
- Implementing some form of statement matching/alignment?
Evaluation Infrastructure: Are there DeepEval configuration options or evaluation patterns that help with consistency? (e.g., running multiple evaluations and averaging, specific model configurations, etc.)

What I've Tried

Added more detailed examples to reduce ambiguity
Experimented with different eval statement structures

Looking For

Best practices from others using custom templates
Strategies for handling domain-specific evaluation challenges
Ideas for making legal/compliance-focused evaluations more deterministic
Any DeepEval-specific tips for improving consistency

Has anyone dealt with similar consistency challenges in custom evaluation templates, especially in specialized domains? Would love to hear about approaches that have worked well.

Here's my template:

from deepeval.metrics.answer_relevancy import AnswerRelevancyTemplate


class LegalAnswerRelevancyTemplate(AnswerRelevancyTemplate):
    @staticmethod
    def generate_statements(actual_output: str):
        return f"""Given the system response, break down and generate a list of factual or legal statements presented. Each statement should be concise, self-contained, and reflect a single fact, rule, or piece of information from the response. This applies to both legal guidance and factual lookups (such as names, contact details, or team descriptions). Do not split statements unnecessarily; if a sentence contains multiple related facts that together address the input, keep them as one statement. Focus on capturing the main legal or factual points relevant to the user's query, even if the wording is paraphrased or statutes are referenced generally rather than by number.

Example 1 (Legal Guidance):
System Response:
Employees at the level of Director or higher may execute a contract on behalf of any controlled ExampleCorp subsidiary for their business or function, provided that the contract has been approved pursuant to the business unit's approval process.

{{
    "statements": [
        "Employees at the level of Director or higher may execute a contract on behalf of any controlled ExampleCorp subsidiary for their business or function, provided that the contract has been approved pursuant to the business unit's approval process."
    ]
}}
===== END OF EXAMPLE ======

Example 2 (Factual Lookup):
System Response:
The Corporate Legal team provides advice and support for ExampleCorp's business operations. The team's members are Alice Smith, Bob Johnson, and Carol Lee. For more, you can go to their SharePoint page at https://example.sharepoint.com/sites/corporate-legal.

{{
    "statements": [
        "The Corporate Legal team provides advice and support for ExampleCorp's business operations. The team's members are Alice Smith, Bob Johnson, and Carol Lee. More information is available at https://example.sharepoint.com/sites/corporate-legal."
    ]
}}
===== END OF EXAMPLE ======

Example 3 (Firm Finder):
System Response:
ExampleCorp has worked with the law firm Smith & Partners LLP on intellectual property matters in the past three years. For more information, see the firm directory at https://example.sharepoint.com/sites/firmfinder.

{{
    "statements": [
        "ExampleCorp has worked with the law firm Smith & Partners LLP on intellectual property matters in the past three years.",
        "More information is available at https://example.sharepoint.com/sites/firmfinder."
    ]
}}
===== END OF EXAMPLE ======

Example 4 (Policy Reference):
System Response:
ExampleCorp's social media posting policies are available at the following URLs: https://example.sharepoint.com/sites/policies/social-media-guidelines.pdf and https://example.sharepoint.com/sites/policies/influencer-agreement-template.pdf.

{{
    "statements": [
        "ExampleCorp's social media posting policies are available at https://example.sharepoint.com/sites/policies/social-media-guidelines.pdf.",
        "ExampleCorp's influencer agreement template is available at https://example.sharepoint.com/sites/policies/influencer-agreement-template.pdf."
    ]
}}
===== END OF EXAMPLE ======

Example 5 (No Relevant Information):
System Response:
That's not something I can answer with confidence based on what's available to me. Try the search function on the ExampleCorp homepage.

{{
    "statements": [
        "The system could not answer the question based on available information and suggested using the search function on the ExampleCorp homepage."
    ]
}}
===== END OF EXAMPLE ======

Example 6 (Directory Lookup):
System Response:
You can ask the Legal Assistant legal-related questions and it will search across ExampleCorp resources to provide you with the most relevant information. You can also ask for information about professionals within ExampleCorp who work on particular legal topics, and the Legal Assistant will search the directory to find relevant people who you can contact. Additionally, you can ask about non-legal questions as long as they relate to content available within the Legal Assistant, such as trainings or committee pages. The Legal Assistant is equipped to provide accurate, up-to-date information based on its resources. Note that the Legal Assistant does not provide legal advice or make decisions. It only shares guidance based on the available resources. Please consult colleagues in the legal function for more information about the guidance provided by the Legal Assistant.

{{
    "statements": [
        "The Legal Assistant can answer legal-related questions by searching ExampleCorp resources.",
        "The Legal Assistant can provide information about professionals within ExampleCorp who work on particular legal topics.",
        "The Legal Assistant can answer non-legal questions if they relate to content available within the Legal Assistant, such as trainings or committee pages.",
        "The Legal Assistant provides accurate, up-to-date information based on its resources.",
        "The Legal Assistant does not provide legal advice or make decisions.",
        "Guidance is based on available resources and users should consult colleagues in the legal function for more information."
    ]
}}
===== END OF EXAMPLE ======

Example 7 (Newsletter Reference):
System Response:
All the articles and other content from legal newsletters can be found [here](https://example.sharepoint.com/sites/legal-news).

{{
    "statements": [
        "All the articles and other content from legal newsletters can be found at https://example.sharepoint.com/sites/legal-news."
    ]
}}
===== END OF EXAMPLE ======

Example 8 (Privacy Guidance):
System Response:
To determine if you need a Privacy Impact Assessment (PIA) or other privacy review, you can refer to the ExampleCorp Privacy Review Process guidance. A PIA is needed if a project, process, or system involves handling personal data. Personal data is any data that links or can be linked to an individual or household and includes direct identifiers (like emails), indirect identifiers (like IP address or device IDs), and behavioral information (like geolocation or what article someone clicked on). If you are dealing with vendors or third parties who have access to ExampleCorp systems or data, they must be vetted to ensure they use proper technical tools and organization processes to securely protect ExampleCorp information and systems. The vetting process is part of ensuring privacy compliance and might necessitate a Privacy Impact Assessment. A PIA is also needed for projects or systems that process personal data even if no vendor is used (for example an internally built database). For more detailed guidance, you can consult the Privacy Legal Team at privacy@example.com or refer to the [ExampleCorp Privacy Review Process](https://example.sharepoint.com/sites/privacy/SitePages/Privacy-Review-Process.aspx) for more information.

{{
    "statements": [
        "To determine if you need a Privacy Impact Assessment (PIA) or other privacy review, refer to the ExampleCorp Privacy Review Process guidance.",
        "A PIA is needed if a project, process, or system involves handling personal data.",
        "Personal data includes direct identifiers, indirect identifiers, and behavioral information.",
        "Vendors or third parties with access to ExampleCorp systems or data must be vetted for proper security.",
        "The vetting process is part of privacy compliance and may require a Privacy Impact Assessment.",
        "A PIA is needed for projects or systems that process personal data even if no vendor is used.",
        "For more guidance, consult the Privacy Legal Team at privacy@example.com or refer to https://example.sharepoint.com/sites/privacy/SitePages/Privacy-Review-Process.aspx."
    ]
}}
===== END OF EXAMPLE ======

Example 9 (Community Service Page):
System Response:
You can find the Community Engagement Committee page at https://example.sharepoint.com/sites/community-engagement.

{{
    "statements": [
        "The Community Engagement Committee page is available at https://example.sharepoint.com/sites/community-engagement."
    ]
}}
===== END OF EXAMPLE ======

Text:
{actual_output}

JSON:
"""

    @staticmethod
    def generate_verdicts(input: str, statements: str):
        return f"""For the provided list of statements, determine whether each statement is relevant to address the input.
Please generate a list of JSON with two keys: `verdict` and `reason`.
The 'verdict' key should STRICTLY be either a 'yes', 'idk' or 'no'. Answer 'yes' if the statement is relevant to addressing the original input, 'no' if the statement is irrelevant, and 'idk' if it is ambiguous (eg., not directly relevant but could be used as a supporting point to address the input).
The 'reason' is the reason for the verdict.
Provide a 'reason' ONLY if the answer is 'no'.
The provided statements are statements made in the actual output.

**
IMPORTANT: Please make sure to only return in JSON format, with the 'verdicts' key mapping to a list of JSON objects.

Example input:
Who handles international content licensing?

Example statements:
[
    "Alex Green handles international content licensing at ExampleCorp.",
    "Alex Green's title is Associate Legal Counsel.",
    "Alex Green works in the Business & Legal Affairs division.",
    "Alex Green's email is alex.green@example.com.",
    "Alex Green is located in London, UK.",
    "Alex Green's profile can be found at Alex Green's Profile.",
    "Jamie Lee handles international content licensing at ExampleCorp.",
    "Jamie Lee's title is VP Legal and Business Affairs.",
    "Jamie Lee works in the Business & Legal UK division.",
    "Jamie Lee's email is jamie.lee@example.com.",
    "Jamie Lee is located in Paris, France.",
    "Jamie Lee's profile can be found at Jamie Lee's Profile."
]

Example JSON:
{{
    "verdicts": [
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "yes"}}
    ]
}}
===== END OF EXAMPLE ======

Example input:
Do you have the link to the latest town hall?

Example statements:
[
    "The link to the latest Town Hall is available at https://example.sharepoint.com/sites/legal-news.",
    "You can find the latest town hall content at the following link: [Legal News](https://example.sharepoint.com/sites/legal-news).",
    "You can find the latest town hall and other related content on Legal News."
]

Example JSON:
{{
    "verdicts": [
        {{"verdict": "yes"}},
        {{"verdict": "yes"}},
        {{"verdict": "no", "reason": "The statement does not provide a specific link for the latest Town Hall."}}
    ]
}}
===== END OF EXAMPLE ======

Input:
{input}

Statements:
{statements}

JSON:
"""

Note: Examples in my template have been anonymized but maintain the structural complexity of real legal assistant responses.

akashcodes23 · 2026-04-03T14:51:36Z

akashcodes23
Apr 3, 2026

Great question — this is a known challenge in LLM evaluation called scoring instability, and it comes from the non-deterministic nature of LLM-as-judge pipelines.
A few things that helped us in our own hallucination detection work:

Set temperature=0 on your evaluation model. DeepEval uses an LLM judge internally — if it's not running at temperature 0, you'll get different scores for the same input every time.
Use ensemble scoring. Instead of a single evaluation pass, run the metric 3–5 times and take the majority vote or average. This is similar to the self-consistency approach in uncertainty estimation literature.
Add a confidence threshold gate. Rather than trusting the raw score, only accept verdicts where the judge's internal confidence is above a threshold (e.g., 0.75). Scores near the boundary (like 0.45–0.55) are inherently unstable — flag these as "uncertain" rather than forcing a binary pass/fail.
Check your custom template for ambiguous criteria. Legal queries often have overlapping categories — if your template criteria aren't mutually exclusive, the judge will score differently depending on which lens it applies first.
Hope this helps — happy to share more about the confidence-aware evaluation approach we've been using.

0 replies

akashcodes23 · 2026-04-04T09:40:15Z

akashcodes23
Apr 4, 2026

This is scoring instability from non-deterministic LLM judges. Three fixes that worked for us:

Set temperature=0 on your evaluation model — if it's not fixed, you'll get different scores every run
Run the metric 3x and take majority vote — self-consistency approach
Add a confidence threshold gate — scores between 0.45–0.55 are inherently unstable, flag them as "uncertain" rather than forcing pass/fail

Also check your custom template for overlapping criteria — legal queries often have multiple valid interpretations and the judge scores differently depending on which lens it applies first.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inconsistent scoring using a custom AnswerRelevancyTemplate #1674

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Inconsistent scoring using a custom AnswerRelevancyTemplate #1674

Uh oh!

MRLab12 Jun 9, 2025

Seeking Advice: Improving Scoring Consistency with Custom AnswerRelevancyTemplate

Context

Challenge

Current Implementation

Specific Questions

What I've Tried

Looking For

Replies: 2 comments

Uh oh!

akashcodes23 Apr 3, 2026

Uh oh!

akashcodes23 Apr 4, 2026

MRLab12
Jun 9, 2025

akashcodes23
Apr 3, 2026

akashcodes23
Apr 4, 2026