Summary
Turn._mcp_interaction is declared as a PrivateAttr(default=False) and is intended to be set to True in the mode="before" model validator when mcp_tools_called, mcp_resources_called, or mcp_prompts_called is present. Under Pydantic v2 this assignment has no effect: the flag stays False for every turn, so MultiTurnMCPUseMetric and MCPTaskCompletionMetric produce severely degraded scores on all conversational test cases that use MCP tool calls.
deepeval version: 3.9.2
pydantic version: 2.12.5
Root Cause
In deepeval/test_case/conversational_test_case.py, the mode="before" validator sets:
data["_mcp_interaction"] = True
In Pydantic v1, this worked because private attributes were initialized from the constructor data dict. In Pydantic v2, PrivateAttr fields are explicitly excluded from __init__ and from the validated field set, so any key starting with _ that isn't a model field is silently dropped after the validator returns. The __pydantic_private__ dict is initialized to {"_mcp_interaction": False} regardless of what the validator wrote into data.
You can verify:
import mcp.types
from deepeval.test_case import MCPToolCall, Turn

result = mcp.types.CallToolResult(content=[], structuredContent={"result": {}}, isError=False)
t = Turn(
    role="assistant",
    content="Looking up...",
    mcp_tools_called=[MCPToolCall(name="lookup_legislator", args={}, result=result)],
)
print(t.__pydantic_private__)  # {'_mcp_interaction': False}
print(t._mcp_interaction)  # False, but should be True
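The same behaviour can be reproduced without deepeval or mcp installed. The following is a minimal sketch that assumes only Pydantic v2; DemoTurn, tools_called, _flag, and set_flag are hypothetical stand-ins for Turn and its fields, not deepeval's actual code.

```python
from typing import Any, List, Optional

from pydantic import BaseModel, PrivateAttr, model_validator


class DemoTurn(BaseModel):
    # Hypothetical stand-in for deepeval's Turn; only the relevant parts are modelled.
    content: str
    tools_called: Optional[List[str]] = None
    _flag: bool = PrivateAttr(default=False)

    @model_validator(mode="before")
    @classmethod
    def set_flag(cls, data: Any) -> Any:
        # Same pattern as in the report: write the private name into the input dict.
        if isinstance(data, dict) and data.get("tools_called") is not None:
            data["_flag"] = True
        return data


turn = DemoTurn(content="Looking up...", tools_called=["lookup_legislator"])
print(turn._flag)  # the private attribute keeps its default value
```

Under Pydantic v2 this should print False even though tools_called is set, which matches what the Turn reproduction above shows.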
Impact
MultiTurnMCPUseMetric._get_tasks() and MCPTaskCompletionMetric._get_tasks() gate all tool-call rendering on turn._mcp_interaction:
if turn._mcp_interaction:
    # render <Tool Called> block for the judge
    ...
else:
    new_task.steps_taken.append("Agent's response to user: \n" + turn.content)
Since _mcp_interaction is always False, every turn, including turns with mcp_tools_called set, falls into the else branch. For MCP turns the judge therefore sees only turn.content, never the structured tool call details (tool name, args, or result). Because the judge prompt never contains a <Tool Called> section, the judge cannot see which tools were invoked or what they returned, which leads to severely degraded scores.
Issue #2138 / PR #2141 patched a downstream ZeroDivisionError caused by this bug (empty task lists), but didn't address the root cause.
Fix
Replace the PrivateAttr + broken validator with a @property. The value is fully derivable from existing fields, so there's no reason to store it:
# Remove this:
_mcp_interaction: bool = PrivateAttr(default=False)
# And the data["_mcp_interaction"] = True line in the validator.

# Add this:
@property
def _mcp_interaction(self) -> bool:
    return (
        self.mcp_tools_called is not None
        or self.mcp_resources_called is not None
        or self.mcp_prompts_called is not None
    )
Alternatively, change the validator to mode="after" so self is the live instance:
@model_validator(mode="after")
def set_mcp_interaction(self):
    if (
        self.mcp_tools_called is not None
        or self.mcp_resources_called is not None
        or self.mcp_prompts_called is not None
    ):
        self._mcp_interaction = True
    return self
The @property approach is cleaner since it eliminates stored state entirely.
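Both options can be checked in isolation. The sketch below assumes Pydantic v2 and uses hypothetical stand-in models (PropertyTurn, AfterValidatorTurn, tools_called, _flag) in place of Turn, so it illustrates the two approaches without being a patch against deepeval itself.

```python
from typing import List, Optional

from pydantic import BaseModel, PrivateAttr, model_validator


class PropertyTurn(BaseModel):
    # Option 1: derive the flag on every access; nothing is stored.
    content: str
    tools_called: Optional[List[str]] = None

    @property
    def _flag(self) -> bool:
        return self.tools_called is not None


class AfterValidatorTurn(BaseModel):
    # Option 2: keep the PrivateAttr, but set it on the constructed instance.
    content: str
    tools_called: Optional[List[str]] = None
    _flag: bool = PrivateAttr(default=False)

    @model_validator(mode="after")
    def set_flag(self) -> "AfterValidatorTurn":
        if self.tools_called is not None:
            self._flag = True
        return self


for model in (PropertyTurn, AfterValidatorTurn):
    with_tools = model(content="Looking up...", tools_called=["lookup_legislator"])
    without_tools = model(content="Hello")
    print(model.__name__, with_tools._flag, without_tools._flag)
```

One difference worth noting: the property re-evaluates whenever it is read, so it stays correct if tools_called is reassigned after construction, whereas the stored flag reflects only the state at validation time.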
Secondary Issue: turn.content Ignored on MCP Turns
A related gap in _get_tasks(): when _mcp_interaction is True, the method renders only the tool name, args, and structuredContent result, and ignores turn.content entirely. I'd like to use turn.content on MCP turns as a user-visible status message (e.g. "Looking up your address..."). That message is never surfaced to the judge, so the judge evaluates the conversation as if the user experienced a silent gap during tool execution. Including turn.content when it is non-empty would give the judge a more accurate picture of what the user actually saw.
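A rough sketch of the proposed behaviour, using only the standard library. DemoToolCall, DemoTurn, and render_steps are hypothetical stand-ins, and the layout of the <Tool Called> block is a placeholder rather than the format deepeval actually emits; the only point is that non-empty content is kept alongside the tool details.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class DemoToolCall:
    # Hypothetical stand-in for MCPToolCall.
    name: str
    args: Dict[str, Any]
    result: Any


@dataclass
class DemoTurn:
    # Hypothetical stand-in for Turn.
    content: str
    mcp_tools_called: Optional[List[DemoToolCall]] = None


def render_steps(turn: DemoTurn) -> List[str]:
    """Render one turn into judge-visible steps, keeping content on MCP turns."""
    steps: List[str] = []
    if turn.mcp_tools_called is not None:
        # Proposed change: surface the status message the user saw, if any.
        if turn.content:
            steps.append("Agent's response to user: \n" + turn.content)
        for call in turn.mcp_tools_called:
            # Placeholder layout, not deepeval's real template.
            steps.append(
                "<Tool Called>\n"
                f"Name: {call.name}\nArgs: {call.args}\nResult: {call.result}\n"
                "</Tool Called>"
            )
    else:
        steps.append("Agent's response to user: \n" + turn.content)
    return steps


turn = DemoTurn(
    content="Looking up your address...",
    mcp_tools_called=[DemoToolCall("lookup_legislator", {}, {"result": {}})],
)
for step in render_steps(turn):
    print(step)
```

An MCP turn with empty content would still render only the tool block, so existing test cases that leave content blank would be unaffected under this sketch.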