82 commits
4318329
Prepare evals SDK Release
May 28, 2025
192b980
Fix bug
May 28, 2025
758adb4
Fix for ADV_CONV for FDP projects
May 29, 2025
de09fd1
Update release date
May 29, 2025
ef60fe6
Merge branch 'main' into main
nagkumar91 May 29, 2025
8ca51d0
Merge branch 'Azure:main' into main
nagkumar91 May 30, 2025
98bfc3a
Merge branch 'Azure:main' into main
nagkumar91 Jun 2, 2025
a5f32e8
Merge branch 'Azure:main' into main
nagkumar91 Jun 9, 2025
5fd88b6
Merge branch 'Azure:main' into main
nagkumar91 Jun 10, 2025
51f2b44
Merge branch 'Azure:main' into main
nagkumar91 Jun 10, 2025
a5be8b5
Merge branch 'Azure:main' into main
nagkumar91 Jun 16, 2025
75965b7
Merge branch 'Azure:main' into main
nagkumar91 Jun 25, 2025
d0c5e53
Merge branch 'Azure:main' into main
nagkumar91 Jun 25, 2025
b790276
Merge branch 'Azure:main' into main
nagkumar91 Jun 26, 2025
d5ca243
Merge branch 'Azure:main' into main
nagkumar91 Jun 26, 2025
8d62e36
re-add pyrit to matrix
Jun 26, 2025
59a70f2
Change grader ids
Jun 26, 2025
4d146d7
Merge branch 'Azure:main' into main
nagkumar91 Jun 26, 2025
f7a4c83
Update unit test
Jun 27, 2025
79e3a40
replace all old grader IDs in tests
Jun 27, 2025
588cbec
Merge branch 'main' into main
nagkumar91 Jun 30, 2025
7514472
Update platform-matrix.json
nagkumar91 Jun 30, 2025
28b2513
Update test to ensure everything is mocked
Jul 1, 2025
8603e0e
tox/black fixes
Jul 1, 2025
895f226
Skip that test with issues
Jul 1, 2025
b4b2daf
Merge branch 'Azure:main' into main
nagkumar91 Jul 1, 2025
023f07f
update grader ID according to API View feedback
Jul 1, 2025
45b5f5d
Update test
Jul 2, 2025
1ccb4db
remove string check for grader ID
Jul 2, 2025
6fd9aa5
Merge branch 'Azure:main' into main
nagkumar91 Jul 2, 2025
f871855
Update changelog and officially start freeze
Jul 2, 2025
59ac230
update the enum according to suggestions
Jul 2, 2025
794a2c4
update the changelog
Jul 2, 2025
b33363c
Finalize logic
Jul 2, 2025
464e2dd
Merge branch 'Azure:main' into main
nagkumar91 Jul 3, 2025
4585b14
Merge branch 'Azure:main' into main
nagkumar91 Jul 7, 2025
89c2988
Initial plan
Copilot Jul 7, 2025
6805018
Fix client request ID headers in azure-ai-evaluation
Copilot Jul 7, 2025
aad48df
Fix client request ID header format in rai_service.py
Copilot Jul 7, 2025
db75552
Merge pull request #5 from nagkumar91/copilot/fix-4
nagkumar91 Jul 10, 2025
b8eebf3
Merge branch 'Azure:main' into main
nagkumar91 Jul 10, 2025
2899ad4
Merge branch 'Azure:main' into main
nagkumar91 Jul 10, 2025
c431563
Merge branch 'Azure:main' into main
nagkumar91 Jul 17, 2025
79ed63c
Merge branch 'Azure:main' into main
nagkumar91 Jul 18, 2025
a3be3fc
Merge branch 'Azure:main' into main
nagkumar91 Jul 21, 2025
056ac4d
Passing threshold in AzureOpenAIScoreModelGrader
Jul 21, 2025
1779059
Add changelog
Jul 21, 2025
43fecff
Adding the self.pass_threshold instead of pass_threshold
Jul 21, 2025
b0c102b
Merge branch 'Azure:main' into main
nagkumar91 Jul 22, 2025
7bf5f1f
Add the python grader
Jul 22, 2025
3248ad0
Remove redundant test
Jul 22, 2025
d76f59b
Add class to exception list and format code
Jul 23, 2025
4d60e43
Merge branch 'main' into feature/python_grader
nagkumar91 Jul 24, 2025
98d1626
Merge branch 'Azure:main' into main
nagkumar91 Jul 24, 2025
9248c38
Add properties to evaluation upload run for FDP
Jul 24, 2025
74b760f
Remove debug
Jul 24, 2025
23dbc85
Merge branch 'feature/python_grader'
Jul 24, 2025
467ccb6
Remove the redundant property
Jul 24, 2025
c2beee8
Merge branch 'Azure:main' into main
nagkumar91 Jul 24, 2025
be9a19a
Fix changelog
Jul 24, 2025
de3a1e1
Fix the multiple features added section
Jul 24, 2025
f9faa61
removed the properties in update
Jul 24, 2025
69e783a
Merge branch 'Azure:main' into main
nagkumar91 Jul 28, 2025
8ebea2a
Merge branch 'Azure:main' into main
nagkumar91 Jul 31, 2025
3f9c818
Merge branch 'Azure:main' into main
nagkumar91 Aug 1, 2025
3b3159c
Merge branch 'Azure:main' into main
nagkumar91 Aug 5, 2025
d78b834
Merge branch 'Azure:main' into main
nagkumar91 Aug 6, 2025
ae3fc52
Merge branch 'Azure:main' into main
nagkumar91 Aug 8, 2025
19cce75
evaluation: support is_reasoning_model across all prompty-based evalu…
Aug 8, 2025
e59ca7f
evaluation: docs(Preview) + groundedness feature-detection + is_reaso…
Aug 8, 2025
98b4618
evaluation: revert _proxy_completion_model.py to origin/main version
Aug 8, 2025
706c042
Merge branch 'Azure:main' into main
nagkumar91 Aug 11, 2025
c418513
Merge remote-tracking branch 'origin/main' into diff-20250811-171736
Aug 12, 2025
86f24ba
Restore files that shouldn't have been modified
Aug 12, 2025
a1e55b4
Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evalua…
nagkumar91 Aug 12, 2025
bd6809f
Update the groundedness based on comments
Aug 12, 2025
3ae37cb
Add changelog to bug fix and link issue
Aug 12, 2025
6b8d4ce
Fix docstring
Aug 12, 2025
733ee1a
lint fixes
Aug 12, 2025
8f39719
Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evalua…
nagkumar91 Aug 19, 2025
3cebd64
Support for reasoning models using the right client
Aug 20, 2025
d23fd3c
Formatting
Aug 21, 2025
8 changes: 8 additions & 0 deletions sdk/evaluation/azure-ai-evaluation/CHANGELOG.md
@@ -5,11 +5,19 @@
### Breaking Changes

### Features Added

- Added support for user-supplied tags in the `evaluate` function. Tags are key-value pairs that can be used for experiment tracking, A/B testing, filtering, and organizing evaluation runs. The function accepts a `tags` parameter.
- Enhanced `GroundednessEvaluator` to support AI agent evaluation with tool calls. The evaluator now accepts agent response data containing tool calls and can extract context from `file_search` tool results for groundedness assessment. This enables evaluation of AI agents that use tools to retrieve information and generate responses. Note: Agent groundedness evaluation is currently supported only when the `file_search` tool is used.
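For example, the new `tags` parameter can be passed directly to `evaluate` (a minimal sketch; the endpoint, data file, and tag values are illustrative):

```python
from azure.ai.evaluation import (
    AzureOpenAIModelConfiguration,
    CoherenceEvaluator,
    evaluate,
)

# Placeholder model configuration -- values are illustrative.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<resource>.openai.azure.com",
    azure_deployment="<deployment>",
    api_key="<key>",
)

result = evaluate(
    data="eval_data.jsonl",  # rows with "query" and "response" columns
    evaluators={"coherence": CoherenceEvaluator(model_config)},
    tags={"experiment": "prompt_v2", "team": "search"},  # free-form key-value pairs
)
```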

### Bugs Fixed

- [Bug](https://github.com/Azure/azure-sdk-for-python/issues/39909): Added `is_reasoning_model` keyword parameter to all evaluators
(`SimilarityEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`,
`RetrievalEvaluator`, `GroundednessEvaluator`, `IntentResolutionEvaluator`,
`ResponseCompletenessEvaluator`, `TaskAdherenceEvaluator`, `ToolCallAccuracyEvaluator`).
When set, evaluator configuration is adjusted appropriately for reasoning models.
`QAEvaluator` now propagates this parameter to its child evaluators.
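For instance, a prompty-based evaluator can be pointed at a reasoning-model deployment like this (a sketch; the configuration values are placeholders):

```python
from azure.ai.evaluation import AzureOpenAIModelConfiguration, RelevanceEvaluator

# Placeholder configuration assumed to target a reasoning-model deployment.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<resource>.openai.azure.com",
    azure_deployment="<reasoning-deployment>",
    api_key="<key>",
)

relevance = RelevanceEvaluator(model_config, is_reasoning_model=True)
result = relevance(
    query="What does HTTP status 404 mean?",
    response="The requested resource was not found on the server.",
)
```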

### Other Changes

## 1.10.0 (2025-07-31)
@@ -178,8 +178,6 @@ def get_metrics(self, client_run: BatchClientRun) -> Dict[str, Any]:
run = self._get_result(client_run)
try:
aggregated_metrics = run.get_aggregated_metrics()
print("Aggregated metrics")
print(aggregated_metrics)
except Exception as ex: # pylint: disable=broad-exception-caught
LOGGER.debug("Error calculating metrics for evaluator %s, failed with error %s", run.evaluator_name, ex)
return {}
@@ -1028,36 +1028,36 @@ def _preprocess_data(
batch_run_data: Union[str, os.PathLike, pd.DataFrame] = data

def get_client_type(evaluate_kwargs: Dict[str, Any]) -> Literal["run_submitter", "pf_client", "code_client"]:
"""Determines the BatchClient to use from provided kwargs (_use_run_submitter_client and _use_pf_client)"""
_use_run_submitter_client = cast(Optional[bool], kwargs.pop("_use_run_submitter_client", None))
_use_pf_client = cast(Optional[bool], kwargs.pop("_use_pf_client", None))
"""Pick batch client from kwargs: _use_run_submitter_client and _use_pf_client."""
_use_run = cast(Optional[bool], evaluate_kwargs.pop("_use_run_submitter_client", None))
_use_pf = cast(Optional[bool], evaluate_kwargs.pop("_use_pf_client", None))

if _use_run_submitter_client is None and _use_pf_client is None:
# If both are unset, return default
if _use_run is None and _use_pf is None:
return "run_submitter"

if _use_run_submitter_client and _use_pf_client:
if _use_run and _use_pf:
raise EvaluationException(
message="Only one of _use_pf_client and _use_run_submitter_client should be set to True.",
message=("Only one of _use_pf_client and _use_run_submitter_client " "should be set to True."),
target=ErrorTarget.EVALUATE,
category=ErrorCategory.INVALID_VALUE,
blame=ErrorBlame.USER_ERROR,
)

if _use_run_submitter_client == False and _use_pf_client == False:
if _use_run is False and _use_pf is False:
return "code_client"

if _use_run_submitter_client:
if _use_run:
return "run_submitter"
if _use_pf_client:
if _use_pf:
return "pf_client"

if _use_run_submitter_client is None and _use_pf_client == False:
if _use_run is None and _use_pf is False:
return "run_submitter"
if _use_run_submitter_client == False and _use_pf_client is None:
if _use_run is False and _use_pf is None:
return "pf_client"

assert False, "This should be impossible"
# Defensive default
return "run_submitter"
Contributor: Should we let it fail like before?


client_type: Literal["run_submitter", "pf_client", "code_client"] = get_client_type(kwargs)
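Distilled, the selection above behaves like the following sketch (`pick_client` is a hypothetical helper used only to summarize the decision table):

```python
from typing import Optional


def pick_client(use_run_submitter: Optional[bool], use_pf: Optional[bool]) -> str:
    """Summarize get_client_type's decision table."""
    if use_run_submitter and use_pf:
        raise ValueError("Only one of the two flags may be True.")
    if use_run_submitter:
        return "run_submitter"
    if use_pf:
        return "pf_client"
    if use_run_submitter is False and use_pf is False:
        return "code_client"
    if use_run_submitter is False and use_pf is None:
        return "pf_client"
    # (None, None) and (None, False) fall back to the default client.
    return "run_submitter"


assert pick_client(None, None) == "run_submitter"
assert pick_client(False, False) == "code_client"
assert pick_client(False, None) == "pf_client"
assert pick_client(None, False) == "run_submitter"
```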

@@ -12,17 +12,22 @@

class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
"""
Evaluates coherence score for a given query and response or a multi-turn conversation, including reasoning.
Evaluates coherence for a given query and response or a multi-turn
conversation, including reasoning.

The coherence measure assesses the ability of the language model to generate text that reads naturally,
flows smoothly, and resembles human-like language in its responses. Use it when assessing the readability
and user-friendliness of a model's generated responses in real-world applications.
The coherence measure assesses the model's ability to generate text that
reads naturally, flows smoothly, and resembles human-like language. Use it
when assessing the readability and user-friendliness of responses.

:param model_config: Configuration for the Azure OpenAI model.
:type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
:type model_config:
Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
~azure.ai.evaluation.OpenAIModelConfiguration]
:param threshold: The threshold for the coherence evaluator. Default is 3.
:type threshold: int
:keyword is_reasoning_model: (Preview) If True, the chat completions
configuration is adjusted for reasoning models.
:type is_reasoning_model: bool

.. admonition:: Example:

@@ -31,7 +36,8 @@ class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
:end-before: [END coherence_evaluator]
:language: python
:dedent: 8
:caption: Initialize and call CoherenceEvaluator using azure.ai.evaluation.AzureAIProject
:caption: Initialize and call CoherenceEvaluator using
azure.ai.evaluation.AzureAIProject

.. admonition:: Example using Azure AI Project URL:

@@ -40,7 +46,8 @@ class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
:end-before: [END coherence_evaluator]
:language: python
:dedent: 8
:caption: Initialize and call CoherenceEvaluator using Azure AI Project URL in following format
:caption: Initialize and call CoherenceEvaluator using an Azure AI
Project URL in the following format:
https://{resource_name}.services.ai.azure.com/api/projects/{project_name}

.. admonition:: Example with Threshold:
@@ -50,23 +57,24 @@ class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
:end-before: [END threshold_coherence_evaluator]
:language: python
:dedent: 8
:caption: Initialize with threshold and call a CoherenceEvaluator with a query and response.
:caption: Initialize with threshold and call a CoherenceEvaluator
with a query and response.

.. note::

To align with our support of a diverse set of models, an output key without the `gpt_` prefix has been added.
To maintain backwards compatibility, the old key with the `gpt_` prefix is still be present in the output;
however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
To align with support of diverse models, an output key without the
`gpt_` prefix has been added. The old key with the `gpt_` prefix is
still present for compatibility; however, it will be deprecated.
"""

_PROMPTY_FILE = "coherence.prompty"
_RESULT_KEY = "coherence"

id = "azureai://built-in/evaluators/coherence"
"""Evaluator identifier, experimental and to be used only with evaluation in cloud."""
"""Evaluator identifier, experimental to be used only with cloud evaluation"""

@override
def __init__(self, model_config, *, threshold=3):
def __init__(self, model_config, *, threshold=3, **kwargs):
current_dir = os.path.dirname(__file__)
prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
self._threshold = threshold
@@ -77,6 +85,7 @@ def __init__(self, model_config, *, threshold=3):
result_key=self._RESULT_KEY,
threshold=threshold,
_higher_is_better=self._higher_is_better,
**kwargs,
)

@overload
@@ -104,9 +113,11 @@ def __call__(
) -> Dict[str, Union[float, Dict[str, List[Union[str, float]]]]]:
"""Evaluate coherence for a conversation

:keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
key "messages", and potentially a global context under the key "context". Conversation turns are expected
to be dictionaries with keys "content", "role", and possibly "context".
:keyword conversation: The conversation to evaluate. Expected to
contain a list of conversation turns under the key "messages",
and optionally a global context under the key "context". Turns are
dictionaries with keys "content", "role", and possibly
"context".
:paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
:return: The coherence score.
:rtype: Dict[str, Union[float, Dict[str, List[float]]]]
@@ -118,19 +129,22 @@ def __call__( # pylint: disable=docstring-missing-param
*args,
**kwargs,
):
"""Evaluate coherence. Accepts either a query and response for a single evaluation,
or a conversation for a potentially multi-turn evaluation. If the conversation has more than one pair of
turns, the evaluator will aggregate the results of each turn.
"""Evaluate coherence.

Accepts a query/response for a single evaluation, or a conversation
for a multi-turn evaluation. If the conversation has more than one
pair of turns, results are aggregated.

:keyword query: The query to be evaluated.
:paramtype query: str
:keyword response: The response to be evaluated.
:paramtype response: Optional[str]
:keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
key "messages". Conversation turns are expected
to be dictionaries with keys "content" and "role".
:keyword conversation: The conversation to evaluate. Expected to
contain conversation turns under the key "messages" as
dictionaries with keys "content" and "role".
:paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
:return: The coherence score.
:rtype: Union[Dict[str, float], Dict[str, Union[float, Dict[str, List[float]]]]]
:rtype: Union[Dict[str, float], Dict[str, Union[float, Dict[str,
List[float]]]]]
"""
return super().__call__(*args, **kwargs)
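A minimal usage sketch for the updated evaluator; the endpoint, deployment, and inputs are placeholders:

```python
from azure.ai.evaluation import AzureOpenAIModelConfiguration, CoherenceEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<resource>.openai.azure.com",
    azure_deployment="<deployment>",
    api_key="<key>",
)

# is_reasoning_model is forwarded through **kwargs to PromptyEvaluatorBase.
coherence = CoherenceEvaluator(model_config, threshold=3, is_reasoning_model=True)

# Single query/response evaluation.
single = coherence(
    query="Briefly explain the CAP theorem.",
    response="It describes the trade-off between consistency, availability, and partition tolerance.",
)
print(single["coherence"])

# Multi-turn conversation; per-turn results are aggregated.
conversation = {
    "messages": [
        {"role": "user", "content": "What is azure-ai-evaluation?"},
        {"role": "assistant", "content": "A library for scoring model and agent outputs."},
    ]
}
print(coherence(conversation=conversation))
```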
@@ -4,7 +4,9 @@
from concurrent.futures import as_completed
from typing import TypeVar, Dict, List

from azure.ai.evaluation._legacy._adapters.tracing import ThreadPoolExecutorWithContext as ThreadPoolExecutor
from azure.ai.evaluation._legacy._adapters.tracing import (
ThreadPoolExecutorWithContext as ThreadPoolExecutor,
)
from typing_extensions import override

from azure.ai.evaluation._evaluators._common import EvaluatorBase
@@ -3,22 +3,35 @@
# ---------------------------------------------------------

import math
import re
import os
import re
from typing import Dict, TypeVar, Union

if os.getenv("AI_EVALS_USE_PF_PROMPTY", "false").lower() == "true":
from promptflow.core._flow import AsyncPrompty
else:
from azure.ai.evaluation._legacy.prompty import AsyncPrompty
from azure.ai.evaluation._legacy.prompty import (
AsyncPrompty as _LegacyAsyncPrompty,
)
from typing_extensions import override

from azure.ai.evaluation._common.constants import PROMPT_BASED_REASON_EVALUATORS
from azure.ai.evaluation._common.constants import (
PROMPT_BASED_REASON_EVALUATORS,
)
from azure.ai.evaluation._constants import EVALUATION_PASS_FAIL_MAPPING
from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
from ..._common.utils import construct_prompty_model_config, validate_model_config, parse_quality_evaluator_reason_score
from azure.ai.evaluation._exceptions import (
EvaluationException,
ErrorBlame,
ErrorCategory,
ErrorTarget,
)
from ..._common.utils import (
construct_prompty_model_config,
validate_model_config,
parse_quality_evaluator_reason_score,
)
from . import EvaluatorBase

_PFAsyncPrompty = None # type: ignore[assignment]


try:
from ..._user_agent import UserAgentSingleton
except ImportError:
@@ -71,7 +84,11 @@ def __init__(
self._prompty_file = prompty_file
self._threshold = threshold
self._higher_is_better = _higher_is_better
super().__init__(eval_last_turn=eval_last_turn, threshold=threshold, _higher_is_better=_higher_is_better)
super().__init__(
eval_last_turn=eval_last_turn,
threshold=threshold,
_higher_is_better=_higher_is_better,
)

subclass_name = self.__class__.__name__
user_agent = f"{UserAgentSingleton().value} (type=evaluator subtype={subclass_name})"
@@ -80,9 +97,28 @@ def __init__(
self._DEFAULT_OPEN_API_VERSION,
user_agent,
)

self._flow = AsyncPrompty.load(
source=self._prompty_file, model=prompty_model_config, is_reasoning_model=self._is_reasoning_model
# Choose backend: force legacy prompty for reasoning models so
# parameter adaptation applies
use_pf = os.getenv("AI_EVALS_USE_PF_PROMPTY", "false").lower() == "true"
if self._is_reasoning_model:
AsyncPromptyClass = _LegacyAsyncPrompty
else:
if use_pf and _PFAsyncPrompty is None:
try:
from promptflow.core._flow import ( # type: ignore
AsyncPrompty as _PFAsyncPrompty_import,
)

# assign to module-level for reuse
globals()["_PFAsyncPrompty"] = _PFAsyncPrompty_import
except Exception: # pragma: no cover - PF not available
globals()["_PFAsyncPrompty"] = None
AsyncPromptyClass = _PFAsyncPrompty if (use_pf and _PFAsyncPrompty is not None) else _LegacyAsyncPrompty

self._flow = AsyncPromptyClass.load( # type: ignore[call-arg]
source=self._prompty_file,
model=prompty_model_config,
is_reasoning_model=self._is_reasoning_model,
)

# __call__ not overridden here because child classes have such varied signatures that there's no point
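The prompty backend can also be toggled globally through the environment variable read above (a sketch; `AI_EVALS_USE_PF_PROMPTY` is an internal switch, and reasoning models still force the legacy backend per the logic above):

```python
import os

# Must be set before the evaluator modules are imported, because the flag
# is read at import time.
os.environ["AI_EVALS_USE_PF_PROMPTY"] = "true"

from azure.ai.evaluation import FluencyEvaluator  # noqa: E402
```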