Closed
Changes from all commits
Commits
108 commits
4318329
Prepare evals SDK Release
May 28, 2025
192b980
Fix bug
May 28, 2025
758adb4
Fix for ADV_CONV for FDP projects
May 29, 2025
de09fd1
Update release date
May 29, 2025
ef60fe6
Merge branch 'main' into main
nagkumar91 May 29, 2025
8ca51d0
Merge branch 'Azure:main' into main
nagkumar91 May 30, 2025
98bfc3a
Merge branch 'Azure:main' into main
nagkumar91 Jun 2, 2025
a5f32e8
Merge branch 'Azure:main' into main
nagkumar91 Jun 9, 2025
5fd88b6
Merge branch 'Azure:main' into main
nagkumar91 Jun 10, 2025
51f2b44
Merge branch 'Azure:main' into main
nagkumar91 Jun 10, 2025
a5be8b5
Merge branch 'Azure:main' into main
nagkumar91 Jun 16, 2025
75965b7
Merge branch 'Azure:main' into main
nagkumar91 Jun 25, 2025
d0c5e53
Merge branch 'Azure:main' into main
nagkumar91 Jun 25, 2025
b790276
Merge branch 'Azure:main' into main
nagkumar91 Jun 26, 2025
d5ca243
Merge branch 'Azure:main' into main
nagkumar91 Jun 26, 2025
8d62e36
re-add pyrit to matrix
Jun 26, 2025
59a70f2
Change grader ids
Jun 26, 2025
4d146d7
Merge branch 'Azure:main' into main
nagkumar91 Jun 26, 2025
f7a4c83
Update unit test
Jun 27, 2025
79e3a40
replace all old grader IDs in tests
Jun 27, 2025
588cbec
Merge branch 'main' into main
nagkumar91 Jun 30, 2025
7514472
Update platform-matrix.json
nagkumar91 Jun 30, 2025
28b2513
Update test to ensure everything is mocked
Jul 1, 2025
8603e0e
tox/black fixes
Jul 1, 2025
895f226
Skip that test with issues
Jul 1, 2025
b4b2daf
Merge branch 'Azure:main' into main
nagkumar91 Jul 1, 2025
023f07f
update grader ID according to API View feedback
Jul 1, 2025
45b5f5d
Update test
Jul 2, 2025
1ccb4db
remove string check for grader ID
Jul 2, 2025
6fd9aa5
Merge branch 'Azure:main' into main
nagkumar91 Jul 2, 2025
f871855
Update changelog and officially start freeze
Jul 2, 2025
59ac230
update the enum according to suggestions
Jul 2, 2025
794a2c4
update the changelog
Jul 2, 2025
b33363c
Finalize logic
Jul 2, 2025
464e2dd
Merge branch 'Azure:main' into main
nagkumar91 Jul 3, 2025
4585b14
Merge branch 'Azure:main' into main
nagkumar91 Jul 7, 2025
89c2988
Initial plan
Copilot Jul 7, 2025
6805018
Fix client request ID headers in azure-ai-evaluation
Copilot Jul 7, 2025
aad48df
Fix client request ID header format in rai_service.py
Copilot Jul 7, 2025
db75552
Merge pull request #5 from nagkumar91/copilot/fix-4
nagkumar91 Jul 10, 2025
b8eebf3
Merge branch 'Azure:main' into main
nagkumar91 Jul 10, 2025
2899ad4
Merge branch 'Azure:main' into main
nagkumar91 Jul 10, 2025
c431563
Merge branch 'Azure:main' into main
nagkumar91 Jul 17, 2025
79ed63c
Merge branch 'Azure:main' into main
nagkumar91 Jul 18, 2025
a3be3fc
Merge branch 'Azure:main' into main
nagkumar91 Jul 21, 2025
056ac4d
Passing threshold in AzureOpenAIScoreModelGrader
Jul 21, 2025
1779059
Add changelog
Jul 21, 2025
43fecff
Adding the self.pass_threshold instead of pass_threshold
Jul 21, 2025
b0c102b
Merge branch 'Azure:main' into main
nagkumar91 Jul 22, 2025
7bf5f1f
Add the python grader
Jul 22, 2025
3248ad0
Remove redundant test
Jul 22, 2025
d76f59b
Add class to exception list and format code
Jul 23, 2025
4d60e43
Merge branch 'main' into feature/python_grader
nagkumar91 Jul 24, 2025
98d1626
Merge branch 'Azure:main' into main
nagkumar91 Jul 24, 2025
9248c38
Add properties to evaluation upload run for FDP
Jul 24, 2025
74b760f
Remove debug
Jul 24, 2025
23dbc85
Merge branch 'feature/python_grader'
Jul 24, 2025
467ccb6
Remove the redundant property
Jul 24, 2025
c2beee8
Merge branch 'Azure:main' into main
nagkumar91 Jul 24, 2025
be9a19a
Fix changelog
Jul 24, 2025
de3a1e1
Fix the multiple features added section
Jul 24, 2025
f9faa61
removed the properties in update
Jul 24, 2025
69e783a
Merge branch 'Azure:main' into main
nagkumar91 Jul 28, 2025
8ebea2a
Merge branch 'Azure:main' into main
nagkumar91 Jul 31, 2025
3f9c818
Merge branch 'Azure:main' into main
nagkumar91 Aug 1, 2025
3b3159c
Merge branch 'Azure:main' into main
nagkumar91 Aug 5, 2025
d78b834
Merge branch 'Azure:main' into main
nagkumar91 Aug 6, 2025
ae3fc52
Merge branch 'Azure:main' into main
nagkumar91 Aug 8, 2025
19cce75
evaluation: support is_reasoning_model across all prompty-based evalu…
Aug 8, 2025
e59ca7f
evaluation: docs(Preview) + groundedness feature-detection + is_reaso…
Aug 8, 2025
98b4618
evaluation: revert _proxy_completion_model.py to origin/main version
Aug 8, 2025
706c042
Merge branch 'Azure:main' into main
nagkumar91 Aug 11, 2025
c418513
Merge remote-tracking branch 'origin/main' into diff-20250811-171736
Aug 12, 2025
86f24ba
Restore files that shouldn't have been modified
Aug 12, 2025
f91ee63
Merge branch 'Azure:main' into main
nagkumar91 Aug 12, 2025
a1e55b4
Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evalua…
nagkumar91 Aug 12, 2025
bd6809f
Update the groundedness based on comments
Aug 12, 2025
3ae37cb
Add changelog to bug fix and link issue
Aug 12, 2025
6b8d4ce
Fix docstring
Aug 12, 2025
733ee1a
lint fixes
Aug 12, 2025
8f39719
Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evalua…
nagkumar91 Aug 19, 2025
7b3b889
Merge branch 'Azure:main' into main
nagkumar91 Aug 20, 2025
3cebd64
Support for reasoning models using the right client
Aug 20, 2025
d23fd3c
Formatting
Aug 21, 2025
21b71cf
Merge branch 'Azure:main' into main
nagkumar91 Sep 9, 2025
823177a
Merge branch 'Azure:main' into main
nagkumar91 Sep 12, 2025
48f630e
Merge branch 'Azure:main' into main
nagkumar91 Sep 15, 2025
33cf2a2
Merge branch 'Azure:main' into main
nagkumar91 Sep 18, 2025
9ab6f9d
Merge remote-tracking branch 'refs/remotes/origin/main'
Sep 19, 2025
ea3a38c
merge: resolve conflicts in evaluation evaluators; align reasoning-mo…
Sep 22, 2025
72388cc
evaluation(evaluate): revert flag renames and defensive default; keep…
Sep 22, 2025
f403b99
Prefer SDK prompty for reasoning; PF wrapper; fix param sanitization;…
Sep 23, 2025
2dde27d
Merge branch 'Azure:main' into diff-20250811-171736
nagkumar91 Sep 23, 2025
4581b0e
prompty: remove AZEVAL_USE_LEGACY_PROMPTY, unify selection via kwargs…
Sep 23, 2025
a8869e5
Delete sdk/evaluation/azure-ai-evaluation/samples/aoai_score_model_gr…
nagkumar91 Sep 24, 2025
cea9be1
Remove tracked .log files and samples directory
Sep 24, 2025
ed1f9f7
Restore samples directory from origin/main
Sep 24, 2025
4bfd6e9
Prompty: robust parameters handling for reasoning models (coalesce No…
Sep 24, 2025
c100fc8
Evaluate: rename get_client_type param to kwargs for clarity
Sep 24, 2025
91567e2
Tests: clarify comments about reasoning models removing temperature/m…
Sep 24, 2025
2f43619
Evaluation refactors and cleanup:
Sep 25, 2025
40fb39c
Apply evaluation refactors and cleanup into diff branch
Sep 25, 2025
2f2ceec
Revert client selection to legacy tri-state behavior to satisfy tests…
Sep 25, 2025
3115a75
Delete sdk/evaluation/azure-ai-evaluation/tests/2025_09_25__08_49.log
nagkumar91 Sep 25, 2025
4ff307d
Delete sdk/evaluation/azure-ai-evaluation/samples/.gitignore
nagkumar91 Sep 25, 2025
6363cc8
lint fixes
nagkumar91 Sep 30, 2025
1b613f9
skip until new way of passing credentials is supported
nagkumar91 Sep 30, 2025
d9845d4
Fix issue
nagkumar91 Sep 30, 2025
8 changes: 8 additions & 0 deletions sdk/evaluation/azure-ai-evaluation/CHANGELOG.md
@@ -15,6 +15,7 @@
## 1.11.0 (2025-09-03)

### Features Added

- Added support for user-supplied tags in the `evaluate` function. Tags are key-value pairs that can be used for experiment tracking, A/B testing, filtering, and organizing evaluation runs. The function accepts a `tags` parameter.
- Added support for user-supplied TokenCredentials with LLM based evaluators.
- Enhanced `GroundednessEvaluator` to support AI agent evaluation with tool calls. The evaluator now accepts agent response data containing tool calls and can extract context from `file_search` tool results for groundedness assessment. This enables evaluation of AI agents that use tools to retrieve information and generate responses. Note: Agent groundedness evaluation is currently supported only when the `file_search` tool is used.
@@ -24,6 +25,13 @@
### Bugs Fixed
- Fixed issue where evaluation results were not properly aligned with input data, leading to incorrect metrics being reported.

- [Bug](https://github.com/Azure/azure-sdk-for-python/issues/39909): Added `is_reasoning_model` keyword parameter to all evaluators
(`SimilarityEvaluator`, `RelevanceEvaluator`, `CoherenceEvaluator`, `FluencyEvaluator`,
`RetrievalEvaluator`, `GroundednessEvaluator`, `IntentResolutionEvaluator`,
`ResponseCompletenessEvaluator`, `TaskAdherenceEvaluator`, `ToolCallAccuracyEvaluator`).
When set, evaluator configuration is adjusted appropriately for reasoning models.
`QAEvaluator` now propagates this parameter to its child evaluators.

### Other Changes
- Deprecating `AdversarialSimulator` in favor of the [AI Red Teaming Agent](https://aka.ms/airedteamingagent-sample). `AdversarialSimulator` will be removed in the next minor release.
- Moved retry configuration constants (`MAX_RETRY_ATTEMPTS`, `MAX_RETRY_WAIT_SECONDS`, `MIN_RETRY_WAIT_SECONDS`) from `RedTeam` class to new `RetryManager` class for better code organization and configurability.
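To illustrate the `is_reasoning_model` keyword described in the changelog entry above, here is a minimal sketch (not part of the diff); the endpoint, deployment, and key values are placeholders:

```python
# Sketch only: shows the is_reasoning_model keyword from the changelog entry
# above. Endpoint, deployment, and key values are placeholders.
from azure.ai.evaluation import AzureOpenAIModelConfiguration, CoherenceEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    azure_deployment="<your-reasoning-model-deployment>",
    api_key="<your-api-key>",
)

# When True, the evaluator adjusts its chat-completions configuration for
# reasoning models; per the entry above, QAEvaluator propagates the flag
# to its child evaluators.
coherence = CoherenceEvaluator(model_config=model_config, is_reasoning_model=True)
result = coherence(
    query="What is the capital of France?",
    response="Paris is the capital of France.",
)
print(result["coherence"])
```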
@@ -93,7 +93,8 @@ def _is_aoi_model_config(val: object) -> TypeGuard[AzureOpenAIModelConfiguration


def _is_openai_model_config(val: object) -> TypeGuard[OpenAIModelConfiguration]:
return isinstance(val, dict) and all(isinstance(val.get(k), str) for k in ("model"))
# Minimal OpenAI configuration requires a model name; other fields may be provided
return isinstance(val, dict) and isinstance(val.get("model"), str)


def parse_model_config_type(
@@ -181,6 +182,11 @@ def validate_azure_ai_project(o: object) -> AzureAIProject:


def validate_model_config(config: dict) -> Union[AzureOpenAIModelConfiguration, OpenAIModelConfiguration]:
# Accept minimal OpenAI config (e.g., {"model": "gpt-4o-mini"}) to support
# evaluator initialization and prompty plumbing in tests and dry runs.
if _is_openai_model_config(config):
return cast(OpenAIModelConfiguration, config)

try:
return _validate_typed_dict(config, AzureOpenAIModelConfiguration)
except TypeError:
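A short sketch of the behavior the hunk above aims for: a dict with just a `model` key is treated as a minimal OpenAI configuration, while Azure configs still go through the stricter TypedDict validation. The import path points at an internal module and is an assumption, not a public contract:

```python
# Illustrative only: validate_model_config lives in an internal module
# (azure.ai.evaluation._common.utils in current source), so this import
# path is an assumption and may change.
from azure.ai.evaluation._common.utils import validate_model_config

# Minimal OpenAI-style config: only "model" is required by the relaxed check.
cfg = validate_model_config({"model": "gpt-4o-mini", "api_key": "<key>"})
print(cfg["model"])  # -> "gpt-4o-mini"

# An Azure OpenAI config takes the stricter TypedDict validation path.
azure_cfg = validate_model_config(
    {
        "azure_endpoint": "https://<resource>.openai.azure.com",
        "azure_deployment": "<deployment>",
        "api_key": "<key>",
    }
)
```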
@@ -178,8 +178,6 @@ def get_metrics(self, client_run: BatchClientRun) -> Dict[str, Any]:
run = self._get_result(client_run)
try:
aggregated_metrics = run.get_aggregated_metrics()
print("Aggregated metrics")
print(aggregated_metrics)
except Exception as ex: # pylint: disable=broad-exception-caught
LOGGER.debug("Error calculating metrics for evaluator %s, failed with error %s", run.evaluator_name, ex)
return {}
@@ -1028,12 +1028,15 @@ def _preprocess_data(
batch_run_data: Union[str, os.PathLike, pd.DataFrame] = data

def get_client_type(evaluate_kwargs: Dict[str, Any]) -> Literal["run_submitter", "pf_client", "code_client"]:
"""Determines the BatchClient to use from provided kwargs (_use_run_submitter_client and _use_pf_client)"""
"""Determines the BatchClient to use from provided kwargs (_use_run_submitter_client and _use_pf_client).

Defaults to run_submitter. Explicit flags select alternate clients and certain tri-state combinations are
preserved for backward compatibility.
"""
_use_run_submitter_client = cast(Optional[bool], kwargs.pop("_use_run_submitter_client", None))
_use_pf_client = cast(Optional[bool], kwargs.pop("_use_pf_client", None))

if _use_run_submitter_client is None and _use_pf_client is None:
# If both are unset, return default
return "run_submitter"

if _use_run_submitter_client and _use_pf_client:
@@ -1044,20 +1047,21 @@ def get_client_type(evaluate_kwargs: Dict[str, Any]) -> Literal["run_submitter",
blame=ErrorBlame.USER_ERROR,
)

if _use_run_submitter_client == False and _use_pf_client == False:
if _use_run_submitter_client is False and _use_pf_client is False:
return "code_client"

if _use_run_submitter_client:
return "run_submitter"
if _use_pf_client:
return "pf_client"

if _use_run_submitter_client is None and _use_pf_client == False:
if _use_run_submitter_client is None and _use_pf_client is False:
return "run_submitter"
if _use_run_submitter_client == False and _use_pf_client is None:
if _use_run_submitter_client is False and _use_pf_client is None:
return "pf_client"

assert False, "This should be impossible"
# Should be unreachable
return "run_submitter"

client_type: Literal["run_submitter", "pf_client", "code_client"] = get_client_type(kwargs)

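A condensed sketch of the selection rules in the hunk above, written as a standalone helper so the tri-state flag combinations are explicit. This mirrors, rather than replaces, the in-tree `get_client_type`; the function name is illustrative:

```python
from typing import Literal, Optional


def select_client(
    use_run_submitter: Optional[bool],
    use_pf: Optional[bool],
) -> Literal["run_submitter", "pf_client", "code_client"]:
    """Mirror of the tri-state selection described above (sketch, not the SDK API)."""
    if use_run_submitter and use_pf:
        raise ValueError("Only one of the two client flags may be True.")
    if use_run_submitter is False and use_pf is False:
        return "code_client"
    if use_run_submitter:
        return "run_submitter"
    if use_pf:
        return "pf_client"
    if use_run_submitter is False and use_pf is None:
        return "pf_client"
    # Covers (None, None) and (None, False): the default client.
    return "run_submitter"


assert select_client(None, None) == "run_submitter"
assert select_client(False, False) == "code_client"
assert select_client(None, False) == "run_submitter"
assert select_client(False, None) == "pf_client"
```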
@@ -12,17 +12,22 @@

class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
"""
Evaluates coherence score for a given query and response or a multi-turn conversation, including reasoning.
Evaluates coherence for a given query and response or a multi-turn
conversation, including reasoning.

The coherence measure assesses the ability of the language model to generate text that reads naturally,
flows smoothly, and resembles human-like language in its responses. Use it when assessing the readability
and user-friendliness of a model's generated responses in real-world applications.
The coherence measure assesses the model's ability to generate text that
reads naturally, flows smoothly, and resembles human-like language. Use it
when assessing the readability and user-friendliness of responses.

:param model_config: Configuration for the Azure OpenAI model.
:type model_config: Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
:type model_config:
Union[~azure.ai.evaluation.AzureOpenAIModelConfiguration,
~azure.ai.evaluation.OpenAIModelConfiguration]
:param threshold: The threshold for the coherence evaluator. Default is 3.
:type threshold: int
:keyword is_reasoning_model: (Preview) When True, the chat completions
configuration is adjusted for reasoning models.
:type is_reasoning_model: bool

.. admonition:: Example:

@@ -31,7 +36,8 @@ class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
:end-before: [END coherence_evaluator]
:language: python
:dedent: 8
:caption: Initialize and call CoherenceEvaluator using azure.ai.evaluation.AzureAIProject
:caption: Initialize and call CoherenceEvaluator using
azure.ai.evaluation.AzureAIProject

.. admonition:: Example using Azure AI Project URL:

@@ -40,7 +46,8 @@ class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
:end-before: [END coherence_evaluator]
:language: python
:dedent: 8
:caption: Initialize and call CoherenceEvaluator using Azure AI Project URL in following format
:caption: Initialize and call CoherenceEvaluator using Azure AI
Project URL in the following format
https://{resource_name}.services.ai.azure.com/api/projects/{project_name}

.. admonition:: Example with Threshold:
@@ -50,23 +57,24 @@ class CoherenceEvaluator(PromptyEvaluatorBase[Union[str, float]]):
:end-before: [END threshold_coherence_evaluator]
:language: python
:dedent: 8
:caption: Initialize with threshold and call a CoherenceEvaluator with a query and response.
:caption: Initialize with threshold and call a CoherenceEvaluator
with a query and response.

.. note::

To align with our support of a diverse set of models, an output key without the `gpt_` prefix has been added.
To maintain backwards compatibility, the old key with the `gpt_` prefix is still be present in the output;
however, it is recommended to use the new key moving forward as the old key will be deprecated in the future.
To align with support of diverse models, an output key without the
`gpt_` prefix has been added. The old key with the `gpt_` prefix is
still present for compatibility; however, it will be deprecated.
"""

_PROMPTY_FILE = "coherence.prompty"
_RESULT_KEY = "coherence"

id = "azureai://built-in/evaluators/coherence"
"""Evaluator identifier, experimental and to be used only with evaluation in cloud."""
"""Evaluator identifier, experimental to be used only with cloud evaluation"""

@override
def __init__(self, model_config, *, threshold=3, credential=None):
def __init__(self, model_config, *, threshold=3, credential=None, **kwargs):
current_dir = os.path.dirname(__file__)
prompty_path = os.path.join(current_dir, self._PROMPTY_FILE)
self._threshold = threshold
@@ -78,6 +86,7 @@ def __init__(self, model_config, *, threshold=3, credential=None):
threshold=threshold,
credential=credential,
_higher_is_better=self._higher_is_better,
**kwargs,
)

@overload
Expand Down Expand Up @@ -105,9 +114,11 @@ def __call__(
) -> Dict[str, Union[float, Dict[str, List[Union[str, float]]]]]:
"""Evaluate coherence for a conversation

:keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
key "messages", and potentially a global context under the key "context". Conversation turns are expected
to be dictionaries with keys "content", "role", and possibly "context".
:keyword conversation: The conversation to evaluate. Expected to
contain a list of conversation turns under the key "messages",
and optionally a global context under the key "context". Turns are
dictionaries with keys "content", "role", and possibly
"context".
:paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
:return: The coherence score.
:rtype: Dict[str, Union[float, Dict[str, List[float]]]]
@@ -119,19 +130,22 @@ def __call__(  # pylint: disable=docstring-missing-param
*args,
**kwargs,
):
"""Evaluate coherence. Accepts either a query and response for a single evaluation,
or a conversation for a potentially multi-turn evaluation. If the conversation has more than one pair of
turns, the evaluator will aggregate the results of each turn.
"""Evaluate coherence.

Accepts a query/response for a single evaluation, or a conversation
for a multi-turn evaluation. If the conversation has more than one
pair of turns, results are aggregated.

:keyword query: The query to be evaluated.
:paramtype query: str
:keyword response: The response to be evaluated.
:paramtype response: Optional[str]
:keyword conversation: The conversation to evaluate. Expected to contain a list of conversation turns under the
key "messages". Conversation turns are expected
to be dictionaries with keys "content" and "role".
:keyword conversation: The conversation to evaluate. Expected to
contain conversation turns under the key "messages" as
dictionaries with keys "content" and "role".
:paramtype conversation: Optional[~azure.ai.evaluation.Conversation]
:return: The coherence score.
:rtype: Union[Dict[str, float], Dict[str, Union[float, Dict[str, List[float]]]]]
:rtype: Union[Dict[str, float], Dict[str, Union[float, Dict[str,
List[float]]]]]
"""
return super().__call__(*args, **kwargs)
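For reference, a hedged sketch of the conversation-mode call shape described in the docstring above, with per-turn results aggregated; the model configuration values are placeholders:

```python
# Sketch: conversation-mode call described in the CoherenceEvaluator docstring.
from azure.ai.evaluation import AzureOpenAIModelConfiguration, CoherenceEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    azure_deployment="<your-deployment>",
    api_key="<your-api-key>",
)
coherence = CoherenceEvaluator(model_config=model_config, threshold=3)

conversation = {
    "messages": [
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Open Settings > Security, then choose 'Reset password'."},
        {"role": "user", "content": "And if I no longer have access to my email?"},
        {"role": "assistant", "content": "Contact support to verify your identity and recover the account."},
    ]
}

# With more than one user/assistant pair, per-turn scores are aggregated.
result = coherence(conversation=conversation)
print(result["coherence"])
```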
@@ -4,7 +4,9 @@
from concurrent.futures import as_completed
from typing import TypeVar, Dict, List

from azure.ai.evaluation._legacy._adapters.tracing import ThreadPoolExecutorWithContext as ThreadPoolExecutor
from azure.ai.evaluation._legacy._adapters.tracing import (
ThreadPoolExecutorWithContext as ThreadPoolExecutor,
)
from typing_extensions import override

from azure.ai.evaluation._evaluators._common import EvaluatorBase
@@ -3,23 +3,36 @@
# ---------------------------------------------------------

import math
import os
import re
import os
from typing import Dict, Optional, TypeVar, Union
from typing import Dict, TypeVar, Union, Optional

if os.getenv("AI_EVALS_USE_PF_PROMPTY", "false").lower() == "true":
from promptflow.core._flow import AsyncPrompty
else:
from azure.ai.evaluation._legacy.prompty import AsyncPrompty
from azure.ai.evaluation._legacy.prompty import (
AsyncPrompty as _LegacyAsyncPrompty,
)
from azure.ai.evaluation._legacy._adapters._flows import AsyncPrompty
from azure.core.credentials import TokenCredential
from typing_extensions import override

from azure.core.credentials import TokenCredential
from azure.ai.evaluation._common.constants import PROMPT_BASED_REASON_EVALUATORS
from azure.ai.evaluation._constants import EVALUATION_PASS_FAIL_MAPPING
from azure.ai.evaluation._exceptions import EvaluationException, ErrorBlame, ErrorCategory, ErrorTarget
from ..._common.utils import construct_prompty_model_config, validate_model_config, parse_quality_evaluator_reason_score
from azure.ai.evaluation._exceptions import (
EvaluationException,
ErrorBlame,
ErrorCategory,
ErrorTarget,
)
from ..._common.utils import (
construct_prompty_model_config,
validate_model_config,
parse_quality_evaluator_reason_score,
)
from . import EvaluatorBase

_PFAsyncPrompty = None # type: ignore[assignment]


try:
from ..._user_agent import UserAgentSingleton
except ImportError:
@@ -73,7 +86,11 @@ def __init__(
self._prompty_file = prompty_file
self._threshold = threshold
self._higher_is_better = _higher_is_better
super().__init__(eval_last_turn=eval_last_turn, threshold=threshold, _higher_is_better=_higher_is_better)
super().__init__(
eval_last_turn=eval_last_turn,
threshold=threshold,
_higher_is_better=_higher_is_better,
)

subclass_name = self.__class__.__name__
user_agent = f"{UserAgentSingleton().value} (type=evaluator subtype={subclass_name})"