Important
Items marked "(preview)" in this article are currently in public preview. This preview is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
AI agents are powerful productivity assistants that create workflows for business needs. However, their complex interaction patterns make them challenging to observe. This article shows how to run built-in evaluators locally on simple agent data or agent messages.
To build production-ready agentic applications and enable observability and transparency, developers need tools to assess not only the final output of an agent's workflow, but also the quality and efficiency of the workflow itself. For example, consider a typical agentic workflow:
An event such as the user query "weather tomorrow" triggers the workflow, which then executes multiple steps: reasoning over user intent, calling tools, and using retrieval-augmented generation to produce a final response. In this process, it's critical to evaluate each step of the workflow, along with the quality and safety of the final output. Specifically, we formulate these aspects of evaluation into the following evaluators for agents:
You can also use our comprehensive suite of built-in evaluators to assess other quality and safety aspects of your agentic workflows. In general, agents emit agent messages, and transforming them into the right evaluation data for our evaluators can be a nontrivial task. If you build your agent using Azure AI Agent Service, you can seamlessly evaluate it through our converter support. If you build your agent outside of Azure AI Agent Service, you can still use our evaluators as appropriate for your agentic workflow by parsing your agent messages into the required data formats. See Evaluating other agents for an example.
Getting started
First, install the evaluators package from the Azure AI evaluation SDK:

```shell
pip install azure-ai-evaluation
```
Evaluate Azure AI agents
If you use Azure AI Agent Service, however, you can seamlessly evaluate your agents through our converter support for Azure AI agent threads and runs. We support this list of evaluators for Azure AI agent messages through the converter:

- Quality: `IntentResolution`, `ToolCallAccuracy`, `TaskAdherence`, `Relevance`, `Coherence`, `Fluency`
- Safety: `CodeVulnerabilities`, `Violence`, `Self-harm`, `Sexual`, `HateUnfairness`, `IndirectAttack`, `ProtectedMaterials`
Note
`ToolCallAccuracyEvaluator` supports only function tool evaluation for Azure AI agents; built-in tools aren't supported. The agent messages must include at least one function tool call to be evaluated.
Here's an end-to-end example of building and seamlessly evaluating an Azure AI agent. Separately from evaluation, Azure AI Foundry Agent Service requires `pip install azure-ai-projects azure-identity`, an Azure AI project connection string, and a supported model.
Create an agent thread and run
```python
import os, json
import pandas as pd
from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential
from typing import Set, Callable, Any
from azure.ai.projects.models import FunctionTool, ToolSet
from dotenv import load_dotenv

load_dotenv()

# Define some custom python function
def fetch_weather(___location: str) -> str:
    """
    Fetches the weather information for the specified ___location.
    :param ___location (str): The ___location to fetch weather for.
    :return: Weather information as a JSON string.
    :rtype: str
    """
    # In a real-world scenario, you'd integrate with a weather API.
    # Here, we'll mock the response.
    mock_weather_data = {"Seattle": "Sunny, 25°C", "London": "Cloudy, 18°C", "Tokyo": "Rainy, 22°C"}
    weather = mock_weather_data.get(___location, "Weather data not available for this ___location.")
    weather_json = json.dumps({"weather": weather})
    return weather_json

user_functions: Set[Callable[..., Any]] = {
    fetch_weather,
}

# Adding Tools to be used by Agent
functions = FunctionTool(user_functions)
toolset = ToolSet()
toolset.add(functions)

# Create the agent
AGENT_NAME = "Seattle Tourist Assistant"

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["PROJECT_CONNECTION_STRING"],
)

agent = project_client.agents.create_agent(
    model=os.environ["MODEL_DEPLOYMENT_NAME"],
    name=AGENT_NAME,
    instructions="You are a helpful assistant",
    toolset=toolset,
)
print(f"Created agent, ID: {agent.id}")

thread = project_client.agents.create_thread()
print(f"Created thread, ID: {thread.id}")

# Create message to thread
MESSAGE = "Can you fetch me the weather in Seattle?"
message = project_client.agents.create_message(
    thread_id=thread.id,
    role="user",
    content=MESSAGE,
)
print(f"Created message, ID: {message.id}")

run = project_client.agents.create_and_process_run(thread_id=thread.id, agent_id=agent.id)
print(f"Run finished with status: {run.status}")

if run.status == "failed":
    print(f"Run failed: {run.last_error}")

print(f"Run ID: {run.id}")

# display messages
for message in project_client.agents.list_messages(thread.id, order="asc").data:
    print(f"Role: {message.role}")
    print(f"Content: {message.content[0].text.value}")
    print("-" * 40)
```
Evaluate a single agent run
After the agent run is created, you can easily use our converter to transform the Azure AI agent thread data into the required evaluation data that the evaluators can understand.
```python
import json, os
from azure.ai.evaluation import AIAgentConverter, IntentResolutionEvaluator

# Initialize the converter for Azure AI agents
converter = AIAgentConverter(project_client)

# Specify the thread and run id
thread_id = thread.id
run_id = run.id

converted_data = converter.convert(thread_id, run_id)
```
And that's it! You don't need to read the input requirements of each evaluator or do any work to parse them. All you need to do is select your evaluators and call them on this single run. For model choice, we recommend a strong reasoning model such as `o3-mini` or models released afterwards. We set up a list of quality and safety evaluators in `quality_evaluators` and `safety_evaluators`, and reference them later when evaluating multiple agent runs or threads.
```python
# specific to agentic workflows
from azure.ai.evaluation import IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator
# other quality as well as risk and safety metrics
from azure.ai.evaluation import RelevanceEvaluator, CoherenceEvaluator, CodeVulnerabilityEvaluator, ContentSafetyEvaluator, IndirectAttackEvaluator, FluencyEvaluator
from azure.ai.projects.models import ConnectionType
from azure.identity import DefaultAzureCredential
import os
from dotenv import load_dotenv

load_dotenv()

model_config = project_client.connections.get_default(
    connection_type=ConnectionType.AZURE_OPEN_AI,
    include_credentials=True) \
    .to_evaluator_model_config(
        deployment_name="o3-mini",
        api_version="2023-05-15",
        include_credentials=True
    )

quality_evaluators = {evaluator.__name__: evaluator(model_config=model_config) for evaluator in [IntentResolutionEvaluator, TaskAdherenceEvaluator, ToolCallAccuracyEvaluator, CoherenceEvaluator, FluencyEvaluator, RelevanceEvaluator]}

## Using Azure AI Foundry Hub
azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("AZURE_RESOURCE_GROUP"),
    "project_name": os.environ.get("AZURE_PROJECT_NAME"),
}
## Using Azure AI Foundry Development Platform, example: AZURE_AI_PROJECT=https://your-account.services.ai.azure.com/api/projects/your-project
azure_ai_project = os.environ.get("AZURE_AI_PROJECT")

safety_evaluators = {evaluator.__name__: evaluator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()) for evaluator in [ContentSafetyEvaluator, IndirectAttackEvaluator, CodeVulnerabilityEvaluator]}

# reference the quality and safety evaluator list above
quality_and_safety_evaluators = {**quality_evaluators, **safety_evaluators}

for name, evaluator in quality_and_safety_evaluators.items():
    try:
        result = evaluator(**converted_data)
        print(name)
        print(json.dumps(result, indent=4))
    except Exception as e:
        # Note: if there is no tool call to evaluate in the run history,
        # ToolCallAccuracyEvaluator will raise an error
        print(f"Note: {name} failed to evaluate this run: {e}")
```
Output format
The result of the AI-assisted quality evaluators for a query-and-response pair is a dictionary containing:

- `{metric_name}`: provides a numerical score, on a likert scale (integer 1 to 5) or a float between 0 and 1.
- `{metric_name}_label`: provides a binary label, if the metric naturally outputs a binary score.
- `{metric_name}_reason`: explains why a certain score or label was given for each data point.

To further improve intelligibility, all evaluators accept a binary threshold (unless they already output binary results) and output two new keys. The binarization threshold is set to a default value that you can override. The two new keys are:

- `{metric_name}_result`: a "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: the numerical binarization threshold, set by default or by the user.
- `additional_details`: contains debugging information about the quality of a single agent run.
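As a quick illustration of how these keys relate (using a hypothetical result dictionary, not actual evaluator output), the binarization can be reproduced like this:

```python
# Hypothetical evaluator result; real results come from the SDK evaluators.
result = {
    "intent_resolution": 4.0,
    "intent_resolution_threshold": 3,
}

# Binarize the score against the threshold. We assume "meets or exceeds the
# threshold" counts as a pass; the SDK's exact boundary behavior may differ.
metric = "intent_resolution"
result[f"{metric}_result"] = (
    "pass" if result[metric] >= result[f"{metric}_threshold"] else "fail"
)

print(result[f"{metric}_result"])  # → pass
```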
Example outputs from some of these evaluators:
```
{
    "intent_resolution": 5.0, # likert scale: 1-5 integer
    "intent_resolution_result": "pass", # pass because 5 > 3 the threshold
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The assistant correctly understood the user's request to fetch the weather in Seattle. It used the appropriate tool to get the weather information and provided a clear and accurate response with the current weather conditions in Seattle. The response fully resolves the user's query with all necessary information.",
    "additional_details": {
        "conversation_has_intent": true,
        "agent_perceived_intent": "fetch the weather in Seattle",
        "actual_user_intent": "fetch the weather in Seattle",
        "correct_intent_detected": true,
        "intent_resolved": true
    }
}
```

```
{
    "task_adherence": 5.0, # likert scale: 1-5 integer
    "task_adherence_result": "pass", # pass because 5 > 3 the threshold
    "task_adherence_threshold": 3,
    "task_adherence_reason": "The response accurately follows the instructions, fetches the correct weather information, and relays it back to the user without any errors or omissions."
}
```

```
{
    "tool_call_accuracy": 1.0, # this is the average of all correct tool calls (or passing rate)
    "tool_call_accuracy_result": "pass", # pass because 1.0 > 0.8 the threshold
    "tool_call_accuracy_threshold": 0.8,
    "per_tool_call_details": [
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The tool call is directly relevant to the user's query, uses the correct parameter, and the parameter value is correctly extracted from the conversation.",
            "tool_call_id": "call_2svVc9rNxMT9F50DuEf1XExx"
        }
    ]
}
```
Evaluate multiple agent runs or threads
To evaluate multiple agent runs or threads, we recommend using the batch `evaluate()` API for asynchronous evaluation. First, convert your agent thread data into a file through our converter support:
```python
import os
import json
from azure.ai.evaluation import AIAgentConverter

# Initialize the converter
converter = AIAgentConverter(project_client)

# Specify a file path to save agent output (which is evaluation input data)
filename = os.path.join(os.getcwd(), "evaluation_input_data.jsonl")

evaluation_data = converter.prepare_evaluation_data(thread_ids=thread_id, filename=filename)

print(f"Evaluation data saved to {filename}")
```
With the evaluation data prepared in one line of code, you can select the evaluators to assess agent quality and submit a batch evaluation run. Here we reference the same list of quality and safety evaluators, `quality_and_safety_evaluators`, from the Evaluate a single agent run section:
```python
import os
from dotenv import load_dotenv
load_dotenv()

# Batch evaluation API (local)
from azure.ai.evaluation import evaluate

response = evaluate(
    data=filename,
    evaluation_name="agent demo - batch run",
    evaluators=quality_and_safety_evaluators,
    # optionally, log your results to your Azure AI Foundry project for rich visualization
    azure_ai_project={
        "subscription_id": os.environ["AZURE_SUBSCRIPTION_ID"],
        "project_name": os.environ["PROJECT_NAME"],
        "resource_group_name": os.environ["RESOURCE_GROUP_NAME"],
    }
)

# Inspect the average scores at a high-level
print(response["metrics"])

# Use the URL to inspect the results on the UI
print(f'AI Foundry URL: {response.get("studio_url")}')
```
Following the URI, you're redirected to Foundry to view your evaluation results in your Azure AI project and debug your application. Using the reason fields and pass/fail results, you can easily assess the quality and safety performance of your applications. You can run and compare multiple runs to test for regressions or improvements.
With the Azure AI evaluation SDK client library, you can seamlessly evaluate your Azure AI agents through our converter support, which enables observability and transparency in agentic workflows.
Evaluating other agents
For agents outside of Azure AI Foundry Agent Service, you can still evaluate them by preparing the right data for the evaluators of your choice.
Agents typically emit messages to interact with a user or other agents. Our built-in evaluators can accept simple data types such as strings in `query`, `response`, and `ground_truth`. However, extracting these simple data types from agent messages can be a challenge, due to the complex interaction patterns of agents and differences between frameworks. For example, as mentioned, a single user query can trigger a long list of agent messages, typically with multiple tool calls invoked.
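As a rough illustration of what such parsing involves, the following is a simplified, hypothetical sketch (not the SDK's converter, which handles many more cases such as tool calls and runs) that flattens openai-style messages into plain strings:

```python
# A simplified, hypothetical parser: flatten openai-style agent messages
# into plain strings for evaluators that accept simple query/response text.

def last_text(messages: list, role: str) -> str:
    """Return the text content of the last message with the given role."""
    for message in reversed(messages):
        if message.get("role") != role:
            continue
        content = message.get("content")
        if isinstance(content, str):  # e.g. system messages may be plain strings
            return content
        # otherwise content is a list of typed parts; keep only the text parts
        return " ".join(part["text"] for part in content if part.get("type") == "text")
    return ""

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [{"type": "text", "text": "Weather in Seattle?"}]},
    {"role": "assistant", "content": [{"type": "text", "text": "Sunny, 25°C."}]},
]

query = last_text(messages, "user")
response = last_text(messages, "assistant")
print(query, "->", response)  # → Weather in Seattle? -> Sunny, 25°C.
```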
As illustrated in the example, we enabled agent message support specifically for the built-in evaluators `IntentResolution`, `ToolCallAccuracy`, and `TaskAdherence` to evaluate these aspects of agentic workflows. These evaluators take `tool_calls` or `tool_definitions` as parameters unique to agents.
| Evaluator | `query` | `response` | `tool_calls` | `tool_definitions` |
|---|---|---|---|---|
| `IntentResolutionEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | N/A | Optional: `list[ToolCall]` |
| `ToolCallAccuracyEvaluator` | Required: `Union[str, list[Message]]` | Optional: `Union[str, list[Message]]` | Optional: `Union[dict, list[ToolCall]]` | Required: `list[ToolDefinition]` |
| `TaskAdherenceEvaluator` | Required: `Union[str, list[Message]]` | Required: `Union[str, list[Message]]` | N/A | Optional: `list[ToolCall]` |
- `Message`: a `dict` with openai-style messages describing agent interactions with a user, where `query` must include a system message as the first message.
- `ToolCall`: a `dict` specifying a tool call invoked during the agent's interactions with a user.
- `ToolDefinition`: a `dict` describing the tools available to the agent.

For `ToolCallAccuracyEvaluator`, either `response` or `tool_calls` must be provided.
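To make these shapes concrete, here is a minimal, hypothetical instance of each (field names mirror the agent-message examples later in this article; real payloads typically carry more fields):

```python
# Minimal, hypothetical examples of the three dict shapes.

message = {
    "role": "user",
    "content": [{"type": "text", "text": "How is the weather in Seattle?"}],
}

tool_call = {
    "type": "tool_call",
    "tool_call_id": "call_001",  # hypothetical id
    "name": "fetch_weather",
    "arguments": {"___location": "Seattle"},
}

tool_definition = {
    "name": "fetch_weather",
    "description": "Fetches the weather information for the specified ___location.",
    "parameters": {
        "type": "object",
        "properties": {
            "___location": {"type": "string", "description": "The ___location to fetch weather for."}
        },
    },
}

# A tool call's name and arguments should line up with one of the definitions.
assert tool_call["name"] == tool_definition["name"]
assert set(tool_call["arguments"]) <= set(tool_definition["parameters"]["properties"])
```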
We demonstrate some examples of the two supported data formats: simple agent data and agent messages. However, due to the unique requirements of these evaluators, we recommend referring to the sample notebooks, which illustrate the possible input paths for each evaluator.
As with other built-in AI-assisted quality evaluators, `IntentResolutionEvaluator` and `TaskAdherenceEvaluator` output a likert score (integer 1-5; a higher score is better). `ToolCallAccuracyEvaluator` outputs the passing rate of all tool calls made (a float between 0 and 1) based on the user query. To further improve intelligibility, all evaluators accept a binary threshold and output two new keys. The binarization threshold is set to a default value that you can override. The two new keys are:

- `{metric_name}_result`: a "pass" or "fail" string based on the binarization threshold.
- `{metric_name}_threshold`: the numerical binarization threshold, set by default or by the user.
Simple agent data
In the simple agent data format, `query` and `response` are plain Python strings. For example:
```python
import os
import json
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import IntentResolutionEvaluator, ResponseCompletenessEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_deployment=os.environ["MODEL_DEPLOYMENT_NAME"],
)

intent_resolution_evaluator = IntentResolutionEvaluator(model_config)

# Evaluating query and response as strings
# A positive example. Intent is identified and understood and the response correctly resolves user intent
result = intent_resolution_evaluator(
    query="What are the opening hours of the Eiffel Tower?",
    response="Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM.",
)
print(json.dumps(result, indent=4))
```
Output (see Output format for more details):
```
{
    "intent_resolution": 5.0,
    "intent_resolution_result": "pass",
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The response provides the opening hours of the Eiffel Tower, which directly addresses the user's query. The information is clear, accurate, and complete, fully resolving the user's intent.",
    "additional_details": {
        "conversation_has_intent": true,
        "agent_perceived_intent": "inquire about the opening hours of the Eiffel Tower",
        "actual_user_intent": "inquire about the opening hours of the Eiffel Tower",
        "correct_intent_detected": true,
        "intent_resolved": true
    }
}
```
An example of `tool_calls` and `tool_definitions` for `ToolCallAccuracyEvaluator`:
```python
import json
from azure.ai.evaluation import ToolCallAccuracyEvaluator

query = "How is the weather in Seattle?"
tool_calls = [
    {
        "type": "tool_call",
        "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
        "name": "fetch_weather",
        "arguments": {
            "___location": "Seattle"
        }
    },
    {
        "type": "tool_call",
        "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ",
        "name": "fetch_weather",
        "arguments": {
            "___location": "London"
        }
    }
]

tool_definitions = [
    {
        "name": "fetch_weather",
        "description": "Fetches the weather information for the specified ___location.",
        "parameters": {
            "type": "object",
            "properties": {
                "___location": {
                    "type": "string",
                    "description": "The ___location to fetch weather for."
                }
            }
        }
    }
]

# instantiate the evaluator, reusing the model_config defined earlier
tool_call_accuracy = ToolCallAccuracyEvaluator(model_config=model_config)

response = tool_call_accuracy(query=query, tool_calls=tool_calls, tool_definitions=tool_definitions)
print(json.dumps(response, indent=4))
```
Output (see Output format for more details):
```
{
    "tool_call_accuracy": 0.5,
    "tool_call_accuracy_result": "fail",
    "tool_call_accuracy_threshold": 0.8,
    "per_tool_call_details": [
        {
            "tool_call_accurate": true,
            "tool_call_accurate_reason": "The TOOL CALL is directly relevant to the user's query, uses appropriate parameters, and the parameter values are correctly extracted from the conversation. It is likely to provide useful information to advance the conversation.",
            "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ"
        },
        {
            "tool_call_accurate": false,
            "tool_call_accurate_reason": "The TOOL CALL is not relevant to the user's query about the weather in Seattle and uses a parameter value that is not present or inferred from the conversation.",
            "tool_call_id": "call_CUdbkBfvVBla2YP3p24uhElJ"
        }
    ]
}
```
Agent messages
In the agent message format, `query` and `response` are lists of openai-style messages. Specifically, `query` carries the past agent-user interactions leading up to the last user query, and requires the system message (of the agent) at the top of the list; `response` carries the last message of the agent in response to the last user query. For example:
```python
import json

# user asked a question
query = [
    {
        "role": "system",
        "content": "You are a friendly and helpful customer service agent."
    },
    # past interactions omitted
    # ...
    {
        "createdAt": "2025-03-14T06:14:20Z",
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Hi, I need help with the last 2 orders on my account #888. Could you please update me on their status?"
            }
        ]
    }
]

# the agent emits multiple messages to fulfill the request
response = [
    {
        "createdAt": "2025-03-14T06:14:30Z",
        "run_id": "0",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "Hello! Let me quickly look up your account details."
            }
        ]
    },
    {
        "createdAt": "2025-03-14T06:14:35Z",
        "run_id": "0",
        "role": "assistant",
        "content": [
            {
                "type": "tool_call",
                "tool_call_id": "tool_call_20250310_001",
                "name": "get_orders",
                "arguments": {
                    "account_number": "888"
                }
            }
        ]
    },
    # many more messages omitted
    # ...
    # here is the agent's final response
    {
        "createdAt": "2025-03-14T06:15:05Z",
        "run_id": "0",
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "The order with ID 123 has been shipped and is expected to be delivered on March 15, 2025. However, the order with ID 124 is delayed and should now arrive by March 20, 2025. Is there anything else I can help you with?"
            }
        ]
    }
]

# An example of tool definitions available to the agent
tool_definitions = [
    {
        "name": "get_orders",
        "description": "Get the list of orders for a given account number.",
        "parameters": {
            "type": "object",
            "properties": {
                "account_number": {
                    "type": "string",
                    "description": "The account number to get the orders for."
                }
            }
        }
    },
    # other tool definitions omitted
    # ...
]

result = intent_resolution_evaluator(
    query=query,
    response=response,
    # optionally provide the tool definitions
    tool_definitions=tool_definitions
)
print(json.dumps(result, indent=4))
```
Output (see Output format for more details):
```
{
    "intent_resolution": 5.0,
    "intent_resolution_result": "pass",
    "intent_resolution_threshold": 3,
    "intent_resolution_reason": "The assistant correctly identified the user's intent to check the status of the last two orders on account #888, called the appropriate tool, and reported the shipping and delivery status of both orders, fully resolving the user's query.",
    "additional_details": {
        "conversation_has_intent": true,
        "agent_perceived_intent": "check the status of the last two orders on account #888",
        "actual_user_intent": "check the status of the last two orders on account #888",
        "correct_intent_detected": true,
        "intent_resolved": true
    }
}
```
This evaluation schema helps you parse your agent data outside of Azure AI Foundry Agent Service, so that you can use our evaluators to support observability into your agentic workflows.
Sample notebooks
Now you're ready to try a sample for each of these evaluators: