It's important to compare how closely the textual response generated by your AI system matches the response you would expect, typically called the "ground truth". Use an LLM-judge metric like SimilarityEvaluator to focus on the semantic similarity between the generated response and the ground truth, or use metrics from the field of natural language processing (NLP), including F1 score, BLEU, GLEU, ROUGE, and METEOR, to focus on the overlap of tokens or n-grams between the two.
Model configuration for AI-assisted evaluators
The AI-assisted evaluators in the following code snippets use this model configuration for the LLM-judge:
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv

# Load environment variables for the Azure OpenAI judge model.
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)
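This snippet assumes a .env file, loaded by load_dotenv(), that defines AZURE_ENDPOINT, AZURE_API_KEY, AZURE_DEPLOYMENT_NAME, and AZURE_API_VERSION.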
Tip
We recommend using o3-mini for a balance of reasoning capability and cost efficiency.
Similarity
SimilarityEvaluator
measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared with other text-similarity metrics that require ground truth, this metric focuses on the semantics of a response (instead of simple overlap in tokens or n-grams) and also considers the broader context of a query.
Similarity example
from azure.ai.evaluation import SimilarityEvaluator
similarity = SimilarityEvaluator(model_config=model_config, threshold=3)
similarity(
    query="Was Marie Curie born in Paris?",
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
Similarity output
The numerical score is on a Likert scale (integer 1 to 5), and a higher score means a higher degree of similarity. Given a numerical threshold (defaulting to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
{
    "similarity": 4.0,
    "gpt_similarity": 4.0,
    "similarity_result": "pass",
    "similarity_threshold": 3
}
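Beyond single calls, any of these evaluators can run over a whole dataset with the SDK's evaluate() helper. A minimal sketch, assuming a local data.jsonl (a hypothetical file) whose rows contain query, response, and ground_truth fields that map onto the evaluator's inputs:

from azure.ai.evaluation import evaluate, SimilarityEvaluator

# Reuses model_config from the model configuration section above.
similarity = SimilarityEvaluator(model_config=model_config, threshold=3)

result = evaluate(
    data="data.jsonl",  # hypothetical local dataset, one JSON object per line
    evaluators={"similarity": similarity},
)
print(result["metrics"])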
F1 score
F1ScoreEvaluator
measures similarity by the tokens shared between the generated text and the ground truth, accounting for both precision and recall. Precision is the ratio of shared words to the total number of words in the generated response, and recall is the ratio of shared words to the total number of words in the ground truth. The F1 score is the harmonic mean of precision and recall.
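The evaluator computes this internally; purely for intuition, here's a minimal sketch of token-level precision, recall, and F1 using naive whitespace tokenization (the SDK's own tokenization and normalization may differ, so scores won't match exactly):

from collections import Counter

def f1_sketch(response: str, ground_truth: str) -> float:
    # Naive whitespace tokenization; the SDK normalizes text differently.
    resp_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Shared tokens counted as a multiset intersection.
    shared = sum((Counter(resp_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(resp_tokens)   # shared / words in generation
    recall = shared / len(truth_tokens)     # shared / words in ground truth
    return 2 * precision * recall / (precision + recall)  # harmonic mean

print(f1_sketch(
    "According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    "Marie Curie was born in Warsaw."
))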
F1 score example
from azure.ai.evaluation import F1ScoreEvaluator
f1_score = F1ScoreEvaluator(threshold=0.5)
f1_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
F1 score output
The numerical score is a float from 0 to 1, and a higher score is better. Given a numerical threshold (defaulting to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
{
    "f1_score": 0.631578947368421,
    "f1_result": "pass",
    "f1_threshold": 0.5
}
BLEU score
BleuScoreEvaluator
computes the BLEU (Bilingual Evaluation Understudy) score, commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text by the n-grams they share.
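The evaluator computes BLEU internally; as a rough illustration of the underlying n-gram overlap, here's a minimal sketch using NLTK's sentence-level BLEU (assuming the nltk package is installed; the SDK's exact computation and smoothing may differ):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

response = "According to wikipedia, Marie Curie was not born in Paris but in Warsaw."
ground_truth = "Marie Curie was born in Warsaw."

# BLEU scores n-gram precision (up to 4-grams by default) of the candidate
# against the reference; smoothing avoids zero scores on short sentences.
score = sentence_bleu(
    [ground_truth.lower().split()],  # list of tokenized references
    response.lower().split(),        # tokenized candidate
    smoothing_function=SmoothingFunction().method4,
)
print(score)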
BLEU example
from azure.ai.evaluation import BleuScoreEvaluator
bleu_score = BleuScoreEvaluator(threshold=0.3)
bleu_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
BLEU output
The numerical score is a float from 0 to 1, and a higher score is better. Given a numerical threshold (defaulting to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
{
    "bleu_score": 0.1550967560878879,
    "bleu_result": "fail",
    "bleu_threshold": 0.3
}
GLEU score
GleuScoreEvaluator
computes the GLEU (Google-BLEU) score. Like the BLEU score, it measures similarity by the n-grams shared between the generated text and the ground truth, accounting for both precision and recall. It addresses drawbacks of the BLEU score for single-sentence evaluation by using a per-sentence reward objective.
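For intuition, NLTK also ships a sentence-level GLEU implementation; a minimal sketch, assuming the nltk package is installed (the SDK's internals may differ):

from nltk.translate.gleu_score import sentence_gleu

response = "According to wikipedia, Marie Curie was not born in Paris but in Warsaw."
ground_truth = "Marie Curie was born in Warsaw."

# Per sentence, GLEU takes the minimum of n-gram precision and n-gram
# recall, which behaves better than corpus-level BLEU on single sentences.
score = sentence_gleu(
    [ground_truth.lower().split()],  # list of tokenized references
    response.lower().split(),        # tokenized candidate
)
print(score)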
GLEU score example
from azure.ai.evaluation import GleuScoreEvaluator
gleu_score = GleuScoreEvaluator(threshold=0.2)
gleu_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
GLEU score output
The numerical score is a float from 0 to 1, and a higher score is better. Given a numerical threshold (defaulting to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
{
    "gleu_score": 0.25925925925925924,
    "gleu_result": "pass",
    "gleu_threshold": 0.2
}
ROUGE score
RougeScoreEvaluator
computes ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between the generated text and the reference text, with recall-oriented measures that assess how well the generated text covers the reference. The evaluator reports precision, recall, and F1 score.
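As an illustration of what ROUGE-L (longest common subsequence) measures, here's a minimal sketch using the open-source rouge-score package (an assumption on our part; the SDK computes these scores internally):

from rouge_score import rouge_scorer

response = "According to wikipedia, Marie Curie was not born in Paris but in Warsaw."
ground_truth = "Marie Curie was born in Warsaw."

# ROUGE-L scores the longest common subsequence between the two texts,
# reporting precision, recall, and F-measure.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(target=ground_truth, prediction=response)
print(scores["rougeL"])  # Score(precision=..., recall=..., fmeasure=...)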
ROUGE score example
from azure.ai.evaluation import RougeScoreEvaluator, RougeType
rouge = RougeScoreEvaluator(
    rouge_type=RougeType.ROUGE_L,
    precision_threshold=0.6,
    recall_threshold=0.5,
    f1_score_threshold=0.55,
)
rouge(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
ROUGE score output
Each numerical score is a float from 0 to 1, and a higher score is better. Precision, recall, and F1 score are each compared against their own numerical threshold (defaulting to 0.5), and we output "pass" if the score >= threshold, or "fail" otherwise.
{
    "rouge_precision": 0.46153846153846156,
    "rouge_recall": 1.0,
    "rouge_f1_score": 0.631578947368421,
    "rouge_precision_result": "fail",
    "rouge_recall_result": "pass",
    "rouge_f1_score_result": "pass",
    "rouge_precision_threshold": 0.6,
    "rouge_recall_threshold": 0.5,
    "rouge_f1_score_threshold": 0.55
}
METEOR score
MeteorScoreEvaluator
measures similarity by the n-grams shared between the generated text and the ground truth, similar to the BLEU score, accounting for precision and recall. It addresses limitations of other metrics, such as the BLEU score, by considering synonyms, stemming, and paraphrasing for content alignment.
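For intuition, NLTK provides a METEOR implementation that performs the stemming and WordNet-synonym matching described above; a minimal sketch, assuming nltk and its WordNet data are installed (the SDK's internals may differ):

import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR needs WordNet for synonyms

response = "According to wikipedia, Marie Curie was not born in Paris but in Warsaw."
ground_truth = "Marie Curie was born in Warsaw."

# METEOR aligns unigrams between candidate and reference, allowing exact,
# stemmed, and synonym matches, then applies a fragmentation penalty.
score = meteor_score(
    [ground_truth.lower().split()],  # list of tokenized references
    response.lower().split(),        # tokenized candidate
)
print(score)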
METEOR score example
from azure.ai.evaluation import MeteorScoreEvaluator
meteor_score = MeteorScoreEvaluator(threshold=0.9)
meteor_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
METEOR score output
The numerical score is a float from 0 to 1, and a higher score is better. Given a numerical threshold (defaulting to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
{
    "meteor_score": 0.8621140763997908,
    "meteor_result": "fail",
    "meteor_threshold": 0.9
}