It's important to compare how closely the textual response generated by your AI system matches the response you would expect, typically called the "ground truth". Use an LLM-judge metric like SimilarityEvaluator to focus on the semantic similarity between the generated response and the ground truth, or use metrics from the field of natural language processing (NLP), including F1 score, BLEU, GLEU, ROUGE, and METEOR, to focus on the overlap of tokens or n-grams between the two.
Model configuration for AI-assisted evaluators
The AI-assisted evaluators in the following code snippets use this model configuration for the LLM-judge:
import os
from azure.ai.evaluation import AzureOpenAIModelConfiguration
from dotenv import load_dotenv

# Load environment variables for the Azure OpenAI judge model.
load_dotenv()

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ["AZURE_ENDPOINT"],
    api_key=os.environ.get("AZURE_API_KEY"),
    azure_deployment=os.environ.get("AZURE_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_API_VERSION"),
)
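This snippet assumes a .env file, loaded by load_dotenv(), that defines AZURE_ENDPOINT, AZURE_API_KEY, AZURE_DEPLOYMENT_NAME, and AZURE_API_VERSION.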
Tip
We recommend using o3-mini for a balance of reasoning capability and cost efficiency.
Similarity
SimilarityEvaluator
measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared with other text-similarity metrics that require ground truth, this metric focuses on the semantics of a response (instead of simple overlap in tokens or n-grams) and also considers the broader context of a query.
Similarity example
from azure.ai.evaluation import SimilarityEvaluator
similarity = SimilarityEvaluator(model_config=model_config, threshold=3)
similarity(
    query="Was Marie Curie born in Paris?",
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
Similarity output
The numerical score is on a Likert scale (integer 1 to 5), and a higher score means a higher degree of similarity. Given a numerical threshold (defaulting to 3), we also output "pass" if the score >= threshold, or "fail" otherwise. Using the reason field can help you understand why the score is high or low.
{
    "similarity": 4.0,
    "gpt_similarity": 4.0,
    "similarity_result": "pass",
    "similarity_threshold": 3
}
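Beyond single calls, any of these evaluators can run over a whole dataset with the SDK's evaluate() helper. A minimal sketch, assuming a local data.jsonl (a hypothetical file) whose rows contain query, response, and ground_truth fields that map onto the evaluator's inputs:

from azure.ai.evaluation import evaluate, SimilarityEvaluator

# Reuses model_config from the model configuration section above.
similarity = SimilarityEvaluator(model_config=model_config, threshold=3)

result = evaluate(
    data="data.jsonl",  # hypothetical local dataset, one JSON object per line
    evaluators={"similarity": similarity},
)
print(result["metrics"])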
F1 score
F1ScoreEvaluator
measures similarity by the tokens shared between the generated text and the ground truth, accounting for both precision and recall. Precision is the ratio of shared words to the total number of words in the generated response, and recall is the ratio of shared words to the total number of words in the ground truth. The F1 score is the harmonic mean of precision and recall.
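The evaluator computes this internally; purely for intuition, here's a minimal sketch of token-level precision, recall, and F1 using naive whitespace tokenization (the SDK's own tokenization and normalization may differ, so scores won't match exactly):

from collections import Counter

def f1_sketch(response: str, ground_truth: str) -> float:
    # Naive whitespace tokenization; the SDK normalizes text differently.
    resp_tokens = response.lower().split()
    truth_tokens = ground_truth.lower().split()
    # Shared tokens counted as a multiset intersection.
    shared = sum((Counter(resp_tokens) & Counter(truth_tokens)).values())
    if shared == 0:
        return 0.0
    precision = shared / len(resp_tokens)   # shared / words in generation
    recall = shared / len(truth_tokens)     # shared / words in ground truth
    return 2 * precision * recall / (precision + recall)  # harmonic mean

print(f1_sketch(
    "According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    "Marie Curie was born in Warsaw."
))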
F1 score example
from azure.ai.evaluation import F1ScoreEvaluator
f1_score = F1ScoreEvaluator(threshold=0.5)
f1_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
F1 score output
The numerical score is a float from 0 to 1, and a higher score is better. Given a numerical threshold (defaulting to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
{
    "f1_score": 0.631578947368421,
    "f1_result": "pass",
    "f1_threshold": 0.5
}
BLEU score
BleuScoreEvaluator
computes the BLEU (Bilingual Evaluation Understudy) score, commonly used in natural language processing (NLP) and machine translation. It measures how closely the generated text matches the reference text by the n-grams they share.
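The evaluator computes BLEU internally; as a rough illustration of the underlying n-gram overlap, here's a minimal sketch using NLTK's sentence-level BLEU (assuming the nltk package is installed; the SDK's exact computation and smoothing may differ):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

response = "According to wikipedia, Marie Curie was not born in Paris but in Warsaw."
ground_truth = "Marie Curie was born in Warsaw."

# BLEU scores n-gram precision (up to 4-grams by default) of the candidate
# against the reference; smoothing avoids zero scores on short sentences.
score = sentence_bleu(
    [ground_truth.lower().split()],  # list of tokenized references
    response.lower().split(),        # tokenized candidate
    smoothing_function=SmoothingFunction().method4,
)
print(score)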
BLEU example
from azure.ai.evaluation import BleuScoreEvaluator
bleu_score = BleuScoreEvaluator(threshold=0.3)
bleu_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
BLEU output
The numerical score is a float from 0 to 1, and a higher score is better. Given a numerical threshold (defaulting to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
{
    "bleu_score": 0.1550967560878879,
    "bleu_result": "fail",
    "bleu_threshold": 0.3
}
GLEU score
GleuScoreEvaluator
computes the GLEU (Google-BLEU) score. Like the BLEU score, it measures similarity by the n-grams shared between the generated text and the ground truth, accounting for both precision and recall. It addresses drawbacks of the BLEU score for single-sentence evaluation by using a per-sentence reward objective.
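For intuition, NLTK also ships a sentence-level GLEU implementation; a minimal sketch, assuming the nltk package is installed (the SDK's internals may differ):

from nltk.translate.gleu_score import sentence_gleu

response = "According to wikipedia, Marie Curie was not born in Paris but in Warsaw."
ground_truth = "Marie Curie was born in Warsaw."

# Per sentence, GLEU takes the minimum of n-gram precision and n-gram
# recall, which behaves better than corpus-level BLEU on single sentences.
score = sentence_gleu(
    [ground_truth.lower().split()],  # list of tokenized references
    response.lower().split(),        # tokenized candidate
)
print(score)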
GLEU score example
from azure.ai.evaluation import GleuScoreEvaluator
gleu_score = GleuScoreEvaluator(threshold=0.2)
gleu_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
GLEU score output
The numerical score is a float from 0 to 1, and a higher score is better. Given a numerical threshold (defaulting to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
{
    "gleu_score": 0.25925925925925924,
    "gleu_result": "pass",
    "gleu_threshold": 0.2
}
ROUGE score
RougeScoreEvaluator
computes ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores, a set of metrics used to evaluate automatic summarization and machine translation. It measures the overlap between the generated text and the reference text, with recall-oriented measures that assess how well the generated text covers the reference. The evaluator reports precision, recall, and F1 score.
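As an illustration of what ROUGE-L (longest common subsequence) measures, here's a minimal sketch using the open-source rouge-score package (an assumption on our part; the SDK computes these scores internally):

from rouge_score import rouge_scorer

response = "According to wikipedia, Marie Curie was not born in Paris but in Warsaw."
ground_truth = "Marie Curie was born in Warsaw."

# ROUGE-L scores the longest common subsequence between the two texts,
# reporting precision, recall, and F-measure.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = scorer.score(target=ground_truth, prediction=response)
print(scores["rougeL"])  # Score(precision=..., recall=..., fmeasure=...)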
ROUGE score example
from azure.ai.evaluation import RougeScoreEvaluator, RougeType
rouge = RougeScoreEvaluator(
    rouge_type=RougeType.ROUGE_L,
    precision_threshold=0.6,
    recall_threshold=0.5,
    f1_score_threshold=0.55,
)
rouge(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
ROUGE score output
Each numerical score is a float from 0 to 1, and a higher score is better. Precision, recall, and F1 score are each compared against their own numerical threshold (defaulting to 0.5), and we output "pass" if the score >= threshold, or "fail" otherwise.
{
    "rouge_precision": 0.46153846153846156,
    "rouge_recall": 1.0,
    "rouge_f1_score": 0.631578947368421,
    "rouge_precision_result": "fail",
    "rouge_recall_result": "pass",
    "rouge_f1_score_result": "pass",
    "rouge_precision_threshold": 0.6,
    "rouge_recall_threshold": 0.5,
    "rouge_f1_score_threshold": 0.55
}
METEOR score
MeteorScoreEvaluator
measures similarity by the n-grams shared between the generated text and the ground truth, similar to the BLEU score, accounting for precision and recall. It addresses limitations of other metrics, such as the BLEU score, by considering synonyms, stemming, and paraphrasing for content alignment.
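For intuition, NLTK provides a METEOR implementation that performs the stemming and WordNet-synonym matching described above; a minimal sketch, assuming nltk and its WordNet data are installed (the SDK's internals may differ):

import nltk
from nltk.translate.meteor_score import meteor_score

nltk.download("wordnet", quiet=True)  # METEOR needs WordNet for synonyms

response = "According to wikipedia, Marie Curie was not born in Paris but in Warsaw."
ground_truth = "Marie Curie was born in Warsaw."

# METEOR aligns unigrams between candidate and reference, allowing exact,
# stemmed, and synonym matches, then applies a fragmentation penalty.
score = meteor_score(
    [ground_truth.lower().split()],  # list of tokenized references
    response.lower().split(),        # tokenized candidate
)
print(score)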
METEOR score example
from azure.ai.evaluation import MeteorScoreEvaluator
meteor_score = MeteorScoreEvaluator(threshold=0.9)
meteor_score(
    response="According to wikipedia, Marie Curie was not born in Paris but in Warsaw.",
    ground_truth="Marie Curie was born in Warsaw."
)
METEOR score output
The numerical score is a float from 0 to 1, and a higher score is better. Given a numerical threshold (defaulting to 0.5), we also output "pass" if the score >= threshold, or "fail" otherwise.
{
    "meteor_score": 0.8621140763997908,
    "meteor_result": "fail",
    "meteor_threshold": 0.9
}