
Implementation:Run llama Llama index SemanticSimilarityEvaluator

From Leeroopedia
Knowledge Sources
Domains: Evaluation, Similarity
Last Updated: 2026-02-11 19:00 GMT

Overview

Evaluates the quality of a generated response by computing the embedding similarity between the response and a reference answer, without requiring an LLM judge.

Description

The SemanticSimilarityEvaluator is a concrete implementation of BaseEvaluator that measures response quality by comparing the semantic similarity of the generated response to a known reference answer. It is inspired by the paper "Semantic Answer Similarity for Evaluating Question Answering Models" (https://arxiv.org/pdf/2108.06130.pdf).

The evaluator works as follows:

  1. It embeds both the response and reference strings using the configured embedding model (BaseEmbedding).
  2. It computes a similarity score between the two embedding vectors using a configurable similarity function.
  3. It determines a passing result by checking if the similarity score meets or exceeds the similarity_threshold.
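The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical stand-in for a real embedding model (it counts words from a tiny fixed vocabulary), and `evaluate` is an illustrative helper, not the library's method. Cosine similarity is assumed as the similarity function.

```python
import math
from collections import Counter


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: word counts over a tiny vocabulary."""
    vocab = ["paris", "capital", "france", "berlin", "germany"]
    words = Counter(text.lower().replace(".", "").split())
    return [float(words[w]) for w in vocab]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate(response: str, reference: str, threshold: float = 0.8) -> dict:
    # Step 1: embed both strings
    response_vec = toy_embed(response)
    reference_vec = toy_embed(reference)
    # Step 2: compute the similarity score
    score = cosine_similarity(response_vec, reference_vec)
    # Step 3: pass if the score meets or exceeds the threshold
    return {"score": score, "passing": score >= threshold}


match = evaluate("Paris is the capital of France.", "The capital city of France is Paris.")
mismatch = evaluate("Berlin is the capital of Germany.", "The capital city of France is Paris.")
print(match["passing"], mismatch["passing"])
```

With a real embedding model the scores would differ, but the embed, score, and threshold flow stays the same.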

The similarity function can be configured in two ways:

  • By providing a similarity_mode (a SimilarityMode enum value) which uses the built-in similarity function with that mode. The default mode is SimilarityMode.DEFAULT.
  • By providing a custom similarity_fn callable that accepts two embedding vectors and returns a float. Note that similarity_mode and similarity_fn are mutually exclusive.
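As an illustration of the second option, a custom similarity function only needs to take two embedding vectors and return a float. The function below is a hypothetical example, not part of the library: it maps Euclidean distance into the range (0, 1] so that identical vectors score 1.0.

```python
import math
from typing import Sequence


def euclidean_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical custom similarity: 1 / (1 + Euclidean distance)."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


same = euclidean_similarity([1.0, 0.0], [1.0, 0.0])   # distance 0
apart = euclidean_similarity([0.0, 0.0], [3.0, 4.0])  # distance 5
print(same, apart)
```

A function like this would be passed as similarity_fn=euclidean_similarity, with similarity_mode left unset. Because its scale differs from cosine similarity, the default similarity_threshold of 0.8 would probably need recalibrating.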

Unlike LLM-based evaluators, this evaluator does not require an LLM and has no prompts (_get_prompts returns an empty dict). It ignores the query and contexts parameters, requiring only response and reference.

Usage

Use this evaluator when you need a fast, deterministic, and cost-effective way to evaluate response quality against reference answers. It is particularly useful for regression testing, large-scale evaluation benchmarks where LLM judge calls would be prohibitively expensive, or as a complementary metric alongside LLM-based evaluators.
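For the regression-testing case, one possible pattern is to score many (response, reference) pairs concurrently and track the pass rate. The sketch below is self-contained and makes several assumptions: `Result` is a hypothetical stand-in for the score and passing fields of EvaluationResult, and `stub_aevaluate` is a placeholder scorer based on token overlap, not embedding similarity. In practice the evaluator's aevaluate call would take its place.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for the score/passing fields of an evaluation result."""
    score: float
    passing: bool


async def stub_aevaluate(response: str, reference: str, threshold: float = 0.8) -> Result:
    """Placeholder scorer (token-set Jaccard overlap), NOT embedding similarity."""
    a, b = set(response.lower().split()), set(reference.lower().split())
    score = len(a & b) / len(a | b) if a | b else 0.0
    return Result(score=score, passing=score >= threshold)


async def pass_rate(pairs: list[tuple[str, str]]) -> float:
    # Score all pairs concurrently, then report the fraction that passed
    results = await asyncio.gather(*(stub_aevaluate(r, ref) for r, ref in pairs))
    return sum(r.passing for r in results) / len(results)


pairs = [
    ("the sky is blue", "the sky is blue"),   # identical: overlap 1.0
    ("the sky is green", "the sky is blue"),  # overlap 3/5 = 0.6, below threshold
]
rate = asyncio.run(pass_rate(pairs))
print(f"Pass rate: {rate:.2f}")
```

A drop in the pass rate between two runs over the same reference set is then a cheap signal that response quality may have regressed.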

Code Reference

Source Location

  • Repository: Run_llama_Llama_index
  • File: llama-index-core/llama_index/core/evaluation/semantic_similarity.py

Signature

class SemanticSimilarityEvaluator(BaseEvaluator):
    def __init__(
        self,
        embed_model: Optional[BaseEmbedding] = None,
        similarity_fn: Optional[Callable[..., float]] = None,
        similarity_mode: Optional[SimilarityMode] = None,
        similarity_threshold: float = 0.8,
    ) -> None: ...

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        reference: Optional[str] = None,
        **kwargs: Any,
    ) -> EvaluationResult: ...

Import

from llama_index.core.evaluation.semantic_similarity import SemanticSimilarityEvaluator

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| embed_model | Optional[BaseEmbedding] | No | The embedding model to use. Defaults to Settings.embed_model. |
| similarity_fn | Optional[Callable[..., float]] | No | Custom similarity function. Mutually exclusive with similarity_mode. |
| similarity_mode | Optional[SimilarityMode] | No | Similarity computation mode (e.g., cosine). Mutually exclusive with similarity_fn. Defaults to SimilarityMode.DEFAULT. |
| similarity_threshold | float | No | Minimum similarity score to pass. Defaults to 0.8. |
| response | str | Yes (aevaluate) | The generated response to evaluate. |
| reference | str | Yes (aevaluate) | The reference answer to compare against. |

Outputs

| Name | Type | Description |
|---|---|---|
| result | EvaluationResult | Contains score (the similarity score, a float), passing (a bool, true when the score meets the threshold), and feedback (a string reporting the similarity score). |

Usage Examples

import asyncio

from llama_index.core.evaluation.semantic_similarity import SemanticSimilarityEvaluator


async def main() -> None:
    # Create the evaluator; with no embed_model given, it uses Settings.embed_model
    evaluator = SemanticSimilarityEvaluator(
        similarity_threshold=0.8,
    )

    # Evaluate the response against the reference (query and contexts are ignored)
    result = await evaluator.aevaluate(
        response="Paris is the capital of France.",
        reference="The capital city of France is Paris.",
    )

    print(f"Score: {result.score}")       # e.g., 0.95
    print(f"Passing: {result.passing}")    # True when score >= 0.8
    print(f"Feedback: {result.feedback}")  # e.g., "Similarity score: 0.95"


# A bare top-level await only works in notebooks or async REPLs; scripts need asyncio.run
asyncio.run(main())
