Implementation:Run llama Llama index SemanticSimilarityEvaluator
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Similarity |
| Last Updated | 2026-02-11 19:00 GMT |
Overview
Evaluates the quality of a generated response by computing the embedding similarity between the response and a reference answer, without requiring an LLM judge.
Description
The SemanticSimilarityEvaluator is a concrete implementation of BaseEvaluator that measures response quality by comparing the semantic similarity of the generated response to a known reference answer. It is inspired by the paper "Semantic Answer Similarity for Evaluating Question Answering Models" (https://arxiv.org/pdf/2108.06130.pdf).
The evaluator works as follows:
- It embeds both the response and reference strings using the configured embedding model (BaseEmbedding).
- It computes a similarity score between the two embedding vectors using a configurable similarity function.
- It marks the result as passing when the similarity score meets or exceeds similarity_threshold.
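The three steps above can be sketched in plain Python. This is an illustrative sketch only, not the library's implementation: toy_embed is a hypothetical word-count stand-in for a real BaseEmbedding model, cosine similarity is assumed as the similarity function, and the returned dict only mimics the score, passing, and feedback fields of EvaluationResult.

```python
import math
from collections import Counter
from typing import Dict, List


def toy_embed(text: str, vocab: List[str]) -> List[float]:
    """Hypothetical stand-in for an embedding model: word counts over a fixed vocabulary."""
    counts = Counter(text.lower().replace(".", "").split())
    return [float(counts[word]) for word in vocab]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def evaluate(response: str, reference: str, threshold: float = 0.8) -> Dict[str, object]:
    """Embed both strings, score them, and compare the score to the threshold."""
    # Assumption: the vocabulary is built from the two inputs only.
    vocab = sorted(set((response + " " + reference).lower().replace(".", "").split()))
    # Step 1: embed both the response and the reference.
    response_vec = toy_embed(response, vocab)
    reference_vec = toy_embed(reference, vocab)
    # Step 2: compute a similarity score between the two vectors.
    score = cosine_similarity(response_vec, reference_vec)
    # Step 3: the result passes when the score meets or exceeds the threshold.
    return {
        "score": score,
        "passing": score >= threshold,
        "feedback": f"Similarity score: {score}",
    }
```

With a real embedding model the scores reflect meaning rather than word overlap, so paraphrases can score highly even when they share few words.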
The similarity function can be configured in two ways:
- By providing a similarity_mode (a SimilarityMode enum value) which uses the built-in similarity function with that mode. The default mode is SimilarityMode.DEFAULT.
- By providing a custom similarity_fn callable that accepts two embedding vectors and returns a float.
similarity_mode and similarity_fn are mutually exclusive: supply at most one of them.
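As an illustration of the second option, a custom similarity function might look like the sketch below. The name inverse_distance_similarity and the choice of metric are hypothetical; the only requirement taken from the description above is that the callable accepts two embedding vectors and returns a float.

```python
import math
from typing import Sequence


def inverse_distance_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical custom metric: maps Euclidean distance into the range (0, 1].

    Identical vectors score 1.0; the score approaches 0.0 as the vectors move apart.
    """
    return 1.0 / (1.0 + math.dist(a, b))


# Assumed usage, based on the constructor signature documented on this page:
# evaluator = SemanticSimilarityEvaluator(similarity_fn=inverse_distance_similarity)
# Do not pass similarity_mode at the same time; the two options are mutually exclusive.
```

When swapping in a custom metric, similarity_threshold usually needs retuning, since 0.8 on one scale does not mean the same thing on another.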
Unlike LLM-based evaluators, this evaluator does not require an LLM and has no prompts (_get_prompts returns an empty dict). It ignores the query and contexts parameters, requiring only response and reference.
Usage
Use this evaluator when you need a fast, low-cost way to evaluate response quality against reference answers. Because no LLM judge is involved, the score is repeatable to the extent that the embedding model returns the same vectors for the same text. It is particularly useful for regression testing, for large-scale evaluation benchmarks where LLM judge calls would be prohibitively expensive, and as a complementary metric alongside LLM-based evaluators.
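For the regression-testing case, one possible way to aggregate per-example scores is sketched below. ScoredCase and regression_report are hypothetical helpers, not part of the library; the scores are assumed to have been collected beforehand from the evaluator's results.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredCase:
    """Hypothetical record of one evaluated example and its similarity score."""

    name: str
    score: float


def regression_report(cases: List[ScoredCase], threshold: float = 0.8) -> Dict[str, object]:
    """Summarize which cases fall below the threshold and the overall pass rate."""
    # A case passes when its score meets or exceeds the threshold.
    failing = [case.name for case in cases if case.score < threshold]
    pass_rate = (len(cases) - len(failing)) / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failing": failing}
```

A test suite could then assert that pass_rate stays above an agreed floor and print the failing case names for inspection.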
Code Reference
Source Location
- Repository: Run_llama_Llama_index
- File: llama-index-core/llama_index/core/evaluation/semantic_similarity.py
Signature
class SemanticSimilarityEvaluator(BaseEvaluator):
    def __init__(
        self,
        embed_model: Optional[BaseEmbedding] = None,
        similarity_fn: Optional[Callable[..., float]] = None,
        similarity_mode: Optional[SimilarityMode] = None,
        similarity_threshold: float = 0.8,
    ) -> None: ...

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        reference: Optional[str] = None,
        **kwargs: Any,
    ) -> EvaluationResult: ...
Import
from llama_index.core.evaluation.semantic_similarity import SemanticSimilarityEvaluator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| embed_model | Optional[BaseEmbedding] | No | The embedding model to use. Defaults to Settings.embed_model. |
| similarity_fn | Optional[Callable[..., float]] | No | Custom similarity function. Mutually exclusive with similarity_mode. |
| similarity_mode | Optional[SimilarityMode] | No | Similarity computation mode (e.g., cosine) for the built-in similarity function. Mutually exclusive with similarity_fn. When no similarity_fn is given, falls back to SimilarityMode.DEFAULT. |
| similarity_threshold | float | No | Minimum similarity score to pass. Defaults to 0.8. |
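| response | Optional[str] | Yes (aevaluate) | The generated response to evaluate. Typed as optional in the signature, but must be provided. |
| reference | Optional[str] | Yes (aevaluate) | The reference answer to compare against. Typed as optional in the signature, but must be provided. |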
Outputs
| Name | Type | Description |
|---|---|---|
| result | EvaluationResult | Contains score (the similarity as a float), passing (bool, true when score >= similarity_threshold), and feedback (a string reporting the similarity score). |
Usage Examples
import asyncio

from llama_index.core.evaluation.semantic_similarity import SemanticSimilarityEvaluator

# Create the evaluator; with no embed_model argument it uses Settings.embed_model
evaluator = SemanticSimilarityEvaluator(
    similarity_threshold=0.8,
)


async def main() -> None:
    # Evaluate the response against the reference answer
    result = await evaluator.aevaluate(
        response="Paris is the capital of France.",
        reference="The capital city of France is Paris.",
    )
    print(f"Score: {result.score}")  # e.g., 0.95
    print(f"Passing: {result.passing}")  # True when score >= 0.8
    print(f"Feedback: {result.feedback}")  # e.g., "Similarity score: 0.95"


# aevaluate is a coroutine, so run it inside an event loop
asyncio.run(main())