Principle: deepset Haystack Semantic Answer Similarity Evaluation
Overview
Semantic Answer Similarity (SAS) compares predicted answers to ground truth answers using embedding-based similarity rather than exact string matching. It overcomes the fundamental limitations of lexical overlap metrics by capturing meaning equivalence even when the wording differs.
Domains
- Evaluation
- NLP
Theoretical Foundation
Traditional answer evaluation metrics (exact match, F1, BLEU) rely on surface-level text overlap. This creates systematic failures:
- "The capital of France is Paris" vs. "Paris" -- semantically equivalent but low lexical overlap.
- "NYC" vs. "New York City" -- identical meaning, different tokens.
- "He passed away in 1890" vs. "He died in 1890" -- paraphrases with different vocabulary.
SAS addresses these limitations by computing similarity in a learned semantic embedding space.
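The failure modes above can be made concrete with a token-level F1 computation (SQuAD-style); the whitespace tokenizer and lowercasing here are deliberate simplifications:

```python
# Token-level F1 on semantically equivalent pairs, showing how
# lexical-overlap metrics under-score correct answers.
from collections import Counter

def token_f1(pred: str, truth: str) -> float:
    pred_tokens = pred.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset intersection counts shared tokens, respecting duplicates.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The capital of France is Paris", "Paris"))  # ~0.29 despite equivalence
print(token_f1("NYC", "New York City"))                     # 0.0: no shared tokens
```

An embedding-based metric would score both pairs near the top of its range, which is precisely the gap SAS fills.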
Bi-Encoder Approach
Uses a Sentence Transformer (bi-encoder) model:
- Encode the predicted answer into an embedding vector.
- Encode the ground truth answer into an embedding vector.
- Compute cosine similarity between the two vectors.
SAS(pred, truth) = cosine_similarity(encode(pred), encode(truth))
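The three steps above can be sketched in plain Python; the vectors here are hand-made placeholders standing in for real Sentence Transformer embeddings (in practice they would come from a model such as all-MiniLM-L6-v2):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder embeddings; in practice: encode(pred) and encode(truth).
pred_embedding = [0.80, 0.10, 0.55]
truth_embedding = [0.75, 0.20, 0.60]

sas = cosine_similarity(pred_embedding, truth_embedding)  # close to 1.0
```

Because each text is encoded independently, embeddings can be computed once and reused; the comparison itself is cheap vector arithmetic.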
Advantages: Fast, efficient for batch evaluation. Embeddings can be pre-computed.
Limitations: May miss nuanced differences that require direct token-level comparison between the two texts.
Cross-Encoder Approach
Uses a Cross-Encoder model:
- Concatenate the predicted and ground truth answers as a single input pair.
- The model jointly attends to both sequences and outputs a similarity score.
SAS(pred, truth) = cross_encoder([pred, truth])
Advantages: More accurate for subtle semantic distinctions, as the model can directly compare tokens across both sequences.
Limitations: Slower in batch settings, because pair scores cannot be pre-computed or cached the way independent embeddings can.
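The cost trade-off can be illustrated with a back-of-the-envelope count of model forward passes for scoring one prediction against several acceptable gold answers (counting passes is a simplifying assumption that ignores per-pass cost differences):

```python
def bi_encoder_passes_per_prediction(n_references: int, cache_warm: bool) -> int:
    # Reference embeddings are computed once and reused; with a warm cache,
    # each new prediction costs a single encoding pass plus cheap cosines.
    return 1 if cache_warm else 1 + n_references

def cross_encoder_passes_per_prediction(n_references: int) -> int:
    # Every (prediction, reference) pair needs its own joint forward pass;
    # nothing can be pre-computed independently of the pairing.
    return n_references

print(bi_encoder_passes_per_prediction(50, cache_warm=True))   # 1
print(cross_encoder_passes_per_prediction(50))                 # 50
```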
Model Selection
The choice between bi-encoder and cross-encoder is determined automatically based on the model architecture:
- Models with a ForSequenceClassification architecture head are treated as cross-encoders.
- All other models are treated as bi-encoders (Sentence Transformers).
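The detection logic can be sketched as a check on the architectures field of a model configuration; a plain dict stands in here for a Hugging Face config object, and the function name is illustrative rather than Haystack's actual internal name:

```python
def is_cross_encoder(model_config: dict) -> bool:
    # Hugging Face configs list architecture class names, e.g.
    # ["BertForSequenceClassification"] or ["MPNetModel"].
    architectures = model_config.get("architectures") or []
    return any("ForSequenceClassification" in arch for arch in architectures)

print(is_cross_encoder({"architectures": ["RobertaForSequenceClassification"]}))  # True
print(is_cross_encoder({"architectures": ["MPNetModel"]}))                        # False
```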
Score Normalization
Cross-encoder scores may exceed the [0, 1] range. When raw scores exceed 1.0, a sigmoid (expit) function is applied to normalize them to [0, 1].
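A sketch of that normalization step, using the logistic sigmoid written out with math.exp to stay dependency-free (scipy.special.expit computes the same function); applying the sigmoid to the whole batch once any score exceeds 1.0 is an assumption about the batching behavior:

```python
import math

def expit(x: float) -> float:
    # Logistic sigmoid: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def normalize_scores(scores: list[float]) -> list[float]:
    # Rescale only when raw cross-encoder outputs fall outside [0, 1];
    # already-normalized scores pass through unchanged.
    if any(score > 1.0 for score in scores):
        return [expit(score) for score in scores]
    return scores

print(normalize_scores([0.3, 0.9]))         # unchanged
print(normalize_scores([4.2, -1.5, 0.8]))   # all mapped into (0, 1)
```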
When to Use SAS
- RAG answer evaluation: When generated answers are semantically correct but phrased differently from ground truth.
- Multilingual evaluation: When using multilingual embedding models that can compare answers across languages.
- Paraphrase-tolerant evaluation: When exact match metrics would unfairly penalize correct but differently-worded answers.
Limitations
- Requires a pre-trained embedding model, adding a dependency on model quality and availability.
- Model-dependent: different models may produce different scores for the same pair.
- May not capture domain-specific nuances without fine-tuning.
- Requires GPU for efficient batch evaluation with large models.
Relationship to Implementation
In the Haystack framework, this principle is realized by the SASEvaluator component, which:
- Supports both bi-encoder (SentenceTransformer) and cross-encoder models.
- Auto-detects model type from the Hugging Face model configuration.
- Computes per-pair similarity scores and aggregates to a mean SAS score.
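A usage sketch of that component follows; the import path, parameter names, default model, and output keys reflect Haystack 2.x at the time of writing and should be verified against the current documentation:

```python
# Hedged sketch of SASEvaluator usage, not a definitive reference.
from haystack.components.evaluators import SASEvaluator

evaluator = SASEvaluator(
    model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)
evaluator.warm_up()  # loads the embedding model

result = evaluator.run(
    ground_truth_answers=["Paris"],
    predicted_answers=["The capital of France is Paris"],
)
print(result["score"])              # mean SAS over all pairs
print(result["individual_scores"])  # one similarity score per pair
```

Both answer lists must be the same length, since scores are computed pairwise by position.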
Related Principles
- Faithfulness Evaluation -- uses LLM judgment rather than embedding similarity.
- Context Relevance Evaluation -- evaluates context quality rather than answer quality.
References
- Reimers, N. & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP.
- Sentence Transformers Documentation