Principle: deepset Haystack Semantic Answer Similarity Evaluation
Overview
Semantic Answer Similarity (SAS) compares predicted answers to ground truth answers using embedding-based similarity rather than exact string matching. It overcomes the fundamental limitations of lexical overlap metrics by capturing meaning equivalence even when the wording differs.
Domains
- Evaluation
- NLP
Theoretical Foundation
Traditional answer evaluation metrics (exact match, F1, BLEU) rely on surface-level text overlap. This creates systematic failures:
- "The capital of France is Paris" vs. "Paris" -- semantically equivalent but low lexical overlap.
- "NYC" vs. "New York City" -- identical meaning, different tokens.
- "He passed away in 1890" vs. "He died in 1890" -- paraphrases with different vocabulary.
SAS addresses these limitations by computing similarity in a learned semantic embedding space.
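The failure modes above can be made concrete with a token-level F1 computation (SQuAD-style); the whitespace tokenizer and lowercasing here are deliberate simplifications:

```python
# Token-level F1 on semantically equivalent pairs, showing how
# lexical-overlap metrics under-score correct answers.
from collections import Counter

def token_f1(pred: str, truth: str) -> float:
    pred_tokens = pred.lower().split()
    truth_tokens = truth.lower().split()
    # Multiset intersection counts shared tokens, respecting duplicates.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("The capital of France is Paris", "Paris"))  # ~0.29 despite equivalence
print(token_f1("NYC", "New York City"))                     # 0.0: no shared tokens
```

An embedding-based metric would score both pairs near the top of its range, which is precisely the gap SAS fills.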
Bi-Encoder Approach
Uses a Sentence Transformer (bi-encoder) model:
- Encode the predicted answer into an embedding vector.
- Encode the ground truth answer into an embedding vector.
- Compute cosine similarity between the two vectors.
SAS(pred, truth) = cosine_similarity(encode(pred), encode(truth))
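The three steps above can be sketched in plain Python; the vectors here are hand-made placeholders standing in for real Sentence Transformer embeddings (in practice they would come from a model such as all-MiniLM-L6-v2):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Placeholder embeddings; in practice: encode(pred) and encode(truth).
pred_embedding = [0.80, 0.10, 0.55]
truth_embedding = [0.75, 0.20, 0.60]

sas = cosine_similarity(pred_embedding, truth_embedding)  # close to 1.0
```

Because each text is encoded independently, embeddings can be computed once and reused; the comparison itself is cheap vector arithmetic.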
Advantages: Fast, efficient for batch evaluation. Embeddings can be pre-computed.
Limitations: May miss nuanced differences that require direct token-level comparison between the two texts.
Cross-Encoder Approach
Uses a Cross-Encoder model:
- Concatenate the predicted and ground truth answers as a single input pair.
- The model jointly attends to both sequences and outputs a similarity score.
SAS(pred, truth) = cross_encoder([pred, truth])
Advantages: More accurate for subtle semantic distinctions, as the model can directly compare tokens across both sequences.
Limitations: Slower in batch settings, because pair scores cannot be pre-computed or cached the way independent embeddings can.
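The cost trade-off can be illustrated with a back-of-the-envelope count of model forward passes for scoring one prediction against several acceptable gold answers (counting passes is a simplifying assumption that ignores per-pass cost differences):

```python
def bi_encoder_passes_per_prediction(n_references: int, cache_warm: bool) -> int:
    # Reference embeddings are computed once and reused; with a warm cache,
    # each new prediction costs a single encoding pass plus cheap cosines.
    return 1 if cache_warm else 1 + n_references

def cross_encoder_passes_per_prediction(n_references: int) -> int:
    # Every (prediction, reference) pair needs its own joint forward pass;
    # nothing can be pre-computed independently of the pairing.
    return n_references

print(bi_encoder_passes_per_prediction(50, cache_warm=True))   # 1
print(cross_encoder_passes_per_prediction(50))                 # 50
```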
Model Selection
The choice between bi-encoder and cross-encoder is determined automatically based on the model architecture:
- Models with a ForSequenceClassification architecture head are treated as cross-encoders.
- All other models are treated as bi-encoders (Sentence Transformers).
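The detection logic can be sketched as a check on the architectures field of a model configuration; a plain dict stands in here for a Hugging Face config object, and the function name is illustrative rather than Haystack's actual internal name:

```python
def is_cross_encoder(model_config: dict) -> bool:
    # Hugging Face configs list architecture class names, e.g.
    # ["BertForSequenceClassification"] or ["MPNetModel"].
    architectures = model_config.get("architectures") or []
    return any("ForSequenceClassification" in arch for arch in architectures)

print(is_cross_encoder({"architectures": ["RobertaForSequenceClassification"]}))  # True
print(is_cross_encoder({"architectures": ["MPNetModel"]}))                        # False
```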
Score Normalization
Cross-encoder scores may exceed the [0, 1] range. When raw scores exceed 1.0, a sigmoid (expit) function is applied to normalize them to [0, 1].
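A sketch of that normalization step, using the logistic sigmoid written out with math.exp to stay dependency-free (scipy.special.expit computes the same function); applying the sigmoid to the whole batch once any score exceeds 1.0 is an assumption about the batching behavior:

```python
import math

def expit(x: float) -> float:
    # Logistic sigmoid: maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def normalize_scores(scores: list[float]) -> list[float]:
    # Rescale only when raw cross-encoder outputs fall outside [0, 1];
    # already-normalized scores pass through unchanged.
    if any(score > 1.0 for score in scores):
        return [expit(score) for score in scores]
    return scores

print(normalize_scores([0.3, 0.9]))         # unchanged
print(normalize_scores([4.2, -1.5, 0.8]))   # all mapped into (0, 1)
```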
When to Use SAS
- RAG answer evaluation: When generated answers are semantically correct but phrased differently from ground truth.
- Multilingual evaluation: When using multilingual embedding models that can compare answers across languages.
- Paraphrase-tolerant evaluation: When exact match metrics would unfairly penalize correct but differently-worded answers.
Limitations
- Requires a pre-trained embedding model, adding a dependency on model quality and availability.
- Model-dependent: different models may produce different scores for the same pair.
- May not capture domain-specific nuances without fine-tuning.
- Requires GPU for efficient batch evaluation with large models.
Relationship to Implementation
In the Haystack framework, this principle is realized by the SASEvaluator component, which:
- Supports both bi-encoder (SentenceTransformer) and cross-encoder models.
- Auto-detects model type from the Hugging Face model configuration.
- Computes per-pair similarity scores and aggregates to a mean SAS score.
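A usage sketch of that component follows; the import path, parameter names, default model, and output keys reflect Haystack 2.x at the time of writing and should be verified against the current documentation:

```python
# Hedged sketch of SASEvaluator usage, not a definitive reference.
from haystack.components.evaluators import SASEvaluator

evaluator = SASEvaluator(
    model="sentence-transformers/paraphrase-multilingual-mpnet-base-v2"
)
evaluator.warm_up()  # loads the embedding model

result = evaluator.run(
    ground_truth_answers=["Paris"],
    predicted_answers=["The capital of France is Paris"],
)
print(result["score"])              # mean SAS over all pairs
print(result["individual_scores"])  # one similarity score per pair
```

Both answer lists must be the same length, since scores are computed pairwise by position.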
Related Principles
- Faithfulness Evaluation -- uses LLM judgment rather than embedding similarity.
- Context Relevance Evaluation -- evaluates context quality rather than answer quality.
References
- Reimers, N. & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." EMNLP.
- Sentence Transformers Documentation