
Principle:Deepset ai Haystack Semantic Answer Similarity Evaluation

From Leeroopedia

Overview

Semantic Answer Similarity (SAS) compares predicted answers to ground truth answers using embedding-based similarity rather than exact string matching. It avoids the fundamental limitation of lexical-overlap metrics by capturing semantic equivalence even when the wording differs.

Domains

  • Evaluation
  • NLP

Theoretical Foundation

Traditional answer evaluation metrics (exact match, F1, BLEU) rely on surface-level text overlap. This creates systematic failures:

  • "The capital of France is Paris" vs. "Paris" -- semantically equivalent but low lexical overlap.
  • "NYC" vs. "New York City" -- identical meaning, different tokens.
  • "He passed away in 1890" vs. "He died in 1890" -- paraphrases with different vocabulary.

SAS addresses these limitations by computing similarity in a learned semantic embedding space.
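To make the failure mode concrete, here is a minimal token-overlap F1, one of the lexical metrics described above. It scores the semantically equivalent pairs from the bullet list very low or at zero:

```python
def token_f1(pred: str, truth: str) -> float:
    """Token-overlap F1, the kind of lexical metric SAS is designed to replace."""
    p, t = pred.lower().split(), truth.lower().split()
    common = sum(min(p.count(w), t.count(w)) for w in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(t)
    return 2 * precision * recall / (precision + recall)

# Semantically equivalent answers, but only one shared token out of six:
print(token_f1("The capital of France is Paris", "Paris"))  # ~0.29
# Identical meaning, zero token overlap:
print(token_f1("NYC", "New York City"))  # 0.0
```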

Bi-Encoder Approach

Uses a Sentence Transformer (bi-encoder) model:

  1. Encode the predicted answer into an embedding vector.
  2. Encode the ground truth answer into an embedding vector.
  3. Compute cosine similarity between the two vectors.
SAS(pred, truth) = cosine_similarity(encode(pred), encode(truth))
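A minimal sketch of this computation. The vectors here are toy stand-ins; in practice they would come from a Sentence Transformer's encode step:

```python
import numpy as np

def sas_bi_encoder(pred_vec: np.ndarray, truth_vec: np.ndarray) -> float:
    """Cosine similarity between the two answer embeddings."""
    return float(np.dot(pred_vec, truth_vec)
                 / (np.linalg.norm(pred_vec) * np.linalg.norm(truth_vec)))

# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
pred_vec = np.array([0.9, 0.1, 0.2])
truth_vec = np.array([0.8, 0.2, 0.3])
print(round(sas_bi_encoder(pred_vec, truth_vec), 3))  # 0.983
```

Because each answer is encoded independently, the ground truth embeddings can be computed once and reused across every evaluation run.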

Advantages: Fast and efficient for batch evaluation; each answer's embedding can be pre-computed independently and reused.

Limitations: May miss nuanced differences that require direct token-level comparison between the two texts.

Cross-Encoder Approach

Uses a Cross-Encoder model:

  1. Concatenate the predicted and ground truth answers as a single input pair.
  2. The model jointly attends to both sequences and outputs a similarity score.
SAS(pred, truth) = cross_encoder([pred, truth])

Advantages: More accurate for subtle semantic distinctions, as the model can directly compare tokens across both sequences.

Limitations: Slower than bi-encoders, since every (prediction, ground truth) pair requires its own forward pass and nothing can be pre-computed independently.

Model Selection

The choice between bi-encoder and cross-encoder is determined automatically based on the model architecture:

  • Models with a ForSequenceClassification architecture head are treated as cross-encoders.
  • All other models are treated as bi-encoders (Sentence Transformers).
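This auto-detection can be sketched as a check on the architectures field of a Hugging Face model config. The helper name below is illustrative, not the evaluator's actual function:

```python
def is_cross_encoder(architectures: list[str]) -> bool:
    """Treat any model whose architecture name ends in
    'ForSequenceClassification' as a cross-encoder."""
    return any(a.endswith("ForSequenceClassification") for a in architectures)

print(is_cross_encoder(["RobertaForSequenceClassification"]))  # True
print(is_cross_encoder(["MPNetModel"]))                        # False
```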

Score Normalization

Cross-encoder scores may exceed the [0, 1] range. When raw scores exceed 1.0, a sigmoid (expit) function is applied to normalize them to [0, 1].
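A sketch of that rule, with a pure-Python sigmoid standing in for scipy's expit. The conditional mirrors the "only when raw scores exceed 1.0" behavior described above:

```python
import math

def normalize_scores(scores: list[float]) -> list[float]:
    """Apply a sigmoid only when any raw score falls outside [0, 1]."""
    if any(s > 1.0 for s in scores):
        return [1.0 / (1.0 + math.exp(-s)) for s in scores]
    return scores

print(normalize_scores([0.3, 0.9]))   # unchanged: already in [0, 1]
print(normalize_scores([2.0, -1.0]))  # squashed into [0, 1]
```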

When to Use SAS

  • RAG answer evaluation: When generated answers are semantically correct but phrased differently from ground truth.
  • Multilingual evaluation: When using multilingual embedding models that can compare answers across languages.
  • Paraphrase-tolerant evaluation: When exact match metrics would unfairly penalize correct but differently-worded answers.

Limitations

  • Requires a pre-trained embedding model, adding a dependency on model quality and availability.
  • Model-dependent: different models may produce different scores for the same pair.
  • May not capture domain-specific nuances without fine-tuning.
  • Requires GPU for efficient batch evaluation with large models.

Relationship to Implementation

In the Haystack framework, this principle is realized by the SASEvaluator component, which:

  • Supports both bi-encoder (SentenceTransformer) and cross-encoder models.
  • Auto-detects model type from the Hugging Face model configuration.
  • Computes per-pair similarity scores and aggregates to a mean SAS score.
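The aggregation in the last bullet reduces to a mean over the per-pair scores. A sketch of that output shape follows; the exact keys of the real SASEvaluator result may differ:

```python
def aggregate_sas(individual_scores: list[float]) -> dict:
    """Mean-aggregate per-pair SAS scores, keeping the individual values."""
    return {
        "individual_scores": individual_scores,
        "score": sum(individual_scores) / len(individual_scores),
    }

print(round(aggregate_sas([0.98, 0.75, 0.91])["score"], 2))  # 0.88
```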

Related Principles

  • Faithfulness Evaluation -- uses LLM judgment rather than embedding similarity.
  • Context Relevance Evaluation -- evaluates context quality rather than answer quality.
