Implementation: Deepset AI Haystack SASEvaluator
Overview
SASEvaluator is a Haystack evaluator component that computes the Semantic Answer Similarity (SAS) between predicted answers and ground truth answers. It uses pre-trained models from the Hugging Face model hub, supporting both bi-encoder (SentenceTransformer) and cross-encoder architectures.
Implements Principle
Principle:Deepset_ai_Haystack_Semantic_Answer_Similarity_Evaluation
Source Location
haystack/components/evaluators/sas_evaluator.py (Lines 20-189)
Import
from haystack.components.evaluators import SASEvaluator
Component Registration
SASEvaluator is decorated with @component, making it a standard Haystack pipeline component.
Dependencies
- sentence-transformers (version >= 5.0.0) -- Install via pip install "sentence-transformers>=5.0.0"
- transformers -- Used for AutoConfig to detect the model architecture.
- numpy -- Used for computing the mean score.
API
Constructor
def __init__(
self,
model: str = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
batch_size: int = 32,
device: ComponentDevice | None = None,
token: Secret = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False),
) -> None:
Parameters:
- model (str, default: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2") -- Path or name of a SentenceTransformers or CrossEncoder model from the Hugging Face Hub.
- batch_size (int, default: 32) -- Number of prediction-label pairs to encode at once.
- device (ComponentDevice | None, default: None) -- The device on which the model is loaded. If None, the default device is automatically selected.
- token (Secret) -- Hugging Face token for HTTP bearer authorization. Resolved from the HF_API_TOKEN or HF_TOKEN environment variables.
warm_up()
def warm_up(self) -> None:
Initializes the similarity model. Loads either a CrossEncoder or SentenceTransformer based on the model's architecture configuration. This is called automatically if not invoked before run().
run()
def run(
self,
ground_truth_answers: list[str],
predicted_answers: list[str]
) -> dict[str, float | list[float]]:
Parameters:
- ground_truth_answers (list[str]) -- A list of expected answers for each question.
- predicted_answers (list[str]) -- A list of generated answers for each question.
Returns: A dictionary with the following keys:
- score (float) -- Mean SAS score over all prediction/ground-truth pairs.
- individual_scores (list[float]) -- A list of similarity scores for each pair.
Raises:
- ValueError -- If the number of predictions and labels differ, or if predicted answers contain None values.
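The validation rules above can be sketched as a standalone check. Note that `validate_inputs` is a hypothetical helper written for illustration, not the component's actual internal function:

```python
def validate_inputs(ground_truth_answers: list[str], predicted_answers: list[str]) -> None:
    """Mirror the SASEvaluator input checks: equal list lengths, no None predictions.

    Hypothetical helper for illustration; the real checks live inside run().
    """
    if len(ground_truth_answers) != len(predicted_answers):
        raise ValueError("The number of predictions and labels must be the same.")
    if any(answer is None for answer in predicted_answers):
        raise ValueError("Predicted answers must not contain None values.")

# A mismatched pair of lists triggers the first check:
try:
    validate_inputs(["Paris"], ["Paris", "Lyon"])
except ValueError as err:
    print(err)
```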
to_dict() / from_dict()
Serialization methods for pipeline export and import.
Algorithm
Model Type Detection
During warm_up(), the model architecture is inspected via AutoConfig:
- If any architecture name ends with "ForSequenceClassification", the model is loaded as a CrossEncoder.
- Otherwise, it is loaded as a SentenceTransformer.
Bi-Encoder (SentenceTransformer) Path
- Encode predicted answers into embeddings.
- Encode ground truth answers into embeddings.
- Compute cosine similarity for each pair.
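The bi-encoder scoring step can be illustrated with plain NumPy. The `pairwise_cosine` function below is an illustrative stand-in: the matrices play the role of the SentenceTransformer embeddings, and each row pair is scored with cosine similarity:

```python
import numpy as np

def pairwise_cosine(pred_emb: np.ndarray, truth_emb: np.ndarray) -> list[float]:
    """Cosine similarity between corresponding rows of two embedding matrices,
    as in the bi-encoder path (rows stand in for encoded answers)."""
    pred_norm = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    truth_norm = truth_emb / np.linalg.norm(truth_emb, axis=1, keepdims=True)
    return (pred_norm * truth_norm).sum(axis=1).tolist()

# Identical vectors score 1.0; orthogonal vectors score 0.0.
pred = np.array([[1.0, 0.0], [1.0, 0.0]])
truth = np.array([[1.0, 0.0], [0.0, 1.0]])
print(pairwise_cosine(pred, truth))  # [1.0, 0.0]
```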
Cross-Encoder Path
- Create sentence pairs from predicted and ground truth answers.
- Pass pairs through the cross-encoder's predict() method.
- If any scores exceed 1.0, apply sigmoid normalization (expit) to map to [0, 1].
Usage Example
from haystack.components.evaluators import SASEvaluator
evaluator = SASEvaluator(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
evaluator.warm_up()
ground_truths = [
"A construction budget of US $2.3 billion",
"The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
"The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
]
predictions = [
"A construction budget of US $2.3 billion",
"The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
"The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
]
result = evaluator.run(
ground_truth_answers=ground_truths,
predicted_answers=predictions,
)
print(result["score"])
# 0.9999673763910929
print(result["individual_scores"])
# [0.9999765157699585, 0.999968409538269, 0.9999572038650513]
Integration in Evaluation Pipelines
from haystack import Pipeline
from haystack.components.evaluators import SASEvaluator
eval_pipeline = Pipeline()
eval_pipeline.add_component("sas", SASEvaluator())
results = eval_pipeline.run({
"sas": {
"ground_truth_answers": ground_truths,
"predicted_answers": predictions,
},
})
Important Notes
- warm_up() must be called: The model must be loaded before evaluation. If warm_up() has not been called explicitly, it is invoked automatically on the first run() call.
- Empty input handling: If the input lists are empty, the evaluator returns a score of 0.0 with [0.0] as individual scores.
- No None values: Predicted answers must not contain None values; a ValueError is raised otherwise.
- GPU recommended: For large evaluation batches, running on a GPU significantly improves performance. Use the device parameter to control placement.
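The empty-input fallback and mean aggregation described above can be sketched together. `aggregate` is a hypothetical helper that only mimics the shape of the run() output, not the library's code:

```python
import numpy as np

def aggregate(individual_scores: list[float]) -> dict:
    """Mimic the run() output shape: mean score plus per-pair scores.
    Empty inputs fall back to score 0.0 with [0.0] as individual scores."""
    if not individual_scores:
        return {"score": 0.0, "individual_scores": [0.0]}
    return {"score": float(np.mean(individual_scores)), "individual_scores": individual_scores}

print(aggregate([]))                   # {'score': 0.0, 'individual_scores': [0.0]}
print(aggregate([0.5, 1.0])["score"])  # 0.75
```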