Implementation: Deepset AI Haystack SASEvaluator
Overview
SASEvaluator is a Haystack evaluator component that computes the Semantic Answer Similarity (SAS) between predicted answers and ground truth answers. It uses pre-trained models from the Hugging Face model hub, supporting both bi-encoder (SentenceTransformer) and cross-encoder architectures.
Implements Principle
Principle:Deepset_ai_Haystack_Semantic_Answer_Similarity_Evaluation
Source Location
haystack/components/evaluators/sas_evaluator.py (Lines 20-189)
Import
from haystack.components.evaluators import SASEvaluator
Component Registration
SASEvaluator is decorated with @component, making it a standard Haystack pipeline component.
Dependencies
- sentence-transformers (version >= 5.0.0) -- Install via pip install "sentence-transformers>=5.0.0"
- transformers -- Used for AutoConfig to detect the model architecture.
- numpy -- Used for computing the mean score.
API
Constructor
def __init__(
self,
model: str = "sentence-transformers/paraphrase-multilingual-mpnet-base-v2",
batch_size: int = 32,
device: ComponentDevice | None = None,
token: Secret = Secret.from_env_var(["HF_API_TOKEN", "HF_TOKEN"], strict=False),
) -> None:
Parameters:
- model (str, default: "sentence-transformers/paraphrase-multilingual-mpnet-base-v2") -- Path or name of a SentenceTransformers or CrossEncoder model from the Hugging Face Hub.
- batch_size (int, default: 32) -- Number of prediction-label pairs to encode at once.
- device (ComponentDevice | None, default: None) -- The device on which the model is loaded. If None, the default device is automatically selected.
- token (Secret) -- Hugging Face token for HTTP bearer authorization. Resolved from the HF_API_TOKEN or HF_TOKEN environment variables.
warm_up()
def warm_up(self) -> None:
Initializes the similarity model. Loads either a CrossEncoder or SentenceTransformer based on the model's architecture configuration. This is called automatically if not invoked before run().
run()
def run(
self,
ground_truth_answers: list[str],
predicted_answers: list[str]
) -> dict[str, float | list[float]]:
Parameters:
- ground_truth_answers (list[str]) -- A list of expected answers for each question.
- predicted_answers (list[str]) -- A list of generated answers for each question.
Returns: A dictionary with the following keys:
- score (float) -- Mean SAS score over all prediction/ground-truth pairs.
- individual_scores (list[float]) -- A list of similarity scores for each pair.
Raises:
- ValueError -- If the number of predictions and labels differ, or if predicted answers contain None values.
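The validation rules above can be sketched as a standalone check. Note that `validate_inputs` is a hypothetical helper written for illustration, not the component's actual internal function:

```python
def validate_inputs(ground_truth_answers: list[str], predicted_answers: list[str]) -> None:
    """Mirror the SASEvaluator input checks: equal list lengths, no None predictions.

    Hypothetical helper for illustration; the real checks live inside run().
    """
    if len(ground_truth_answers) != len(predicted_answers):
        raise ValueError("The number of predictions and labels must be the same.")
    if any(answer is None for answer in predicted_answers):
        raise ValueError("Predicted answers must not contain None values.")

# A mismatched pair of lists triggers the first check:
try:
    validate_inputs(["Paris"], ["Paris", "Lyon"])
except ValueError as err:
    print(err)
```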
to_dict() / from_dict()
Serialization methods for pipeline export and import.
Algorithm
Model Type Detection
During warm_up(), the model architecture is inspected via AutoConfig:
- If any architecture name ends with "ForSequenceClassification", the model is loaded as a CrossEncoder.
- Otherwise, it is loaded as a SentenceTransformer.
Bi-Encoder (SentenceTransformer) Path
- Encode predicted answers into embeddings.
- Encode ground truth answers into embeddings.
- Compute cosine similarity for each pair.
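The bi-encoder scoring step can be illustrated with plain NumPy. The `pairwise_cosine` function below is an illustrative stand-in: the matrices play the role of the SentenceTransformer embeddings, and each row pair is scored with cosine similarity:

```python
import numpy as np

def pairwise_cosine(pred_emb: np.ndarray, truth_emb: np.ndarray) -> list[float]:
    """Cosine similarity between corresponding rows of two embedding matrices,
    as in the bi-encoder path (rows stand in for encoded answers)."""
    pred_norm = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    truth_norm = truth_emb / np.linalg.norm(truth_emb, axis=1, keepdims=True)
    return (pred_norm * truth_norm).sum(axis=1).tolist()

# Identical vectors score 1.0; orthogonal vectors score 0.0.
pred = np.array([[1.0, 0.0], [1.0, 0.0]])
truth = np.array([[1.0, 0.0], [0.0, 1.0]])
print(pairwise_cosine(pred, truth))  # [1.0, 0.0]
```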
Cross-Encoder Path
- Create sentence pairs from predicted and ground truth answers.
- Pass pairs through the cross-encoder's predict() method.
- If any scores exceed 1.0, apply sigmoid normalization (expit) to map to [0, 1].
Usage Example
from haystack.components.evaluators import SASEvaluator
evaluator = SASEvaluator(model="cross-encoder/ms-marco-MiniLM-L-6-v2")
evaluator.warm_up()
ground_truths = [
"A construction budget of US $2.3 billion",
"The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
"The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
]
predictions = [
"A construction budget of US $2.3 billion",
"The Eiffel Tower, completed in 1889, symbolizes Paris's cultural magnificence.",
"The Meiji Restoration in 1868 transformed Japan into a modernized world power.",
]
result = evaluator.run(
ground_truth_answers=ground_truths,
predicted_answers=predictions,
)
print(result["score"])
# 0.9999673763910929
print(result["individual_scores"])
# [0.9999765157699585, 0.999968409538269, 0.9999572038650513]
Integration in Evaluation Pipelines
from haystack import Pipeline
from haystack.components.evaluators import SASEvaluator
eval_pipeline = Pipeline()
eval_pipeline.add_component("sas", SASEvaluator())
results = eval_pipeline.run({
"sas": {
"ground_truth_answers": ground_truths,
"predicted_answers": predictions,
},
})
Important Notes
- warm_up() must be called: The model must be loaded before evaluation. If warm_up() has not been called explicitly, it is invoked automatically on the first run() call.
- Empty input handling: If the input lists are empty, the evaluator returns a score of 0.0 with [0.0] as individual scores.
- No None values: Predicted answers must not contain None values; a ValueError is raised otherwise.
- GPU recommended: For large evaluation batches, running on a GPU significantly improves performance. Use the device parameter to control placement.
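The empty-input fallback and mean aggregation described above can be sketched together. `aggregate` is a hypothetical helper that only mimics the shape of the run() output, not the library's code:

```python
import numpy as np

def aggregate(individual_scores: list[float]) -> dict:
    """Mimic the run() output shape: mean score plus per-pair scores.
    Empty inputs fall back to score 0.0 with [0.0] as individual scores."""
    if not individual_scores:
        return {"score": 0.0, "individual_scores": [0.0]}
    return {"score": float(np.mean(individual_scores)), "individual_scores": individual_scores}

print(aggregate([]))                   # {'score': 0.0, 'individual_scores': [0.0]}
print(aggregate([0.5, 1.0])["score"])  # 0.75
```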