Implementation:Vibrantlabsai Ragas SummarizationScore
| Knowledge Sources | |
|---|---|
| Domains | Evaluation, Metrics |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
SummarizationScore is an LLM-based metric that evaluates the quality of a generated summary by measuring how well it preserves key information from the source text and how concise it is.
Description
The SummarizationScore metric implements a multi-step, LLM-driven evaluation pipeline for assessing summarization quality. The algorithm operates through three sequential LLM-prompting stages:
- Keyphrase Extraction -- The ExtractKeyphrasePrompt identifies key entities from the source text, including persons, organizations, locations, dates/times, monetary values, and percentages.
- Question Generation -- The GenerateQuestionsPrompt creates closed-ended (yes/no) questions based on the extracted keyphrases and source text. These questions are designed so that the answer is always "1" (yes) when evaluated against the original source.
- Answer Generation -- The GenerateAnswersPrompt evaluates whether the generated summary contains enough information to answer each question, producing "1" or "0" for each.
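The three stages above can be sketched as a chain of callables. This is only an illustration of the data flow: `run_pipeline` and the `toy_*` functions are hypothetical rule-based stand-ins, not part of Ragas, and they replace the LLM prompts with simple substring checks.

```python
from typing import Callable, List


def run_pipeline(
    text: str,
    summary: str,
    extract_keyphrases: Callable[[str], List[str]],
    generate_questions: Callable[[str, List[str]], List[str]],
    generate_answers: Callable[[str, List[str]], List[str]],
) -> List[str]:
    """Chain the three stages; each callable stands in for one LLM prompt."""
    keyphrases = extract_keyphrases(text)             # stage 1: source text only
    questions = generate_questions(text, keyphrases)  # stage 2: source text + keyphrases
    return generate_answers(summary, questions)       # stage 3: summary + questions


# Toy stand-ins (assumptions for illustration; the real metric prompts an LLM).
def toy_extract(text: str) -> List[str]:
    candidates = ["Apple Inc.", "Steve Jobs", "1976", "Cupertino"]
    return [k for k in candidates if k in text]


def toy_questions(text: str, keyphrases: List[str]) -> List[str]:
    return [f"Does the text mention {k}?" for k in keyphrases]


def toy_answers(summary: str, questions: List[str]) -> List[str]:
    prefix = "Does the text mention "
    answers = []
    for q in questions:
        phrase = q[len(prefix):-1]  # drop the prefix and the trailing "?"
        answers.append("1" if phrase in summary else "0")
    return answers


source = "Apple Inc. is based in Cupertino. It was founded by Steve Jobs in 1976."
summary = "Apple Inc. was founded by Steve Jobs."
answers = run_pipeline(source, summary, toy_extract, toy_questions, toy_answers)
print(answers)  # ['1', '1', '0', '0']
```

The summary omits the founding year and the location, so two of the four questions come back as "0".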
The final score is a weighted combination of two components:
- QA Score -- The fraction of questions that can be answered from the summary (correct_answers / total_questions). This measures information retention.
- Conciseness Score -- Computed as 1 - min(len(summary), len(text)) / (len(text) + 1e-10), where len is the character length of each string. The score shrinks toward 0 as the summary approaches the length of the source text and is effectively 0 for a summary that is as long as or longer than the source. It is included only when length_penalty is enabled (default: True).
The combined score is: qa_score * (1 - coeff) + conciseness_score * coeff where coeff defaults to 0.5.
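A minimal sketch of the arithmetic, written as standalone functions rather than the metric's own methods. The function names are illustrative, and the empty-list guard in `qa_score` is an assumption added here so the sketch never divides by zero.

```python
from typing import List


def qa_score(answers: List[str]) -> float:
    """Fraction of questions answered "1" (answerable from the summary)."""
    if not answers:  # assumption: treat "no questions" as zero retention
        return 0.0
    return sum(a == "1" for a in answers) / len(answers)


def conciseness_score(text: str, summary: str) -> float:
    """1 minus the summary-to-source length ratio, capped so it never goes negative."""
    return 1 - min(len(summary), len(text)) / (len(text) + 1e-10)


def combined_score(qa: float, conciseness: float, coeff: float = 0.5) -> float:
    """Weighted combination; coeff is the weight given to conciseness."""
    return qa * (1 - coeff) + conciseness * coeff


text = "x" * 200     # stand-in source of 200 characters
summary = "x" * 80   # stand-in summary of 80 characters
answers = ["1", "1", "1", "0"]

qa = qa_score(answers)                      # 3 of 4 questions answerable
conc = conciseness_score(text, summary)     # roughly 1 - 80/200
print(round(combined_score(qa, conc), 3))   # 0.675
```

With the default coeff of 0.5, a QA score of 0.75 and a conciseness score of about 0.6 average to roughly 0.675. A summary that simply copies the source would answer every question but earn a conciseness score near 0, which caps its combined score at about 0.5.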
Usage
Use SummarizationScore to evaluate whether LLM-generated summaries capture the essential facts from the original text. This metric is particularly useful for assessing abstractive or extractive summarization pipelines. It requires an LLM to be configured (via the MetricWithLLM mixin) and expects reference_contexts (the source documents) and response (the summary) in the input sample.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: src/ragas/metrics/_summarization.py
Signature
@dataclass
class SummarizationScore(MetricWithLLM, SingleTurnMetric):
    name: str = "summary_score"
    max_retries: int = 1
    length_penalty: bool = True
    _required_columns: t.Dict[MetricType, t.Set[str]] = field(
        default_factory=lambda: {
            MetricType.SINGLE_TURN: {
                "reference_contexts",
                "response",
            }
        }
    )
    output_type: t.Optional[MetricOutputType] = MetricOutputType.CONTINUOUS
    coeff: float = 0.5
    question_generation_prompt: PydanticPrompt = field(
        default_factory=GenerateQuestionsPrompt
    )
    answer_generation_prompt: PydanticPrompt = field(
        default_factory=GenerateAnswersPrompt
    )
    extract_keyphrases_prompt: PydanticPrompt = field(
        default_factory=ExtractKeyphrasePrompt
    )
Import
from ragas.metrics._summarization import SummarizationScore, summarization_score
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| reference_contexts | List[str] | Yes | The source text passages to be summarized (joined with newlines internally) |
| response | str | Yes | The generated summary to evaluate |
| length_penalty | bool | No | Whether to apply a conciseness penalty (default: True) |
| coeff | float | No | Weight for the conciseness score in the final combination (default: 0.5) |
| max_retries | int | No | Maximum number of retries for LLM calls (default: 1) |
Outputs
| Name | Type | Description |
|---|---|---|
| score | float | A continuous score between 0.0 and 1.0 representing summarization quality. Higher scores indicate better information retention and conciseness. |
Internal Prompts
The metric uses three Pydantic-based prompt classes:
| Prompt Class | Purpose | Input | Output |
|---|---|---|---|
| ExtractKeyphrasePrompt | Extracts key entities from source text | StringIO(text) | ExtractedKeyphrases(keyphrases) |
| GenerateQuestionsPrompt | Generates yes/no questions from text and keyphrases | GenerateQuestionsPromptInput(text, keyphrases) | QuestionsGenerated(questions) |
| GenerateAnswersPrompt | Evaluates whether summary can answer questions | SummaryAndQuestions(summary, questions) | AnswersGenerated(answers) |
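To make the hand-off between prompts concrete, the sketch below mirrors the input and output shapes from the table using plain dataclasses. These are stand-ins, not the library's Pydantic models: the class and field names follow the table, but the field types are assumptions, and the values are hand-written rather than produced by an LLM.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StringIO:  # stand-in for the prompt input model; unrelated to io.StringIO
    text: str


@dataclass
class ExtractedKeyphrases:
    keyphrases: List[str]


@dataclass
class GenerateQuestionsPromptInput:
    text: str
    keyphrases: List[str]


@dataclass
class QuestionsGenerated:
    questions: List[str]


@dataclass
class SummaryAndQuestions:
    summary: str
    questions: List[str]


@dataclass
class AnswersGenerated:
    answers: List[str]  # one "1" or "0" per question


# Hand-written values showing how each output feeds the next input.
source = StringIO(text="Apple Inc. was founded by Steve Jobs in 1976.")
extracted = ExtractedKeyphrases(keyphrases=["Apple Inc.", "Steve Jobs", "1976"])
question_input = GenerateQuestionsPromptInput(
    text=source.text, keyphrases=extracted.keyphrases
)
generated = QuestionsGenerated(
    questions=[
        "Was Apple Inc. founded by Steve Jobs?",
        "Was Apple Inc. founded in 1976?",
    ]
)
answer_input = SummaryAndQuestions(
    summary="Steve Jobs founded Apple Inc.", questions=generated.questions
)
result = AnswersGenerated(answers=["1", "0"])
```

Note that only the first two prompts see the source text; the answer prompt receives the summary and the questions, which is what lets the answers measure how much information the summary retained.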
Usage Examples
Basic Usage
import asyncio

from ragas.dataset_schema import SingleTurnSample
from ragas.metrics._summarization import SummarizationScore

# Initialize the metric; an LLM must be attached before scoring
metric = SummarizationScore()
# metric.llm = your_llm_instance

sample = SingleTurnSample(
    reference_contexts=[
        "Apple Inc. is a technology company based in Cupertino, California. "
        "Founded by Steve Jobs in 1976, it reached a market capitalization "
        "of $3 trillion in 2023."
    ],
    response="Apple Inc., founded by Steve Jobs in 1976, is a Cupertino-based "
    "tech company valued at $3 trillion as of 2023.",
)


async def main():
    # Scoring is asynchronous, so it must be awaited inside a coroutine
    score = await metric._single_turn_ascore(sample, callbacks=None)
    print(f"Summarization score: {score}")


asyncio.run(main())
Using the Pre-instantiated Instance
from ragas.metrics._summarization import summarization_score
# summarization_score is a pre-instantiated SummarizationScore()
# Set the LLM before using
# summarization_score.llm = your_llm_instance
Disabling Length Penalty
from ragas.metrics._summarization import SummarizationScore
# Only evaluate information retention without conciseness penalty
metric = SummarizationScore(length_penalty=False)