Implementation:Vibrantlabsai Ragas Annotated Summary Sample
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Sample Data, Summary Accuracy |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A sample annotated summary dataset in JSON format used for demonstrating and testing the Ragas summary_accuracy metric with human-reviewed annotations.
Description
This file contains annotated evaluation samples organized under the key summary_accuracy. Each sample evaluates whether an LLM-generated summary accurately captures the key information from an original text passage. The dataset focuses on business and financial content including earnings reports, market analyses, supply chain discussions, marketing strategies, and regional growth trends.
Key characteristics of this data:
- Summarization-focused evaluation: Each sample contains a user_input field with the instruction "summarise given text" followed by the source text, and a response field with the generated summary.
- No reference answer: Unlike the answer correctness data, these samples evaluate summaries without a separate reference, relying on the source text itself as the ground truth.
- Business domain content: The passages cover topics such as Q2 earnings reports, European market expansion, supply chain challenges, marketing campaign shifts, logistics investments, and market share analysis.
- Acceptance filtering: A notable proportion of samples have is_accepted set to false, indicating they were filtered out during the annotation process due to quality concerns or disagreements.
The dataset demonstrates how summary accuracy evaluation differs from other metric types: the judge must determine whether the summary faithfully represents the source text without omitting critical details such as specific percentages, time periods, or geographic regions.
Usage
This data file is used in the Ragas documentation to illustrate how annotated data for the summary accuracy metric is structured. It provides reference examples for users who want to build their own annotated datasets for evaluating and aligning summarization quality metrics.
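For readers building their own annotated dataset, the sketch below assembles one record with the same field names as the sample file and serializes it. The helper `make_sample` is hypothetical (it is not part of Ragas), and the passage, summary, and reason strings are invented placeholders rather than entries from the file.

```python
import json

INSTRUCTION = "summarise given text"  # instruction prefix used in the sample file


def make_sample(source_text, summary, verdict, reason, is_accepted=True):
    """Build one annotated record (hypothetical helper, not part of Ragas)."""
    user_input = f"{INSTRUCTION}\n{source_text}"
    return {
        "metric_input": {"user_input": user_input, "response": summary},
        "metric_output": verdict,
        "prompts": {
            "single_turn_aspect_critic_prompt": {
                "prompt_input": {
                    "user_input": user_input,
                    "response": summary,
                    # Context and reference fields stay null for summarization.
                    "retrieved_contexts": None,
                    "reference_contexts": None,
                    "reference": None,
                },
                "prompt_output": {"reason": reason, "verdict": verdict},
                "edited_output": None,  # no human correction recorded
            }
        },
        "is_accepted": is_accepted,
    }


# Placeholder content for illustration only.
dataset = {
    "summary_accuracy": [
        make_sample(
            source_text="Revenue rose 15% in Q2, driven by European sales.",
            summary="Q2 revenue grew 15% on strong European sales.",
            verdict=1,
            reason="The summary keeps the percentage, period, and region.",
        )
    ]
}

# Python None is written as JSON null, matching the sample file.
serialized = json.dumps(dataset, indent=2)
```

Writing `serialized` to a file yields a document with the same top-level summary_accuracy key as the sample.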
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: docs/_static/sample_annotated_summary.json
Data Schema
{
  "summary_accuracy": [
    {
      "metric_input": {
        "user_input": "summarise given text\nThe Q2 earnings report revealed...",
        "response": "The Q2 earnings report showed a 15% revenue increase..."
      },
      "metric_output": 1,
      "prompts": {
        "single_turn_aspect_critic_prompt": {
          "prompt_input": {
            "user_input": "summarise given text\n...",
            "response": "...",
            "retrieved_contexts": null,
            "reference_contexts": null,
            "reference": null
          },
          "prompt_output": {
            "reason": "The summary accurately captures the key points...",
            "verdict": 1
          },
          "edited_output": null
        }
      },
      "is_accepted": true
    }
  ]
}
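The schema can be checked mechanically before a record is used. The following is a minimal sketch, assuming the field names and types described on this page; `validate_sample` is a hypothetical helper, not a Ragas function, and it checks only a subset of the structure.

```python
REQUIRED_TOP_LEVEL = ("metric_input", "metric_output", "prompts", "is_accepted")
PROMPT_KEY = "single_turn_aspect_critic_prompt"


def validate_sample(sample):
    """Return a list of problems found in one record (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_TOP_LEVEL if k not in sample]
    if problems:
        return problems

    # metric_input carries the instruction plus source text and the summary.
    metric_input = sample["metric_input"]
    if not isinstance(metric_input, dict):
        metric_input = {}
    for field in ("user_input", "response"):
        if not isinstance(metric_input.get(field), str):
            problems.append(f"metric_input.{field} should be a string")

    if sample["metric_output"] not in (0, 1):
        problems.append("metric_output should be 0 or 1")

    if not isinstance(sample["is_accepted"], bool):
        problems.append("is_accepted should be a boolean")

    # The prompt trace holds the judge's reason and verdict.
    prompts = sample["prompts"]
    trace = prompts.get(PROMPT_KEY) if isinstance(prompts, dict) else None
    if not isinstance(trace, dict):
        problems.append(f"prompts.{PROMPT_KEY} should be an object")
    else:
        output = trace.get("prompt_output") or {}
        if not isinstance(output.get("reason"), str):
            problems.append("prompt_output.reason should be a string")
        if not isinstance(output.get("verdict"), int):
            problems.append("prompt_output.verdict should be an integer")
    return problems
```

Returning a list of messages rather than raising on the first issue makes it easy to report every problem in a record at once.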
I/O Contract
Structure
| Field | Type | Description |
|---|---|---|
| summary_accuracy | Array | Top-level key containing all annotated samples for the summary accuracy metric |
| metric_input | Object | Contains user_input (string with summarization instruction plus source text) and response (string with the generated summary) |
| metric_output | Integer (0 or 1) | The final binary verdict indicating whether the summary is accurate (1) or inaccurate (0) |
| prompts | Object | Contains the prompt trace with single_turn_aspect_critic_prompt |
| prompts.single_turn_aspect_critic_prompt.prompt_input | Object | The full input sent to the LLM judge, including user_input, response, retrieved_contexts, reference_contexts, and reference (all context/reference fields are null for summarization) |
| prompts.single_turn_aspect_critic_prompt.prompt_output | Object | The LLM judge output with reason (string explaining the assessment) and verdict (integer) |
| prompts.single_turn_aspect_critic_prompt.edited_output | Object or null | Human-edited correction with reason and verdict, or null if no correction was needed |
| is_accepted | Boolean | Whether the annotated sample was accepted as a valid training or evaluation example |
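Because edited_output is null unless a human corrected the judge, a consumer typically needs one place that decides which output to trust. The sketch below assumes that a non-null edited_output supersedes prompt_output; that precedence is inferred from the field descriptions above, not stated by the file. `effective_output` is a hypothetical helper.

```python
PROMPT_KEY = "single_turn_aspect_critic_prompt"


def effective_output(sample):
    """Return the human-edited output if present, else the judge's output.

    Assumption: a non-null edited_output takes precedence over prompt_output.
    """
    trace = sample["prompts"][PROMPT_KEY]
    edited = trace.get("edited_output")
    return edited if edited is not None else trace["prompt_output"]


def was_corrected(sample):
    """True when an annotator supplied an edited output for this sample."""
    return sample["prompts"][PROMPT_KEY].get("edited_output") is not None
```

For example, `effective_output(sample)["verdict"]` gives the verdict after any human correction, and `was_corrected` can be used to count how often annotators overrode the judge.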
Usage Examples
Loading the Data
import json

# Path is relative to the repository root
with open("docs/_static/sample_annotated_summary.json") as f:
    data = json.load(f)

# Access all summary accuracy samples
summary_samples = data["summary_accuracy"]
# Separate accepted and rejected samples
accepted = [s for s in summary_samples if s["is_accepted"]]
rejected = [s for s in summary_samples if not s["is_accepted"]]
# Count pass/fail verdicts
passing = [s for s in summary_samples if s["metric_output"] == 1]
failing = [s for s in summary_samples if s["metric_output"] == 0]
print(f"Total samples: {len(summary_samples)}")
print(f"Accepted: {len(accepted)}, Rejected: {len(rejected)}")
print(f"Passing (accurate): {len(passing)}, Failing (inaccurate): {len(failing)}")
Examining Failed Summaries
import json

# Path is relative to the repository root
with open("docs/_static/sample_annotated_summary.json") as f:
    data = json.load(f)

# Find summaries judged as inaccurate
for sample in data["summary_accuracy"]:
    if sample["metric_output"] == 0:
        reason = sample["prompts"]["single_turn_aspect_critic_prompt"]["prompt_output"]["reason"]
        # Strip the instruction prefix to recover the source text
        source_text = sample["metric_input"]["user_input"].replace("summarise given text\n", "")
        summary = sample["metric_input"]["response"]
        print(f"Source: {source_text[:80]}...")
        print(f"Summary: {summary[:80]}...")
        print(f"Failure reason: {reason}")
        print()