Implementation:Vibrantlabsai Ragas Annotated Summary Sample

From Leeroopedia
Knowledge Sources
Domains: LLM Evaluation, Sample Data, Summary Accuracy
Last Updated: 2026-02-12 00:00 GMT

Overview

A sample annotated summary dataset in JSON format used for demonstrating and testing the Ragas summary_accuracy metric with human-reviewed annotations.

Description

This file contains annotated evaluation samples organized under the key summary_accuracy. Each sample evaluates whether an LLM-generated summary accurately captures the key information from an original text passage. The dataset focuses on business and financial content including earnings reports, market analyses, supply chain discussions, marketing strategies, and regional growth trends.

Key characteristics of this data:

  • Summarization-focused evaluation: Each sample contains a user_input field with the instruction "summarise given text" followed by the source text, and a response field with the generated summary.
  • No reference answer: Unlike the answer correctness data, these samples carry no separate reference answer; the source text itself serves as the ground truth.
  • Business domain content: The passages cover topics such as Q2 earnings reports, European market expansion, supply chain challenges, marketing campaign shifts, logistics investments, and market share analysis.
  • Acceptance filtering: A notable proportion of samples have is_accepted set to false, indicating they were filtered out during the annotation process due to quality concerns or disagreements.

The dataset demonstrates how summary accuracy evaluation differs from other metric types: the judge must determine whether the summary faithfully represents the source text without omitting critical details such as specific percentages, time periods, or geographic regions.
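Because each user_input packs the instruction and the source text into one string, it can help to separate them before inspecting a sample. The following is a minimal sketch, assuming the instruction and the source text are separated by a single newline, as in the schema example on this page. The helper name split_user_input is hypothetical and is not part of Ragas.

```python
INSTRUCTION = "summarise given text"


def split_user_input(user_input):
    """Split a user_input string into (instruction, source_text).

    Assumes the instruction occupies the first line. If the first line
    is not the expected instruction, returns (None, user_input) unchanged.
    """
    instruction, separator, source_text = user_input.partition("\n")
    if not separator or instruction.strip() != INSTRUCTION:
        return None, user_input
    return instruction, source_text
```

Splitting on the first newline only keeps any later line breaks inside the source text intact.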

Usage

This data file is used in the Ragas documentation to illustrate how annotated data for the summary accuracy metric is structured. It provides reference examples for users who want to build their own annotated datasets for evaluating and aligning summarization quality metrics.
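For readers building their own annotated dataset, the sketch below assembles one record in the shape shown under Data Schema. The helper make_sample is hypothetical, and the passage, summary, and reason strings are invented placeholders rather than entries from the actual file.

```python
import json


def make_sample(source_text, summary, verdict, reason, is_accepted=True):
    """Build one record shaped like the Data Schema on this page (hypothetical helper)."""
    user_input = f"summarise given text\n{source_text}"
    return {
        "metric_input": {"user_input": user_input, "response": summary},
        "metric_output": verdict,
        "prompts": {
            "single_turn_aspect_critic_prompt": {
                "prompt_input": {
                    "user_input": user_input,
                    "response": summary,
                    # Context and reference fields stay null for summarization.
                    "retrieved_contexts": None,
                    "reference_contexts": None,
                    "reference": None,
                },
                "prompt_output": {"reason": reason, "verdict": verdict},
                "edited_output": None,  # no human correction recorded yet
            }
        },
        "is_accepted": is_accepted,
    }


# Placeholder content, invented for illustration only.
dataset = {
    "summary_accuracy": [
        make_sample(
            "Revenue rose 15% in Q2.",
            "Q2 revenue grew 15%.",
            1,
            "The figure and the period are both preserved.",
        )
    ]
}
print(json.dumps(dataset, indent=2))
```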

Code Reference

Source Location

Data Schema

{
  "summary_accuracy": [
    {
      "metric_input": {
        "user_input": "summarise given text\nThe Q2 earnings report revealed...",
        "response": "The Q2 earnings report showed a 15% revenue increase..."
      },
      "metric_output": 1,
      "prompts": {
        "single_turn_aspect_critic_prompt": {
          "prompt_input": {
            "user_input": "summarise given text\n...",
            "response": "...",
            "retrieved_contexts": null,
            "reference_contexts": null,
            "reference": null
          },
          "prompt_output": {
            "reason": "The summary accurately captures the key points...",
            "verdict": 1
          },
          "edited_output": null
        }
      },
      "is_accepted": true
    }
  ]
}

I/O Contract

Structure

Field | Type | Description
summary_accuracy | Array | Top-level key containing all annotated samples for the summary accuracy metric
metric_input | Object | Contains user_input (string with summarization instruction plus source text) and response (string with the generated summary)
metric_output | Integer (0 or 1) | The final binary verdict indicating whether the summary is accurate (1) or inaccurate (0)
prompts | Object | Contains the prompt trace with single_turn_aspect_critic_prompt
prompts.single_turn_aspect_critic_prompt.prompt_input | Object | The full input sent to the LLM judge, including user_input, response, retrieved_contexts, reference_contexts, and reference (all context/reference fields are null for summarization)
prompts.single_turn_aspect_critic_prompt.prompt_output | Object | The LLM judge output with reason (string explaining the assessment) and verdict (integer)
prompts.single_turn_aspect_critic_prompt.edited_output | Object or null | Human-edited correction with reason and verdict, or null if no correction was needed
is_accepted | Boolean | Whether the annotated sample was accepted as a valid training or evaluation example
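The field table above can be turned into a lightweight structural check. This is a sketch under the assumption that the table is a complete description of a sample. The function validate_sample is hypothetical, not a Ragas API, and it only checks types and allowed values.

```python
def validate_sample(sample):
    """Return a list of problems; an empty list means the sample matches the table."""
    problems = []

    metric_input = sample.get("metric_input")
    if not isinstance(metric_input, dict):
        problems.append("metric_input must be an object")
    else:
        for key in ("user_input", "response"):
            if not isinstance(metric_input.get(key), str):
                problems.append(f"metric_input.{key} must be a string")

    # type(...) is int rejects booleans, which Python would otherwise treat as 0/1.
    metric_output = sample.get("metric_output")
    if type(metric_output) is not int or metric_output not in (0, 1):
        problems.append("metric_output must be the integer 0 or 1")

    prompts = sample.get("prompts")
    trace = prompts.get("single_turn_aspect_critic_prompt") if isinstance(prompts, dict) else None
    if not isinstance(trace, dict):
        problems.append("prompts.single_turn_aspect_critic_prompt must be an object")
    else:
        output = trace.get("prompt_output")
        if not isinstance(output, dict) or not isinstance(output.get("reason"), str):
            problems.append("prompt_output.reason must be a string")
        edited = trace.get("edited_output")
        if edited is not None and not isinstance(edited, dict):
            problems.append("edited_output must be an object or null")

    if not isinstance(sample.get("is_accepted"), bool):
        problems.append("is_accepted must be a boolean")

    return problems
```

Returning a list of messages rather than raising on the first failure makes it easier to report every issue in a hand-edited file at once.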

Usage Examples

Loading the Data

import json

with open("docs/_static/sample_annotated_summary.json") as f:
    data = json.load(f)

# Access all summary accuracy samples
summary_samples = data["summary_accuracy"]

# Separate accepted and rejected samples
accepted = [s for s in summary_samples if s["is_accepted"]]
rejected = [s for s in summary_samples if not s["is_accepted"]]

# Count pass/fail verdicts
passing = [s for s in summary_samples if s["metric_output"] == 1]
failing = [s for s in summary_samples if s["metric_output"] == 0]

print(f"Total samples: {len(summary_samples)}")
print(f"Accepted: {len(accepted)}, Rejected: {len(rejected)}")
print(f"Passing (accurate): {len(passing)}, Failing (inaccurate): {len(failing)}")

Examining Failed Summaries

import json

with open("docs/_static/sample_annotated_summary.json") as f:
    data = json.load(f)

# Find summaries judged as inaccurate
for sample in data["summary_accuracy"]:
    if sample["metric_output"] == 0:
        reason = sample["prompts"]["single_turn_aspect_critic_prompt"]["prompt_output"]["reason"]
        # Strip only the leading instruction; requires Python 3.9+ for str.removeprefix
        source_text = sample["metric_input"]["user_input"].removeprefix("summarise given text\n")
        summary = sample["metric_input"]["response"]
        print(f"Source: {source_text[:80]}...")
        print(f"Summary: {summary[:80]}...")
        print(f"Failure reason: {reason}")
        print()
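The example above reads the judge's own reason from prompt_output. Since the field table describes edited_output as a human correction, a reader may prefer the human edit when one exists. The sketch below assumes that a non-null edited_output supersedes prompt_output. Both helper names are hypothetical.

```python
def final_judgement(sample):
    """Return the human-edited judgement if present, else the LLM judge output."""
    trace = sample["prompts"]["single_turn_aspect_critic_prompt"]
    edited = trace.get("edited_output")
    return edited if edited is not None else trace["prompt_output"]


def count_corrections(samples):
    """Count samples where a human edit changed the judge's verdict."""
    changed = 0
    for sample in samples:
        trace = sample["prompts"]["single_turn_aspect_critic_prompt"]
        edited = trace.get("edited_output")
        if edited is not None and edited["verdict"] != trace["prompt_output"]["verdict"]:
            changed += 1
    return changed
```

For instance, final_judgement(sample)["reason"] could replace the direct prompt_output lookup in the loop above, and count_corrections(data["summary_accuracy"]) gives a rough sense of how often annotators overrode the judge.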
