

Implementation:Vibrantlabsai Ragas Annotated Data Sample

From Leeroopedia
Knowledge Sources
Domains: LLM Evaluation, Sample Data, Metric Annotation
Last Updated: 2026-02-12 00:00 GMT

Overview

A sample annotated evaluation dataset in JSON format used for demonstrating and testing the Ragas helpfulness metric with human-reviewed annotations.

Description

This file contains a collection of annotated evaluation samples organized under the key helpfulness. Each sample represents a complete evaluation trace for the AspectCritic metric (specifically configured as a helpfulness judge). The dataset includes:

  • metric_input: The original user input and the LLM response being evaluated.
  • metric_output: The binary verdict (1 for helpful, 0 for not helpful).
  • prompts: The full prompt trace including single_turn_aspect_critic_prompt with the prompt input, prompt output (containing a reason and verdict), and optionally an edited_output with human-corrected reasoning.
  • is_accepted: A boolean flag indicating whether the annotated sample was accepted as a valid training/evaluation example.

The dataset covers diverse evaluation scenarios including text improvement requests, anagram solving, factual questions, vacation recommendations, and text editing tasks. It contains both positive examples (verdict = 1, response is helpful) and negative examples (verdict = 0, response is unhelpful), making it suitable for training or aligning an LLM-as-a-judge metric.
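A quick way to see this split in practice is to partition the samples by verdict. The snippet below is a minimal sketch: the in-memory records are placeholders that mimic the file's schema, not real entries from the dataset.

```python
# Illustrative in-memory samples mimicking the file's schema
# (placeholder values, not real entries from annotated_data.json).
data = {
    "helpfulness": [
        {"metric_input": {"user_input": "q1", "response": "r1"},
         "metric_output": 1, "is_accepted": True},
        {"metric_input": {"user_input": "q2", "response": "r2"},
         "metric_output": 0, "is_accepted": False},
    ]
}

# Partition into positive (helpful) and negative (unhelpful) examples
positives = [s for s in data["helpfulness"] if s["metric_output"] == 1]
negatives = [s for s in data["helpfulness"] if s["metric_output"] == 0]
```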

Usage

This data file is used in the Ragas documentation to illustrate how annotated evaluation data is structured for the metric alignment and training workflow. It serves as a reference for users who want to create their own annotated datasets for fine-tuning or calibrating Ragas metrics. The file is also referenced in tutorials that demonstrate how to align an LLM as a judge using human-annotated feedback.

Code Reference

Source Location

docs/_static/annotated_data.json

Data Schema

{
  "helpfulness": [
    {
      "metric_input": {
        "user_input": "can you fix this up better?...",
        "response": "Dear Sir,..."
      },
      "metric_output": 1,
      "prompts": {
        "single_turn_aspect_critic_prompt": {
          "prompt_input": {
            "user_input": "...",
            "response": "...",
            "retrieved_contexts": null,
            "reference_contexts": null,
            "reference": null
          },
          "prompt_output": {
            "reason": "The response effectively addresses...",
            "verdict": 1
          },
          "edited_output": {
            "reason": "The response is helpful because...",
            "verdict": 1
          }
        }
      },
      "is_accepted": true
    }
  ]
}

I/O Contract

Structure

Field | Type | Description
helpfulness | Array | Top-level key containing all annotated samples for the helpfulness metric
metric_input | Object | Contains user_input (string) and response (string), the evaluation pair
metric_output | Integer (0 or 1) | The final binary verdict for the metric evaluation
prompts | Object | Contains the prompt trace under single_turn_aspect_critic_prompt
prompts.single_turn_aspect_critic_prompt.prompt_input | Object | The full input sent to the LLM judge: user_input, response, retrieved_contexts, reference_contexts, and reference
prompts.single_turn_aspect_critic_prompt.prompt_output | Object | The original LLM judge output, with reason (string) and verdict (integer)
prompts.single_turn_aspect_critic_prompt.edited_output | Object or null | Human-edited correction of the judge output (reason and verdict), or null if no edit was made
is_accepted | Boolean | Whether the annotated sample was accepted as valid for training or evaluation
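Before using the file for training, it can be worth checking each record against this contract. The function below is a hypothetical validation helper, not part of Ragas; the field names it checks follow the schema described above.

```python
def validate_sample(sample: dict) -> list[str]:
    """Return a list of schema problems for one annotated sample
    (hypothetical helper; field names follow the I/O contract above)."""
    errors = []
    if not isinstance(sample.get("metric_input"), dict):
        errors.append("metric_input must be an object")
    if sample.get("metric_output") not in (0, 1):
        errors.append("metric_output must be 0 or 1")
    trace = sample.get("prompts", {}).get("single_turn_aspect_critic_prompt")
    if not isinstance(trace, dict):
        errors.append("missing single_turn_aspect_critic_prompt trace")
    elif trace.get("edited_output") is not None:
        # Human edits must still carry a binary verdict
        if trace["edited_output"].get("verdict") not in (0, 1):
            errors.append("edited_output.verdict must be 0 or 1")
    if not isinstance(sample.get("is_accepted"), bool):
        errors.append("is_accepted must be a boolean")
    return errors
```

An empty list means the sample conforms; otherwise the messages describe each violation.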

Usage Examples

Loading the Data

import json

with open("docs/_static/annotated_data.json") as f:
    data = json.load(f)

# Access all helpfulness samples
helpfulness_samples = data["helpfulness"]

# Filter accepted samples only
accepted = [s for s in helpfulness_samples if s["is_accepted"]]

# Filter samples with human edits
edited = [
    s for s in helpfulness_samples
    if s["prompts"]["single_turn_aspect_critic_prompt"]["edited_output"] is not None
]

print(f"Total samples: {len(helpfulness_samples)}")
print(f"Accepted samples: {len(accepted)}")
print(f"Samples with edits: {len(edited)}")
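The human edits are the most valuable part of the file for alignment, since they show where the judge's reasoning was corrected. A small sketch for pulling out (original, edited) reason pairs — `edit_pairs` is a hypothetical helper, with key names taken from the schema above:

```python
def edit_pairs(samples):
    """Yield (original_reason, edited_reason) for samples that carry a
    human edit (hypothetical helper; keys follow the schema above)."""
    for s in samples:
        trace = s["prompts"]["single_turn_aspect_critic_prompt"]
        edited = trace.get("edited_output")
        if edited is not None:
            yield trace["prompt_output"]["reason"], edited["reason"]
```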

Using with Ragas Metric Training

import json

from ragas.metrics import AspectCritic

# Load the annotated data for metric alignment
with open("docs/_static/annotated_data.json") as f:
    annotated_data = json.load(f)

# Define the judge metric that produced the annotated traces.
# The definition string here is illustrative, not taken from the file.
helpfulness = AspectCritic(
    name="helpfulness",
    definition="Is the response helpful and does it address the user's request?",
)

# annotated_data["helpfulness"] matches the structure Ragas expects
# for training and aligning LLM-as-a-judge metrics
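If you prepare few-shot demonstrations yourself rather than relying on a built-in alignment routine, accepted samples can be converted into input/output pairs, preferring the human-edited output when one exists. `to_demonstrations` is a hypothetical helper; the key names follow the schema above.

```python
def to_demonstrations(samples):
    """Build demonstration pairs from accepted annotations, preferring
    the human-edited output when present (hypothetical helper)."""
    demos = []
    for s in samples:
        if not s["is_accepted"]:
            continue  # skip samples rejected by the annotator
        trace = s["prompts"]["single_turn_aspect_critic_prompt"]
        # edited_output is None when no human correction was made
        output = trace.get("edited_output") or trace["prompt_output"]
        demos.append({"input": trace["prompt_input"], "output": output})
    return demos
```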
