
Implementation:Vibrantlabsai Ragas Edited Chain Runs Sample

From Leeroopedia
Knowledge Sources
Domains: LLM Evaluation, Sample Data, Answer Correctness
Last Updated: 2026-02-12 00:00 GMT

Overview

A sample dataset of chain execution runs in JSON format, containing annotated evaluation traces for the Ragas answer_correctness metric with human-edited judgments.

Description

This file contains annotated evaluation samples organized under the key answer_correctness. Each sample represents a complete evaluation trace for a metric that assesses whether an LLM response correctly answers a question compared to a reference answer. The dataset structure mirrors the annotation workflow used in Ragas for training and aligning metrics.

Key characteristics of this data:

  • Reference-based evaluation: Unlike the helpfulness metric, each sample includes a reference field in the metric input, providing the ground truth answer to compare against.
  • Scientific and educational content: The samples cover diverse scientific topics such as Sensory Adaptation, Evolutionary Fitness, Sediment Transport, Isostasy, Digital Computation, Quantum Decoherence, Hawking Radiation, Special Relativity, Quantum Mechanics, Abiogenesis, and more.
  • Simplified response style: The LLM responses use simplified, child-friendly language, making it easy to identify cases where oversimplification leads to factual inaccuracies.
  • Human-edited corrections: Some samples include an edited_output field where a human reviewer corrected the LLM judge's reasoning, often catching subtle factual errors missed by the initial automated judgment.

The dataset includes cases where the automated judge initially returned a passing verdict but a human reviewer identified factual errors in the response (such as confusing "water" with "air", or "expansion" with "getting smaller"). These cases illustrate why human-in-the-loop annotation matters when aligning an LLM judge.
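The relationship between the judge's output and a human correction can be sketched as a small helper. This is a minimal illustration, not part of Ragas: the function name `final_verdict` and the two example traces are hypothetical, and it assumes that a non-null `edited_output` takes precedence over `prompt_output`.

```python
def final_verdict(prompt_trace: dict) -> int:
    """Return the human-edited verdict when one exists, else the judge's verdict.

    Assumption: a non-null edited_output overrides prompt_output.
    """
    edited = prompt_trace.get("edited_output")
    if edited is not None:
        return edited["verdict"]
    return prompt_trace["prompt_output"]["verdict"]


# Hypothetical traces shaped like single_turn_aspect_critic_prompt entries.
overturned = {
    "prompt_output": {"reason": "Judge found the response accurate.", "verdict": 1},
    "is_accepted": True,
    "edited_output": {"reason": "Reviewer found a factual error.", "verdict": 0},
}
untouched = {
    "prompt_output": {"reason": "Judge found the response accurate.", "verdict": 1},
    "is_accepted": True,
    "edited_output": None,
}

print(final_verdict(overturned))  # human correction wins
print(final_verdict(untouched))   # judge verdict stands
```

Keeping the original `prompt_output` alongside the correction, rather than overwriting it, is what makes it possible to compare the judge against the reviewer later.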

Usage

This data file is used in the Ragas documentation to demonstrate how annotated chain run data is structured for the answer correctness evaluation workflow. It serves as a reference for users who want to understand how reference-based metric alignment works, particularly for scenarios where human reviewers refine the automated LLM judge's assessments.

Code Reference

Source Location

Data Schema

{
  "answer_correctness": [
    {
      "metric_input": {
        "user_input": "What is the Theory of Sensory Adaptation...?",
        "response": "The Theory of Sensory Adaptation is like...",
        "reference": "The Theory of Sensory Adaptation refers to..."
      },
      "metric_output": 1,
      "prompts": {
        "single_turn_aspect_critic_prompt": {
          "prompt_input": {
            "user_input": "...",
            "response": "...",
            "retrieved_contexts": null,
            "reference_contexts": null,
            "reference": "..."
          },
          "prompt_output": {
            "reason": "The response accurately explains...",
            "verdict": 1
          },
          "is_accepted": true,
          "edited_output": null
        }
      },
      "is_accepted": true
    }
  ]
}
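For readers who prefer typed access over nested dictionary lookups, the schema above could be mapped onto dataclasses roughly as follows. The class names `JudgeOutput` and `AnnotatedSample` and the function `parse_sample` are illustrative assumptions, not types provided by Ragas, and the example record is invented to match the schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeOutput:
    reason: str
    verdict: int


@dataclass
class AnnotatedSample:
    user_input: str
    response: str
    reference: str
    metric_output: int
    judge: JudgeOutput             # original LLM judge output
    edited: Optional[JudgeOutput]  # human correction, if any
    is_accepted: bool


def parse_sample(raw: dict) -> AnnotatedSample:
    """Flatten one raw sample (as shown in the schema) into a typed record."""
    trace = raw["prompts"]["single_turn_aspect_critic_prompt"]
    out = trace["prompt_output"]
    edited = trace.get("edited_output")
    return AnnotatedSample(
        user_input=raw["metric_input"]["user_input"],
        response=raw["metric_input"]["response"],
        reference=raw["metric_input"]["reference"],
        metric_output=raw["metric_output"],
        judge=JudgeOutput(reason=out["reason"], verdict=out["verdict"]),
        edited=(
            JudgeOutput(reason=edited["reason"], verdict=edited["verdict"])
            if edited is not None
            else None
        ),
        is_accepted=raw["is_accepted"],
    )


# Invented record that follows the schema; values are placeholders.
raw_sample = {
    "metric_input": {"user_input": "q", "response": "r", "reference": "ref"},
    "metric_output": 1,
    "prompts": {
        "single_turn_aspect_critic_prompt": {
            "prompt_input": {
                "user_input": "q",
                "response": "r",
                "retrieved_contexts": None,
                "reference_contexts": None,
                "reference": "ref",
            },
            "prompt_output": {"reason": "accurate", "verdict": 1},
            "is_accepted": True,
            "edited_output": None,
        }
    },
    "is_accepted": True,
}

sample = parse_sample(raw_sample)
print(sample.judge.verdict, sample.edited)
```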

I/O Contract

Structure

| Field | Type | Description |
|---|---|---|
| answer_correctness | Array | Top-level key containing all annotated samples for the answer correctness metric |
| metric_input | Object | Contains user_input (string), response (string), and reference (string) representing the evaluation triple |
| metric_input.reference | String | The ground truth reference answer used to evaluate correctness of the response |
| metric_output | Integer (0 or 1) | The final binary verdict indicating whether the response is correct (1) or incorrect (0) |
| prompts | Object | Contains the prompt trace with single_turn_aspect_critic_prompt |
| prompts.single_turn_aspect_critic_prompt.prompt_input | Object | The full input sent to the LLM judge, including user_input, response, retrieved_contexts, reference_contexts, and reference |
| prompts.single_turn_aspect_critic_prompt.prompt_output | Object | The original LLM judge output with reason (string) and verdict (integer) |
| prompts.single_turn_aspect_critic_prompt.is_accepted | Boolean | Whether the individual prompt-level annotation was accepted |
| prompts.single_turn_aspect_critic_prompt.edited_output | Object or null | Human-edited correction with reason and verdict, or null if no correction was needed |
| is_accepted | Boolean | Whether the overall annotated sample was accepted as valid |

Usage Examples

Loading the Data

import json

with open("docs/_static/edited_chain_runs.json") as f:
    data = json.load(f)

# Access all answer correctness samples
correctness_samples = data["answer_correctness"]

# Filter samples where human editors corrected the judge
corrected = [
    s for s in correctness_samples
    if s["prompts"]["single_turn_aspect_critic_prompt"]["edited_output"] is not None
]

# Find samples where human correction changed the verdict
verdict_changes = [
    s for s in corrected
    if s["prompts"]["single_turn_aspect_critic_prompt"]["edited_output"]["verdict"]
    != s["prompts"]["single_turn_aspect_critic_prompt"]["prompt_output"]["verdict"]
]

print(f"Total samples: {len(correctness_samples)}")
print(f"Human-corrected samples: {len(corrected)}")
print(f"Verdict changes: {len(verdict_changes)}")

Analyzing Annotation Quality

import json

with open("docs/_static/edited_chain_runs.json") as f:
    data = json.load(f)

# Print every sample whose verdict a human reviewer overturned.
for sample in data["answer_correctness"]:
    prompt = sample["prompts"]["single_turn_aspect_critic_prompt"]
    edited = prompt.get("edited_output")  # None when no correction was made
    if edited is not None and edited["verdict"] != prompt["prompt_output"]["verdict"]:
        # Truncate the question to keep the report readable.
        print(f"Question: {sample['metric_input']['user_input'][:60]}...")
        print(f"  Original verdict: {prompt['prompt_output']['verdict']}")
        print(f"  Corrected verdict: {edited['verdict']}")
        print(f"  Correction reason: {edited['reason']}")
        print()
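Beyond listing individual corrections, the same traces can be summarized into a judge-versus-reviewer agreement figure. The sketch below is illustrative: `verdict_transitions` and `agreement_rate` are hypothetical names, the three inline samples are invented stand-ins for the real file, and a sample with no `edited_output` is assumed to mean the reviewer agreed with the judge.

```python
from collections import Counter


def verdict_transitions(samples: list) -> Counter:
    """Count (judge_verdict, final_verdict) pairs across samples."""
    counts = Counter()
    for s in samples:
        trace = s["prompts"]["single_turn_aspect_critic_prompt"]
        judge = trace["prompt_output"]["verdict"]
        edited = trace.get("edited_output")
        # Assumption: no edit means the reviewer kept the judge's verdict.
        final = edited["verdict"] if edited is not None else judge
        counts[(judge, final)] += 1
    return counts


def agreement_rate(counts: Counter) -> float:
    """Fraction of samples where the judge's verdict survived review."""
    total = sum(counts.values())
    agreed = sum(n for (judge, final), n in counts.items() if judge == final)
    return agreed / total if total else 0.0


def _trace(judge_verdict, edited_verdict=None):
    # Helper that builds a minimal invented sample for the demo.
    edited = None
    if edited_verdict is not None:
        edited = {"reason": "edited", "verdict": edited_verdict}
    return {
        "prompts": {
            "single_turn_aspect_critic_prompt": {
                "prompt_output": {"reason": "judge", "verdict": judge_verdict},
                "edited_output": edited,
            }
        }
    }


demo_samples = [_trace(1), _trace(1, 0), _trace(0)]
counts = verdict_transitions(demo_samples)
print(counts[(1, 0)])          # passes overturned by a reviewer
print(agreement_rate(counts))  # share of verdicts left unchanged
```

To run this on the real file, pass `data["answer_correctness"]` from the loading example in place of `demo_samples`.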
