Implementation:Vibrantlabsai Ragas Edited Chain Runs Sample

Knowledge Sources	Vibrantlabsai_Ragas
Domains	LLM Evaluation, Sample Data, Answer Correctness
Last Updated	2026-02-12 00:00 GMT

Overview

A sample dataset of chain execution runs in JSON format, containing annotated evaluation traces for the Ragas answer_correctness metric with human-edited judgments.

Description

This file contains annotated evaluation samples organized under the key answer_correctness. Each sample represents a complete evaluation trace for a metric that assesses whether an LLM response correctly answers a question compared to a reference answer. The dataset structure mirrors the annotation workflow used in Ragas for training and aligning metrics.

Key characteristics of this data:

Reference-based evaluation: Unlike the helpfulness metric, each sample includes a reference field in the metric input, providing the ground truth answer to compare against.
Scientific and educational content: The samples cover diverse scientific topics such as Sensory Adaptation, Evolutionary Fitness, Sediment Transport, Isostasy, Digital Computation, Quantum Decoherence, Hawking Radiation, Special Relativity, Quantum Mechanics, Abiogenesis, and more.
Simplified response style: The LLM responses use simplified, child-friendly language, making it easy to identify cases where oversimplification leads to factual inaccuracies.
Human-edited corrections: Some samples include an edited_output field where a human reviewer corrected the LLM judge's reasoning, often catching subtle factual errors missed by the initial automated judgment.

The dataset demonstrates cases where the automated judge initially gave a passing verdict but a human reviewer correctly identified factual errors (such as confusing "water" with "air" or "expansion" with "getting smaller"), showing the importance of human-in-the-loop annotation.
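The relationship between the judge's original output and a human correction can be sketched as a small helper. This is an illustration rather than part of Ragas: the name `effective_judgment` and the inline sample are hypothetical, and the rule that a non-null `edited_output` supersedes `prompt_output` is an assumption based on the field descriptions on this page.

```python
from typing import Any, Dict

PROMPT_KEY = "single_turn_aspect_critic_prompt"


def effective_judgment(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Return the human-edited judgment if one exists, else the judge's output.

    Assumption: a non-null edited_output takes precedence over prompt_output.
    """
    trace = sample["prompts"][PROMPT_KEY]
    edited = trace.get("edited_output")
    return edited if edited is not None else trace["prompt_output"]


# Hypothetical sample (placeholder text, not taken from the real file):
# the judge passed the response, then a reviewer overturned the verdict.
example = {
    "prompts": {
        PROMPT_KEY: {
            "prompt_output": {"reason": "...", "verdict": 1},
            "is_accepted": True,
            "edited_output": {"reason": "...", "verdict": 0},
        }
    }
}

print(effective_judgment(example)["verdict"])  # prints 0
```

For samples where `edited_output` is null, the helper falls back to the judge's own `reason` and `verdict`.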

Usage

This data file is used in the Ragas documentation to demonstrate how annotated chain run data is structured for the answer correctness evaluation workflow. It serves as a reference for users who want to understand how reference-based metric alignment works, particularly for scenarios where human reviewers refine the automated LLM judge's assessments.

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: docs/_static/edited_chain_runs.json

Data Schema

{
  "answer_correctness": [
    {
      "metric_input": {
        "user_input": "What is the Theory of Sensory Adaptation...?",
        "response": "The Theory of Sensory Adaptation is like...",
        "reference": "The Theory of Sensory Adaptation refers to..."
      },
      "metric_output": 1,
      "prompts": {
        "single_turn_aspect_critic_prompt": {
          "prompt_input": {
            "user_input": "...",
            "response": "...",
            "retrieved_contexts": null,
            "reference_contexts": null,
            "reference": "..."
          },
          "prompt_output": {
            "reason": "The response accurately explains...",
            "verdict": 1
          },
          "is_accepted": true,
          "edited_output": null
        }
      },
      "is_accepted": true
    }
  ]
}
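For readers who prefer typed access, the schema above can be mirrored with `TypedDict` definitions. This is a sketch only: the class names are hypothetical, and the element type of `retrieved_contexts` and `reference_contexts` is assumed to be a list of strings, since the schema shows only null for both.

```python
from typing import List, Optional, TypedDict


class MetricInput(TypedDict):
    user_input: str
    response: str
    reference: str


class PromptInput(TypedDict):
    user_input: str
    response: str
    # Assumption: lists of strings when present; the schema only shows null.
    retrieved_contexts: Optional[List[str]]
    reference_contexts: Optional[List[str]]
    reference: str


class Judgment(TypedDict):
    reason: str
    verdict: int


class PromptTrace(TypedDict):
    prompt_input: PromptInput
    prompt_output: Judgment
    is_accepted: bool
    edited_output: Optional[Judgment]  # null when no correction was made


class Prompts(TypedDict):
    single_turn_aspect_critic_prompt: PromptTrace


class AnnotatedSample(TypedDict):
    metric_input: MetricInput
    metric_output: int
    prompts: Prompts
    is_accepted: bool


class EditedChainRuns(TypedDict):
    answer_correctness: List[AnnotatedSample]
```

These types do nothing at runtime. They only let a static type checker flag misspelled keys when the loaded JSON is annotated as `EditedChainRuns`.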

I/O Contract

Structure

Field	Type	Description
answer_correctness	Array	Top-level key containing all annotated samples for the answer correctness metric
metric_input	Object	Contains `user_input` (string), `response` (string), and `reference` (string) representing the evaluation triple
metric_input.reference	String	The ground truth reference answer used to evaluate correctness of the response
metric_output	Integer (0 or 1)	The final binary verdict indicating whether the response is correct (1) or incorrect (0)
prompts	Object	Contains the prompt trace with `single_turn_aspect_critic_prompt`
prompts.single_turn_aspect_critic_prompt.prompt_input	Object	The full input sent to the LLM judge, including `user_input`, `response`, `retrieved_contexts`, `reference_contexts`, and `reference`
prompts.single_turn_aspect_critic_prompt.prompt_output	Object	The original LLM judge output with `reason` (string) and `verdict` (integer)
prompts.single_turn_aspect_critic_prompt.is_accepted	Boolean	Whether the individual prompt-level annotation was accepted
prompts.single_turn_aspect_critic_prompt.edited_output	Object or null	Human-edited correction with `reason` and `verdict`, or null if no correction was needed
is_accepted	Boolean	Whether the overall annotated sample was accepted as valid
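The contract in the table above can be checked with a short validator. This is a minimal sketch under stated assumptions: `validate_sample` is a hypothetical helper, not a Ragas function, and it checks only the types listed in the table.

```python
from typing import Any, Dict, List

PROMPT_KEY = "single_turn_aspect_critic_prompt"


def _is_judgment(obj: Any) -> bool:
    """True if obj looks like {"reason": <str>, "verdict": <int>}."""
    return (
        isinstance(obj, dict)
        and isinstance(obj.get("reason"), str)
        and type(obj.get("verdict")) is int  # excludes bool and float
    )


def validate_sample(sample: Dict[str, Any]) -> List[str]:
    """Return a list of contract violations; an empty list means none found."""
    problems: List[str] = []

    metric_input = sample.get("metric_input")
    if not isinstance(metric_input, dict):
        problems.append("metric_input must be an object")
    else:
        for key in ("user_input", "response", "reference"):
            if not isinstance(metric_input.get(key), str):
                problems.append(f"metric_input.{key} must be a string")

    metric_output = sample.get("metric_output")
    if type(metric_output) is not int or metric_output not in (0, 1):
        problems.append("metric_output must be the integer 0 or 1")

    if not isinstance(sample.get("is_accepted"), bool):
        problems.append("is_accepted must be a boolean")

    prompts = sample.get("prompts")
    trace = prompts.get(PROMPT_KEY) if isinstance(prompts, dict) else None
    if not isinstance(trace, dict):
        problems.append(f"prompts.{PROMPT_KEY} must be an object")
        return problems

    if not isinstance(trace.get("prompt_input"), dict):
        problems.append("prompt_input must be an object")
    if not _is_judgment(trace.get("prompt_output")):
        problems.append("prompt_output must have a string reason and integer verdict")
    if not isinstance(trace.get("is_accepted"), bool):
        problems.append("prompt-level is_accepted must be a boolean")
    edited = trace.get("edited_output")
    if edited is not None and not _is_judgment(edited):
        problems.append("edited_output must be null or a reason/verdict object")

    return problems
```

Running `validate_sample` over every entry in `data["answer_correctness"]` and printing any non-empty results is one way to confirm that a locally modified copy of the file still matches the documented structure.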

Usage Examples

Loading the Data

import json

# The path is relative to the root of the Ragas repository
with open("docs/_static/edited_chain_runs.json", encoding="utf-8") as f:
    data = json.load(f)

# Access all answer correctness samples
correctness_samples = data["answer_correctness"]

# Filter samples where human editors corrected the judge
corrected = [
    s for s in correctness_samples
    if s["prompts"]["single_turn_aspect_critic_prompt"]["edited_output"] is not None
]

# Find samples where human correction changed the verdict
verdict_changes = [
    s for s in corrected
    if s["prompts"]["single_turn_aspect_critic_prompt"]["edited_output"]["verdict"]
    != s["prompts"]["single_turn_aspect_critic_prompt"]["prompt_output"]["verdict"]
]

print(f"Total samples: {len(correctness_samples)}")
print(f"Human-corrected samples: {len(corrected)}")
print(f"Verdict changes: {len(verdict_changes)}")

Analyzing Annotation Quality

import json

# The path is relative to the root of the Ragas repository
with open("docs/_static/edited_chain_runs.json", encoding="utf-8") as f:
    data = json.load(f)

for sample in data["answer_correctness"]:
    prompt = sample["prompts"]["single_turn_aspect_critic_prompt"]
    edited = prompt.get("edited_output")
    if edited and edited["verdict"] != prompt["prompt_output"]["verdict"]:
        print(f"Question: {sample['metric_input']['user_input'][:60]}...")
        print(f"  Original verdict: {prompt['prompt_output']['verdict']}")
        print(f"  Corrected verdict: {edited['verdict']}")
        print(f"  Correction reason: {edited['reason']}")
        print()

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
