Implementation:Vibrantlabsai Ragas Edited Chain Runs Sample
| Knowledge Sources | |
|---|---|
| Domains | LLM Evaluation, Sample Data, Answer Correctness |
| Last Updated | 2026-02-12 00:00 GMT |
Overview
A sample dataset of chain execution runs in JSON format, containing annotated evaluation traces for the Ragas answer_correctness metric with human-edited judgments.
Description
This file contains annotated evaluation samples organized under the key answer_correctness. Each sample represents a complete evaluation trace for a metric that assesses whether an LLM response correctly answers a question compared to a reference answer. The dataset structure mirrors the annotation workflow used in Ragas for training and aligning metrics.
Key characteristics of this data:
- Reference-based evaluation: Unlike the helpfulness metric, each sample includes a reference field in the metric input, providing the ground truth answer to compare against.
- Scientific and educational content: The samples cover diverse scientific topics such as Sensory Adaptation, Evolutionary Fitness, Sediment Transport, Isostasy, Digital Computation, Quantum Decoherence, Hawking Radiation, Special Relativity, Quantum Mechanics, Abiogenesis, and more.
- Simplified response style: The LLM responses use simplified, child-friendly language, making it easy to identify cases where oversimplification leads to factual inaccuracies.
- Human-edited corrections: Some samples include an edited_output field where a human reviewer corrected the LLM judge's reasoning, often catching subtle factual errors missed by the initial automated judgment.
The dataset demonstrates cases where the automated judge initially gave a passing verdict but a human reviewer correctly identified factual errors (such as confusing "water" with "air" or "expansion" with "getting smaller"), showing the importance of human-in-the-loop annotation.
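The relationship between the judge's output and a human correction can be sketched as a small helper that prefers edited_output when it is present and falls back to prompt_output otherwise. This is a minimal illustration, not Ragas code: the function name effective_output and the sample trace are hypothetical, and the reason strings are invented placeholders rather than text from the file.

```python
def effective_output(prompt_trace):
    """Return the human-edited output if one exists, else the judge's original output."""
    edited = prompt_trace.get("edited_output")
    return edited if edited is not None else prompt_trace["prompt_output"]


# Hypothetical trace, invented for illustration (not taken from the data file).
trace = {
    "prompt_output": {"reason": "Judge considered the response accurate.", "verdict": 1},
    "is_accepted": True,
    "edited_output": {"reason": "Reviewer found a factual error.", "verdict": 0},
}

final = effective_output(trace)
print(final["verdict"], final["reason"])
```

Whether metric_output in a given sample already reflects the edited verdict is not stated here, so the helper reads only the prompt-level fields.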
Usage
This data file is used in the Ragas documentation to demonstrate how annotated chain run data is structured for the answer correctness evaluation workflow. It serves as a reference for users who want to understand how reference-based metric alignment works, particularly for scenarios where human reviewers refine the automated LLM judge's assessments.
Code Reference
Source Location
- Repository: Vibrantlabsai_Ragas
- File: docs/_static/edited_chain_runs.json
Data Schema
{
  "answer_correctness": [
    {
      "metric_input": {
        "user_input": "What is the Theory of Sensory Adaptation...?",
        "response": "The Theory of Sensory Adaptation is like...",
        "reference": "The Theory of Sensory Adaptation refers to..."
      },
      "metric_output": 1,
      "prompts": {
        "single_turn_aspect_critic_prompt": {
          "prompt_input": {
            "user_input": "...",
            "response": "...",
            "retrieved_contexts": null,
            "reference_contexts": null,
            "reference": "..."
          },
          "prompt_output": {
            "reason": "The response accurately explains...",
            "verdict": 1
          },
          "is_accepted": true,
          "edited_output": null
        }
      },
      "is_accepted": true
    }
  ]
}
I/O Contract
Structure
| Field | Type | Description |
|---|---|---|
| answer_correctness | Array | Top-level key containing all annotated samples for the answer correctness metric |
| metric_input | Object | Contains user_input (string), response (string), and reference (string) representing the evaluation triple |
| metric_input.reference | String | The ground truth reference answer used to evaluate correctness of the response |
| metric_output | Integer (0 or 1) | The final binary verdict indicating whether the response is correct (1) or incorrect (0) |
| prompts | Object | Contains the prompt trace with single_turn_aspect_critic_prompt |
| prompts.single_turn_aspect_critic_prompt.prompt_input | Object | The full input sent to the LLM judge, including user_input, response, retrieved_contexts, reference_contexts, and reference |
| prompts.single_turn_aspect_critic_prompt.prompt_output | Object | The original LLM judge output with reason (string) and verdict (integer) |
| prompts.single_turn_aspect_critic_prompt.is_accepted | Boolean | Whether the individual prompt-level annotation was accepted |
| prompts.single_turn_aspect_critic_prompt.edited_output | Object or null | Human-edited correction with reason and verdict, or null if no correction was needed |
| is_accepted | Boolean | Whether the overall annotated sample was accepted as valid |
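The field types in the table above can be checked with a small validator. The sketch below is one possible reading of the table, not a schema shipped with Ragas: the names is_int and validate_sample are hypothetical, and the example sample reuses the elided placeholder strings from the Data Schema section.

```python
def is_int(value):
    """True for real integers; bool is excluded so JSON true/false do not pass as verdicts."""
    return isinstance(value, int) and not isinstance(value, bool)


def validate_sample(sample):
    """Return a list of problems; an empty list means the sample matches the table."""
    problems = []

    metric_input = sample.get("metric_input")
    if not isinstance(metric_input, dict):
        problems.append("metric_input must be an object")
    else:
        for key in ("user_input", "response", "reference"):
            if not isinstance(metric_input.get(key), str):
                problems.append(f"metric_input.{key} must be a string")

    metric_output = sample.get("metric_output")
    if not (is_int(metric_output) and metric_output in (0, 1)):
        problems.append("metric_output must be 0 or 1")

    if not isinstance(sample.get("is_accepted"), bool):
        problems.append("is_accepted must be a boolean")

    prompts = sample.get("prompts")
    trace = prompts.get("single_turn_aspect_critic_prompt") if isinstance(prompts, dict) else None
    if not isinstance(trace, dict):
        problems.append("prompts.single_turn_aspect_critic_prompt must be an object")
        return problems

    # prompt_output is required; edited_output may be null when no correction was made.
    for key, required in (("prompt_output", True), ("edited_output", False)):
        output = trace.get(key)
        if output is None and not required:
            continue
        if not (isinstance(output, dict)
                and isinstance(output.get("reason"), str)
                and is_int(output.get("verdict"))):
            problems.append(f"{key} must be an object with reason (string) and verdict (integer)")
    return problems


# Example sample mirroring the Data Schema section (strings are elided placeholders).
example = {
    "metric_input": {"user_input": "...", "response": "...", "reference": "..."},
    "metric_output": 1,
    "prompts": {
        "single_turn_aspect_critic_prompt": {
            "prompt_input": {"user_input": "...", "response": "...", "reference": "..."},
            "prompt_output": {"reason": "The response accurately explains...", "verdict": 1},
            "is_accepted": True,
            "edited_output": None,
        }
    },
    "is_accepted": True,
}

print(validate_sample(example))
```

Returning a list of messages rather than raising on the first failure makes it easier to report every malformed field in a sample at once.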
Usage Examples
Loading the Data
import json

with open("docs/_static/edited_chain_runs.json") as f:
    data = json.load(f)

# Access all answer correctness samples
correctness_samples = data["answer_correctness"]

# Filter samples where human editors corrected the judge
corrected = [
    s for s in correctness_samples
    if s["prompts"]["single_turn_aspect_critic_prompt"]["edited_output"] is not None
]

# Find samples where human correction changed the verdict
verdict_changes = [
    s for s in corrected
    if s["prompts"]["single_turn_aspect_critic_prompt"]["edited_output"]["verdict"]
    != s["prompts"]["single_turn_aspect_critic_prompt"]["prompt_output"]["verdict"]
]

print(f"Total samples: {len(correctness_samples)}")
print(f"Human-corrected samples: {len(corrected)}")
print(f"Verdict changes: {len(verdict_changes)}")
Analyzing Annotation Quality
import json

with open("docs/_static/edited_chain_runs.json") as f:
    data = json.load(f)

# Print every sample where the human reviewer overturned the judge's verdict
for sample in data["answer_correctness"]:
    prompt = sample["prompts"]["single_turn_aspect_critic_prompt"]
    edited = prompt.get("edited_output")
    if edited and edited["verdict"] != prompt["prompt_output"]["verdict"]:
        print(f"Question: {sample['metric_input']['user_input'][:60]}...")
        print(f"  Original verdict: {prompt['prompt_output']['verdict']}")
        print(f"  Corrected verdict: {edited['verdict']}")
        print(f"  Correction reason: {edited['reason']}")
        print()
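Summarizing Verdict Transitions

Beyond listing individual corrections, it can help to count how often each (judge verdict, reviewer verdict) pair occurs among the human-edited samples. The sketch below is an illustration under stated assumptions: verdict_transitions and make_sample are hypothetical names, and the demo samples are synthetic stand-ins built in memory so the block runs without the data file.

```python
from collections import Counter


def verdict_transitions(samples):
    """Count (judge verdict, reviewer verdict) pairs across human-edited samples."""
    counts = Counter()
    for sample in samples:
        trace = sample["prompts"]["single_turn_aspect_critic_prompt"]
        edited = trace.get("edited_output")
        if edited is None:
            continue  # no human correction recorded for this sample
        counts[(trace["prompt_output"]["verdict"], edited["verdict"])] += 1
    return counts


def make_sample(judge, reviewer):
    """Build a synthetic sample with only the fields this summary reads."""
    edited = None if reviewer is None else {"reason": "placeholder", "verdict": reviewer}
    return {
        "prompts": {
            "single_turn_aspect_critic_prompt": {
                "prompt_output": {"reason": "placeholder", "verdict": judge},
                "edited_output": edited,
            }
        }
    }


# Synthetic data: two overturned passes, one unedited pass, one edited-but-unchanged fail.
demo = [make_sample(1, 0), make_sample(1, 0), make_sample(1, None), make_sample(0, 0)]
transitions = verdict_transitions(demo)
for (judge, reviewer), n in sorted(transitions.items()):
    print(f"judge={judge} -> reviewer={reviewer}: {n}")
```

To run it on the real file, pass data["answer_correctness"] from the loading example instead of demo.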