Implementation: OpenBMB UltraFeedback Annotation Data Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
Concrete tool for loading completion data from local JSON files or HuggingFace Hub for the annotation pipeline.
Description
Three annotation scripts use different loading patterns:
annotate_critique.py (L89-97): Loads from annotation/{subset}.json using json.load → pd.DataFrame → datasets.Dataset.from_pandas. Iterates over 6 subsets.
annotate_preference.py (L147-153): Loads from ../comparison_data_generation/completion_data/{subset}.json using json.load → pd.DataFrame. Does not convert to HuggingFace Dataset; iterates directly over the DataFrame.
fix_overall_score_issue.py (L107-110): Loads from HuggingFace Hub using load_dataset("openbmb/UltraFeedback")["train"]. Operates on the published dataset.
Usage
Each script is run independently. The critique script processes all 6 subsets in sequence. The preference script processes configured subsets. The fix script operates on the full published dataset.
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/data_annotation/annotate_critique.py (Lines 89-97)
- File: src/data_annotation/annotate_preference.py (Lines 147-153)
- File: src/data_annotation/fix_overall_score_issue.py (Lines 107-110)
Signature
# Critique annotation loading (annotate_critique.py:L89-97)
subsets = ["sharegpt", "flan", "evol_instruct", "ultrachat", "truthful_qa", "false_qa"]
for subset in subsets:
    with open(os.path.join("annotation", subset + ".json"), "r") as f:
        dataset = json.load(f)
    dataset = pd.DataFrame(dataset)
    dataset = datasets.Dataset.from_pandas(dataset)
# Preference annotation loading (annotate_preference.py:L147-153)
for subset in subsets:
    with open(os.path.join("../comparison_data_generation", "completion_data",
                           subset + ".json"), "r") as f:
        dataset = json.load(f)
    dataset = pd.DataFrame(dataset)
# Score correction loading (fix_overall_score_issue.py:L107-110)
from datasets import load_dataset
dataset = load_dataset("openbmb/UltraFeedback")["train"]
Import
import json
import os
import pandas as pd
import datasets
from datasets import load_dataset # for Hub loading
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| subset | str | Yes | Dataset subset name (for local loading) |
| JSON path | str | Yes | Path to local JSON file (critique: annotation/{subset}.json, preference: ../comparison_data_generation/completion_data/{subset}.json) |
| HuggingFace ID | str | No | Dataset ID on HuggingFace Hub ("openbmb/UltraFeedback") for score correction |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Union[datasets.Dataset, pd.DataFrame] | Loaded data with instruction, completions (each with model, principle, custom_system_prompt, response) |
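A loaded record can be inspected against the schema in the table above. The sketch below builds a hypothetical in-memory record with the documented field names instead of reading a real file; the actual files may carry additional keys.

```python
import pandas as pd

# Hypothetical record mirroring the documented schema; real data comes
# from the local JSON files or the published Hub dataset.
records = [
    {
        "instruction": "Explain photosynthesis.",
        "completions": [
            {
                "model": "gpt-3.5-turbo",
                "principle": "helpfulness",
                "custom_system_prompt": "Be concise.",
                "response": "Photosynthesis converts light into chemical energy.",
            }
        ],
    }
]

df = pd.DataFrame(records)
row = df.iloc[0]
print(row["instruction"])
for completion in row["completions"]:
    # Each completion carries the model that produced it plus its response.
    print(completion["model"], "->", completion["response"][:40])
```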
Usage Examples
Critique Annotation Loading
import json
import os
import pandas as pd
import datasets
subset = "sharegpt"
with open(os.path.join("annotation", subset + ".json"), "r") as f:
    data = json.load(f)
dataset = pd.DataFrame(data)
dataset = datasets.Dataset.from_pandas(dataset)
print(len(dataset))
print(dataset[0].keys()) # instruction, completions, ...
HuggingFace Hub Loading
from datasets import load_dataset
dataset = load_dataset("openbmb/UltraFeedback")["train"]
print(len(dataset)) # Full published dataset
print(dataset[0]["completions"][0].keys())
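Preference Annotation Loading
The preference script's pattern (JSON → DataFrame, with no HuggingFace Dataset conversion) can be exercised the same way. To keep this sketch self-contained it first writes a tiny stand-in JSON file; in the real pipeline the files under ../comparison_data_generation/completion_data/ are produced by the comparison data generation step.

```python
import json
import os
import tempfile

import pandas as pd

# Stand-in for ../comparison_data_generation/completion_data/{subset}.json.
tmpdir = tempfile.mkdtemp()
subset = "sharegpt"
path = os.path.join(tmpdir, subset + ".json")
with open(path, "w") as f:
    json.dump([{"instruction": "Hi", "completions": []}], f)

# Loading pattern from annotate_preference.py: the data stays a DataFrame
# and the script iterates over it directly.
with open(path, "r") as f:
    data = json.load(f)
dataset = pd.DataFrame(data)
for _, example in dataset.iterrows():
    print(example["instruction"])
```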