Implementation: OpenBMB UltraFeedback Annotation Data Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
Concrete tool for loading completion data from local JSON files or HuggingFace Hub for the annotation pipeline.
Description
Three annotation scripts use different loading patterns:
annotate_critique.py (L89-97): Loads from annotation/{subset}.json using json.load → pd.DataFrame → datasets.Dataset.from_pandas. Iterates over 6 subsets.
annotate_preference.py (L147-153): Loads from ../comparison_data_generation/completion_data/{subset}.json using json.load → pd.DataFrame. Does not convert to HuggingFace Dataset; iterates directly over the DataFrame.
fix_overall_score_issue.py (L107-110): Loads from HuggingFace Hub using load_dataset("openbmb/UltraFeedback")["train"]. Operates on the published dataset.
Usage
Each script is run independently. The critique script processes all 6 subsets in sequence. The preference script processes configured subsets. The fix script operates on the full published dataset.
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/data_annotation/annotate_critique.py (Lines 89-97)
- File: src/data_annotation/annotate_preference.py (Lines 147-153)
- File: src/data_annotation/fix_overall_score_issue.py (Lines 107-110)
Signature
# Critique annotation loading (annotate_critique.py:L89-97)
subsets = ["sharegpt", "flan", "evol_instruct", "ultrachat", "truthful_qa", "false_qa"]
for subset in subsets:
    with open(os.path.join("annotation", subset + ".json"), "r") as f:
        dataset = json.load(f)
    dataset = pd.DataFrame(dataset)
    dataset = datasets.Dataset.from_pandas(dataset)
# Preference annotation loading (annotate_preference.py:L147-153)
for subset in subsets:
    with open(os.path.join("../comparison_data_generation", "completion_data",
                           subset + ".json"), "r") as f:
        dataset = json.load(f)
    dataset = pd.DataFrame(dataset)
# Score correction loading (fix_overall_score_issue.py:L107-110)
from datasets import load_dataset
dataset = load_dataset("openbmb/UltraFeedback")["train"]
Import
import json
import os
import pandas as pd
import datasets
from datasets import load_dataset # for Hub loading
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| subset | str | Yes | Dataset subset name (for local loading) |
| JSON path | str | Yes | Path to local JSON file (critique: annotation/{subset}.json, preference: ../comparison_data_generation/completion_data/{subset}.json) |
| HuggingFace ID | str | No | Dataset ID on HuggingFace Hub ("openbmb/UltraFeedback") for score correction |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | Union[datasets.Dataset, pd.DataFrame] | Loaded data with instruction, completions (each with model, principle, custom_system_prompt, response) |
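A loaded record can be inspected against the schema in the table above. The sketch below builds a hypothetical in-memory record with the documented field names instead of reading a real file; the actual files may carry additional keys.

```python
import pandas as pd

# Hypothetical record mirroring the documented schema; real data comes
# from the local JSON files or the published Hub dataset.
records = [
    {
        "instruction": "Explain photosynthesis.",
        "completions": [
            {
                "model": "gpt-3.5-turbo",
                "principle": "helpfulness",
                "custom_system_prompt": "Be concise.",
                "response": "Photosynthesis converts light into chemical energy.",
            }
        ],
    }
]

df = pd.DataFrame(records)
row = df.iloc[0]
print(row["instruction"])
for completion in row["completions"]:
    # Each completion carries the model that produced it plus its response.
    print(completion["model"], "->", completion["response"][:40])
```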
Usage Examples
Critique Annotation Loading
import json
import os
import pandas as pd
import datasets
subset = "sharegpt"
with open(os.path.join("annotation", subset + ".json"), "r") as f:
    data = json.load(f)
dataset = pd.DataFrame(data)
dataset = datasets.Dataset.from_pandas(dataset)
print(len(dataset))
print(dataset[0].keys()) # instruction, completions, ...
HuggingFace Hub Loading
from datasets import load_dataset
dataset = load_dataset("openbmb/UltraFeedback")["train"]
print(len(dataset)) # Full published dataset
print(dataset[0]["completions"][0].keys())
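Preference Annotation Loading
The preference script's pattern (JSON → DataFrame, with no HuggingFace Dataset conversion) can be exercised the same way. To keep this sketch self-contained it first writes a tiny stand-in JSON file; in the real pipeline the files under ../comparison_data_generation/completion_data/ are produced by the comparison data generation step.

```python
import json
import os
import tempfile

import pandas as pd

# Stand-in for ../comparison_data_generation/completion_data/{subset}.json.
tmpdir = tempfile.mkdtemp()
subset = "sharegpt"
path = os.path.join(tmpdir, subset + ".json")
with open(path, "w") as f:
    json.dump([{"instruction": "Hi", "completions": []}], f)

# Loading pattern from annotate_preference.py: the data stays a DataFrame
# and the script iterates over it directly.
with open(path, "r") as f:
    data = json.load(f)
dataset = pd.DataFrame(data)
for _, example in dataset.iterrows():
    print(example["instruction"])
```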