Implementation: OpenBMB UltraFeedback Instruction Data Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
Concrete tool for loading instruction data from JSON files into HuggingFace Dataset objects, as used in the UltraFeedback generation pipeline.
Description
The instruction data loading pattern in UltraFeedback uses Python's built-in json.load to read pre-prepared JSON files, converts them to a pandas DataFrame, and then wraps the result in a HuggingFace datasets.Dataset object. This two-step conversion (JSON → DataFrame → Dataset) is used because the JSON files contain lists of dictionaries that map naturally to tabular data, while the HuggingFace Dataset provides efficient .map() functionality for downstream batch processing.
The HuggingFace Transformers pipeline (main.py) additionally supports sharding via .select(range(start, end)) for distributing work across multiple processes.
Usage
Import this pattern when you need to load the pre-prepared instruction JSON files for the UltraFeedback completion generation pipeline. Both the HuggingFace backend (main.py) and the vLLM backend (main_vllm.py) use this same loading pattern.
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/comparison_data_generation/main.py (Lines 246-248)
- File: src/comparison_data_generation/main_vllm.py (Lines 210-214)
- File: src/comparison_data_generation/sampling.py (Lines 16-22)
Signature
# Pattern used in main.py (HuggingFace backend)
load_path = f"./completion_data/{subset}.json"
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset)).select(
range(id * 2000, min((id + 1) * 2000, len(dataset)))
)
# Pattern used in main_vllm.py (vLLM backend)
load_path = f"./completion_data/{subset}.json"
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset))
# Pattern used in sampling.py
dataset = pd.read_json(f"./completion_data/{subset}.json", lines=True)
dataset = Dataset.from_pandas(pd.DataFrame(dataset))
Import
import json
import pandas as pd
import datasets
# or: from datasets import Dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| load_path | str | Yes | Path to JSON file, e.g. "./completion_data/sharegpt.json" |
| subset | str | Yes | Dataset subset name (sharegpt, flan, evol_instruct, ultrachat, truthful_qa, false_qa) |
| id | int | No | Shard ID for parallel processing (HF backend only). Selects rows [id*2000, min((id+1)*2000, len(dataset))) |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | datasets.Dataset | HuggingFace Dataset with an instruction (str) field, plus any source-specific fields (e.g. correct_answers and incorrect_answers for truthful_qa) |
Usage Examples
Basic Loading (vLLM Backend)
import json
import pandas as pd
import datasets
subset = "sharegpt"
load_path = f"./completion_data/{subset}.json"
# Load JSON → DataFrame → HuggingFace Dataset
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset))
print(len(dataset)) # Number of instructions
print(dataset[0]["instruction"]) # First instruction text
Sharded Loading (HuggingFace Backend)
import json
import pandas as pd
import datasets
subset = "truthful_qa"
shard_id = 0
load_path = f"./completion_data/{subset}.json"
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset)).select(
range(shard_id * 2000, min((shard_id + 1) * 2000, len(dataset)))
)
print(f"Shard {shard_id}: {len(dataset)} examples")