Implementation: OpenBMB UltraFeedback Instruction Data Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
Concrete tool for loading instruction data from JSON files into HuggingFace Dataset objects, as used in the UltraFeedback generation pipeline.
Description
The instruction data loading pattern in UltraFeedback uses Python's built-in json.load to read pre-prepared JSON files, converts them to a pandas DataFrame, and then wraps the result in a HuggingFace datasets.Dataset object. This two-step conversion (JSON → DataFrame → Dataset) is used because the JSON files contain lists of dictionaries that map naturally to tabular data, while the HuggingFace Dataset provides efficient .map() functionality for downstream batch processing.
The HuggingFace Transformers pipeline (main.py) additionally supports sharding via .select(range(start, end)) for distributing work across multiple processes.
Usage
Import this pattern when you need to load the pre-prepared instruction JSON files for the UltraFeedback completion generation pipeline. Both the HuggingFace backend (main.py) and the vLLM backend (main_vllm.py) use this same loading pattern.
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/comparison_data_generation/main.py (Lines 246-248)
- File: src/comparison_data_generation/main_vllm.py (Lines 210-214)
- File: src/comparison_data_generation/sampling.py (Lines 16-22)
Signature
# Pattern used in main.py (HuggingFace backend)
load_path = f"./completion_data/{subset}.json"
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset)).select(
range(id * 2000, min((id + 1) * 2000, len(dataset)))
)
# Pattern used in main_vllm.py (vLLM backend)
load_path = f"./completion_data/{subset}.json"
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset))
# Pattern used in sampling.py
dataset = pd.read_json(f"./completion_data/{subset}.json", lines=True)
dataset = Dataset.from_pandas(pd.DataFrame(dataset))
Import
import json
import pandas as pd
import datasets
# or: from datasets import Dataset
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| load_path | str | Yes | Path to JSON file, e.g. "./completion_data/sharegpt.json" |
| subset | str | Yes | Dataset subset name (sharegpt, flan, evol_instruct, ultrachat, truthful_qa, false_qa) |
| id | int | No | Shard ID for parallel processing (HF backend only). Selects rows [id*2000, min((id+1)*2000, len(dataset))) |
Outputs
| Name | Type | Description |
|---|---|---|
| dataset | datasets.Dataset | HuggingFace Dataset with an instruction (str) field, plus any source-specific fields (e.g. correct_answers and incorrect_answers for truthful_qa) |
Usage Examples
Basic Loading (vLLM Backend)
import json
import pandas as pd
import datasets
subset = "sharegpt"
load_path = f"./completion_data/{subset}.json"
# Load JSON → DataFrame → HuggingFace Dataset
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset))
print(len(dataset)) # Number of instructions
print(dataset[0]["instruction"]) # First instruction text
Sharded Loading (HuggingFace Backend)
import json
import pandas as pd
import datasets
subset = "truthful_qa"
shard_id = 0
load_path = f"./completion_data/{subset}.json"
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset)).select(
range(shard_id * 2000, min((shard_id + 1) * 2000, len(dataset)))
)
print(f"Shard {shard_id}: {len(dataset)} examples")