

Implementation:OpenBMB UltraFeedback Instruction Data Loading

From Leeroopedia


Knowledge Sources
Domains NLP, Data_Construction
Last Updated 2023-10-02 00:00 GMT

Overview

Concrete tool for loading instruction data from JSON files into HuggingFace Dataset objects, as used in the UltraFeedback generation pipeline.

Description

The instruction data loading pattern in UltraFeedback uses Python's built-in json.load to read pre-prepared JSON files, converts them to a pandas DataFrame, and then wraps the result in a HuggingFace datasets.Dataset object. This two-step conversion (JSON → DataFrame → Dataset) is used because the JSON files contain lists of dictionaries that map naturally to tabular data, while the HuggingFace Dataset provides efficient .map() functionality for downstream batch processing.
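The two-step conversion can be sketched with a tiny in-memory payload (the sample records below are made up for illustration; only the `instruction` field name follows the pipeline's schema):

```python
import json
import pandas as pd

# Hypothetical stand-in for a pre-prepared instruction file:
# a JSON array of dictionaries, one per instruction.
raw = '[{"instruction": "Define entropy."}, {"instruction": "Sort a list in Python."}]'

records = json.loads(raw)   # JSON array -> list of dicts
df = pd.DataFrame(records)  # list of dicts -> tabular DataFrame

# In the pipeline, the DataFrame is then wrapped with
# datasets.Dataset.from_pandas(df) to gain .map() support.
print(df.shape)        # (2, 1)
print(df.columns[0])   # instruction
```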

The HuggingFace Transformers pipeline (main.py) additionally supports sharding via .select(range(start, end)) for distributing work across multiple processes.
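With the fixed shard size of 2000 used in main.py, process `id` covers the half-open row range [id*2000, min((id+1)*2000, N)). A minimal sketch of that arithmetic (the helper name is ours, not from the repository):

```python
SHARD_SIZE = 2000  # fixed shard size used in main.py

def shard_bounds(shard_id: int, total: int, shard_size: int = SHARD_SIZE):
    """Half-open row range [start, end) handled by one process."""
    start = shard_id * shard_size
    end = min((shard_id + 1) * shard_size, total)
    return start, end

# 5000 rows split across three shards: the last shard is short.
print(shard_bounds(0, 5000))  # (0, 2000)
print(shard_bounds(2, 5000))  # (4000, 5000)
```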

Usage

Import this pattern when you need to load the pre-prepared instruction JSON files for the UltraFeedback completion generation pipeline. Both the HuggingFace backend (main.py) and the vLLM backend (main_vllm.py) use this same loading pattern.

Code Reference

Source Location

  • Repository: UltraFeedback
  • File: src/comparison_data_generation/main.py (Lines 246-248)
  • File: src/comparison_data_generation/main_vllm.py (Lines 210-214)
  • File: src/comparison_data_generation/sampling.py (Lines 16-22)

Signature

# Pattern used in main.py (HuggingFace backend)
load_path = f"./completion_data/{subset}.json"
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset)).select(
    range(id * 2000, min((id + 1) * 2000, len(dataset)))
)

# Pattern used in main_vllm.py (vLLM backend)
load_path = f"./completion_data/{subset}.json"
dataset = json.load(open(load_path))
dataset = datasets.Dataset.from_pandas(pd.DataFrame(dataset))

# Pattern used in sampling.py
dataset = pd.read_json(f"./completion_data/{subset}.json", lines=True)
dataset = Dataset.from_pandas(pd.DataFrame(dataset))
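Note that the sampling.py variant passes lines=True, which expects JSON Lines input (one object per line) rather than a single JSON array, so the two file formats are not interchangeable. A quick contrast using in-memory strings (synthetic data, not from the repository):

```python
import io
import json
import pandas as pd

array_text = '[{"instruction": "a"}, {"instruction": "b"}]'  # JSON array
jsonl_text = '{"instruction": "a"}\n{"instruction": "b"}\n'  # JSON Lines

# main.py / main_vllm.py path: parse a JSON array, then build a DataFrame
df_array = pd.DataFrame(json.loads(array_text))

# sampling.py path: pandas parses one JSON object per line
df_jsonl = pd.read_json(io.StringIO(jsonl_text), lines=True)

# Both paths yield the same table when the data matches the format
print(df_array.equals(df_jsonl))  # True
```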

Import

import json
import pandas as pd
import datasets
# or: from datasets import Dataset

I/O Contract

Inputs

Name      | Type | Required | Description
load_path | str  | Yes      | Path to JSON file, e.g. "./completion_data/sharegpt.json"
subset    | str  | Yes      | Dataset subset name (sharegpt, flan, evol_instruct, ultrachat, truthful_qa, false_qa)
id        | int  | No       | Shard ID for parallel processing (HF backend only); selects rows [id*2000, (id+1)*2000)

Outputs

Name    | Type             | Description
dataset | datasets.Dataset | HuggingFace Dataset with an instruction (str) field, plus any source-specific fields (e.g. correct_answers and incorrect_answers for TruthfulQA)
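Downstream generation code assumes every row carries a usable instruction field. A small validation sketch for loaded records (the helper name and sample row are illustrative, not part of the repository):

```python
def check_instruction_records(records):
    """Verify each loaded record has a non-empty string 'instruction' field."""
    for i, row in enumerate(records):
        value = row.get("instruction")
        if not isinstance(value, str) or not value:
            raise ValueError(f"record {i} has no usable 'instruction': {row!r}")
    return len(records)

# TruthfulQA-style rows may also carry correct_answers / incorrect_answers.
sample = [
    {"instruction": "Why is the sky blue?",
     "correct_answers": ["Rayleigh scattering"],
     "incorrect_answers": ["Reflection of the ocean"]},
]
print(check_instruction_records(sample))  # 1
```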

Usage Examples

Basic Loading (vLLM Backend)

import json
import pandas as pd
import datasets

subset = "sharegpt"
load_path = f"./completion_data/{subset}.json"

# Load JSON → DataFrame → HuggingFace Dataset
# (a context manager closes the file, unlike the bare open() in the source)
with open(load_path) as f:
    records = json.load(f)
dataset = datasets.Dataset.from_pandas(pd.DataFrame(records))

print(len(dataset))  # Number of instructions
print(dataset[0]["instruction"])  # First instruction text

Sharded Loading (HuggingFace Backend)

import json
import pandas as pd
import datasets

subset = "truthful_qa"
shard_id = 0
load_path = f"./completion_data/{subset}.json"

# Load, then select this shard's slice of rows
# (a context manager closes the file, unlike the bare open() in the source)
with open(load_path) as f:
    records = json.load(f)
dataset = datasets.Dataset.from_pandas(pd.DataFrame(records)).select(
    range(shard_id * 2000, min((shard_id + 1) * 2000, len(records)))
)

print(f"Shard {shard_id}: {len(dataset)} examples")

Related Pages

Implements Principle

Requires Environment
