Principle: OpenBMB UltraFeedback Completion Data Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A data ingestion strategy for loading completed instruction-response pairs, either from local JSON files or from the HuggingFace Hub, into the annotation pipeline.
Description
Completion Data Loading is the entry point for the GPT-4 annotation pipeline. After the completion generation phase produces JSON files containing instructions paired with model responses, the annotation scripts need to load this data for processing.
Three distinct loading patterns are used across the annotation scripts:
- Critique annotation: Loads from local JSON files in an annotation/ directory using json.load → pd.DataFrame → datasets.Dataset.from_pandas
- Preference annotation: Loads from local JSON files in the completion_data/ directory using the same pattern
- Score correction: Loads the published dataset directly from HuggingFace Hub using datasets.load_dataset("openbmb/UltraFeedback")
The distinction between local file loading and Hub loading reflects the pipeline's lifecycle: early stages work with local intermediate files, while the correction step operates on the published dataset.
Usage
Use the local JSON loading pattern during active dataset construction (critique and preference annotation). Use the HuggingFace Hub loading pattern for post-publication corrections or analysis.
Theoretical Basis
The loading patterns follow a standard ETL (Extract-Transform-Load) approach:
- Extract: Read raw JSON or download from Hub
- Transform: Convert to DataFrame then Dataset for uniform interface
- Load: Provide a HuggingFace Dataset with .map() support for batch processing
Pseudo-code Logic:
import json
import pandas as pd
import datasets

# Pattern 1: Local JSON loading (critique/preference annotation)
with open(path) as f:
    data = json.load(f)
dataset = datasets.Dataset.from_pandas(pd.DataFrame(data))

# Pattern 2: HuggingFace Hub loading (score correction)
dataset = datasets.load_dataset("openbmb/UltraFeedback")["train"]