Principle:OpenBMB UltraFeedback Completion Data Loading

From Leeroopedia


Knowledge Sources
Domains NLP, Data_Construction
Last Updated 2023-10-02 00:00 GMT

Overview

A data ingestion strategy for loading completed instruction-response pairs, either from local JSON files or from the HuggingFace Hub, into the annotation pipeline.

Description

Completion Data Loading is the entry point for the GPT-4 annotation pipeline. After the completion generation phase produces JSON files containing instructions paired with model responses, the annotation scripts need to load this data for processing.

The annotation scripts load data in three places. The first two share the same local JSON mechanism, and the third reads from the Hub:

  1. Critique annotation: Loads from local JSON files in an annotation/ directory using json.load → pd.DataFrame → datasets.Dataset.from_pandas
  2. Preference annotation: Loads from local JSON files in the completion_data/ directory using the same pattern
  3. Score correction: Loads the published dataset directly from HuggingFace Hub using datasets.load_dataset("openbmb/UltraFeedback")

The distinction between local file loading and Hub loading reflects the pipeline's lifecycle: early stages work with local intermediate files, while the correction step operates on the published dataset.
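The local-file stage can be sketched with the standard library alone. This is a minimal illustration, not the project's actual script: the helper name read_json_records, the file names, and the assumption that each file holds a JSON array of records are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def read_json_records(directory):
    """Collect records from every *.json file in `directory` (sorted by name)."""
    records = []
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        # Assumption: each file contains a JSON array of record objects.
        records.extend(data)
    return records


# Demo on a throwaway directory with two hypothetical completion files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "part_0.json").write_text(
        json.dumps([{"instruction": "a"}, {"instruction": "b"}]), encoding="utf-8"
    )
    Path(tmp, "part_1.json").write_text(
        json.dumps([{"instruction": "c"}]), encoding="utf-8"
    )
    demo_records = read_json_records(tmp)
```

The resulting list of dictionaries is the shape that pd.DataFrame accepts before conversion to a Dataset.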

Usage

Use the local JSON loading pattern during active dataset construction (critique and preference annotation). Use the HuggingFace Hub loading pattern for post-publication corrections or analysis.

Theoretical Basis

The loading patterns follow a standard ETL (Extract-Transform-Load) approach:

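  • Extract: Read raw JSON from disk or download the published dataset from the Hub
  • Transform: Convert local JSON to a DataFrame and then to a Dataset so both sources expose a uniform interface (Hub data already arrives as a Dataset)
  • Load: Provide a HuggingFace Dataset with .map() support for batch processing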

Pseudo-code Logic:

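import json
import datasets
import pandas as pd

# Pattern 1: Local JSON loading (critique/preference annotation)
# `path` is a placeholder for the location of a completion JSON file
with open(path) as f:
    data = json.load(f)
dataset = datasets.Dataset.from_pandas(pd.DataFrame(data))

# Pattern 2: HuggingFace Hub loading (score correction)
dataset = datasets.load_dataset("openbmb/UltraFeedback")["train"]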

Related Pages

Implemented By
