Principle: OpenBMB UltraFeedback Completion Data Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Construction |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A data ingestion strategy for loading completed instruction-response pairs, either from local JSON files or from the HuggingFace Hub, into the annotation pipeline.
Description
Completion Data Loading is the entry point for the GPT-4 annotation pipeline. After the completion generation phase produces JSON files containing instructions paired with model responses, the annotation scripts need to load this data for processing.
Three distinct loading patterns are used across the annotation scripts:
- Critique annotation: Loads from local JSON files in an annotation/ directory using json.load → pd.DataFrame → datasets.Dataset.from_pandas
- Preference annotation: Loads from local JSON files in the completion_data/ directory using the same pattern
- Score correction: Loads the published dataset directly from HuggingFace Hub using datasets.load_dataset("openbmb/UltraFeedback")
The distinction between local file loading and Hub loading reflects the pipeline's lifecycle: early stages work with local intermediate files, while the correction step operates on the published dataset.
Usage
Use the local JSON loading pattern during active dataset construction (critique and preference annotation). Use the HuggingFace Hub loading pattern for post-publication corrections or analysis.
Theoretical Basis
The loading patterns follow a standard ETL (Extract-Transform-Load) approach:
- Extract: Read raw JSON or download from Hub
- Transform: Convert to DataFrame then Dataset for uniform interface
- Load: Provide a HuggingFace Dataset with .map() support for batch processing
Pseudo-code Logic:
import json
import pandas as pd
import datasets

# Pattern 1: Local JSON loading (critique/preference annotation)
with open(path) as f:
    data = json.load(f)
dataset = datasets.Dataset.from_pandas(pd.DataFrame(data))

# Pattern 2: HuggingFace Hub loading (score correction)
dataset = datasets.load_dataset("openbmb/UltraFeedback")["train"]