
Principle:Sail-sg LongSpec Data Preparation

From Leeroopedia
Knowledge Sources
Domains Data_Engineering, NLP, Training
Last Updated 2026-02-14 05:00 GMT

Overview

Principle for transforming raw training corpora into model-ready datasets using composable read functions, alignment transformations, and template-based formatting.

Description

Data Preparation in the GLIDE training pipeline involves a modular, composable system for loading and formatting training data. Raw data (typically JSONL files) is processed through three stages:

  • Reading: Factory functions (jsonl_read_fn, json_read_fn) parse raw files into structured records
  • Alignment: Aligner functions (add_id_aligner, GLIDEAligner) transform records by adding metadata, restructuring fields, or filtering
  • Template formatting: String templates map raw fields to model-expected input format (e.g., combining instruction + context + response)

The system is Hydra-instantiable, meaning all components (reader, aligner, template) are specified in YAML configuration files and composed at runtime. This allows swapping data sources and preprocessing strategies without code changes.
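As a hedged illustration, a Hydra config composing these components might look like the following. The `_target_` module paths and key names here are hypothetical placeholders, not copied from the repository:

```yaml
# Hypothetical config sketch; actual _target_ paths and key names may differ.
data:
  _target_: glide.data.MultiMappingDataset
  file_path: data/train.jsonl
  read_fn:
    _target_: glide.data.jsonl_read_fn   # factory returning the reader callable
  aligner:
    _target_: glide.data.add_id_aligner  # factory returning the aligner callable
  template:
    units: [instruction, context, response]
    compositions: "{instruction}\n{context}\n{response}"
```

Swapping a data source or preprocessing strategy then means editing only this YAML block.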

The pipeline also supports distributed data splitting (split_size/split_id) for multi-GPU training and optional data limiting (max_data_num) for debugging.
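The splitting and limiting options can be sketched as a small helper. This is a hypothetical function for illustration; the actual implementation may shard contiguously rather than with a stride:

```python
def split_records(records, split_size=1, split_id=0, max_data_num=None):
    """Shard records across split_size workers: worker split_id takes every
    split_size-th record. max_data_num caps the shard for quick debugging runs."""
    shard = records[split_id::split_size]
    if max_data_num is not None:
        shard = shard[:max_data_num]
    return shard
```

Each GPU rank would pass its own `split_id`, so the shards partition the dataset without overlap.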

Usage

Apply this principle when setting up training data loading for any GLIDE training stage. The data pipeline is defined entirely in Hydra YAML config and instantiated by the training entry point. Different training stages use different data sources, templates, and collators:

  • Stage 1: SlimPajama-6B with basic text concatenation
  • Stage 2: Long-context data (32k sequences) with no-mask collators
  • Stage 3: Long Chain-of-Thought data with specialized CoT collators

Theoretical Basis

The composable data pipeline follows the builder pattern: each component (reader, aligner, template) is a configurable factory that returns a callable. These are composed via Hydra's _target_ mechanism:

# Abstract pipeline (not actual implementation)
reader = hydra.utils.instantiate(cfg.read_fn)   # e.g., jsonl_read_fn()
aligner = hydra.utils.instantiate(cfg.aligner)  # e.g., add_id_aligner()
template = recompose_template(cfg.units, cfg.compositions)

dataset = MultiMappingDataset(
    file_path=cfg.file_path,
    tokenizer=tokenizer,
    template=template,
    aligner=aligner,
    read_fn=reader,
)

Data flows through:

  1. Raw file → Reader → List[Dict]
  2. List[Dict] → Aligner → List[Dict] (with metadata)
  3. Dict → Template → Formatted string → Collator → Tensor batch
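The flow above can be exercised end to end with stand-in components. This is a toy sketch: a real pipeline would use a model tokenizer and a collator that produces tensor batches, not character codes and padded lists.

```python
import json

def run_pipeline(raw_lines, read_fn, aligner, template, tokenizer, collate):
    records = read_fn(raw_lines)                     # 1. raw lines -> List[Dict]
    records = aligner(records)                       # 2. attach metadata
    texts = [template.format(**r) for r in records]  # 3a. template formatting
    return collate([tokenizer(t) for t in texts])    # 3b. collate into a batch

# Stand-in components for illustration only
read_fn = lambda lines: [json.loads(l) for l in lines]
aligner = lambda recs: [{**r, "id": i} for i, r in enumerate(recs)]
tokenizer = lambda text: [ord(c) for c in text]          # fake "token ids"
collate = lambda seqs: [s + [0] * (max(map(len, seqs)) - len(s)) for s in seqs]
```

Because every stage is a callable, any one of them can be swapped via config without touching `run_pipeline`.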
