Principle: Sail-sg LongSpec Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Training |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Principle for transforming raw training corpora into model-ready datasets using composable read functions, alignment transformations, and template-based formatting.
Description
Data Preparation in the GLIDE training pipeline involves a modular, composable system for loading and formatting training data. Raw data (typically JSONL files) is processed through three stages:
- Reading: Factory functions (jsonl_read_fn, json_read_fn) parse raw files into structured records
- Alignment: Aligner functions (add_id_aligner, GLIDEAligner) transform records by adding metadata, restructuring fields, or filtering
- Template formatting: String templates map raw fields to model-expected input format (e.g., combining instruction + context + response)
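The three stages can be sketched as plain Python factories. This is a minimal illustration, not the actual GLIDE implementation: the function names mirror jsonl_read_fn and add_id_aligner from the pipeline, but their bodies and the template string are assumptions.

```python
import json
from typing import Callable, Dict, List

def jsonl_read_fn() -> Callable[[str], List[Dict]]:
    """Stage 1 (Reading): factory returning a reader that parses one JSON object per line."""
    def read(path: str) -> List[Dict]:
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    return read

def add_id_aligner() -> Callable[[List[Dict]], List[Dict]]:
    """Stage 2 (Alignment): factory returning an aligner that attaches a sequential id."""
    def align(records: List[Dict]) -> List[Dict]:
        return [{**rec, "id": i} for i, rec in enumerate(records)]
    return align

# Stage 3 (Template formatting): an illustrative template combining the fields named above.
TEMPLATE = "Instruction: {instruction}\nContext: {context}\nResponse: {response}"

def format_record(record: Dict, template: str = TEMPLATE) -> str:
    """Map a raw record's fields into the model-expected input string."""
    return template.format(**record)
```

Because each stage is a callable returned by a factory, any stage can be swapped independently of the others.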
The system is Hydra-instantiable, meaning all components (reader, aligner, template) are specified in YAML configuration files and composed at runtime. This allows swapping data sources and preprocessing strategies without code changes.
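A Hydra config for such a pipeline might look like the sketch below. Only `_target_` is real Hydra syntax; the module paths, key names, and file paths are hypothetical placeholders, not the actual GLIDE config schema.

```yaml
# Hypothetical config sketch; everything except _target_ is illustrative.
data:
  read_fn:
    _target_: glide.data.jsonl_read_fn
  aligner:
    _target_: glide.data.add_id_aligner
  file_path: data/stage1/train.jsonl
  units:
    instruction: "Instruction: {instruction}\n"
    response: "Response: {response}"
  compositions: [instruction, response]
```

Swapping the aligner or data source then means editing this file, not the training code.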
The pipeline also supports distributed data splitting (split_size/split_id) for multi-GPU training and optional data limiting (max_data_num) for debugging.
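The splitting and limiting behavior can be approximated as follows. This is an illustrative sketch: the source does not specify whether shards are strided or contiguous, so the strided slicing here is an assumption.

```python
from typing import Dict, List, Optional

def shard_and_limit(records: List[Dict], split_size: int, split_id: int,
                    max_data_num: Optional[int] = None) -> List[Dict]:
    """Illustrative sketch (not the GLIDE implementation): give each of
    split_size workers one shard, then optionally truncate for debugging.

    Strided sharding is assumed here; worker split_id takes every
    split_size-th record starting at its own index.
    """
    shard = records[split_id::split_size]
    if max_data_num is not None:
        shard = shard[:max_data_num]
    return shard
```

Each GPU rank would call this with its own split_id, so the union of all shards covers the dataset exactly once.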
Usage
Apply this principle when setting up training data loading for any GLIDE training stage. The data pipeline is defined entirely in Hydra YAML config and instantiated by the training entry point. Different training stages use different data sources, templates, and collators:
- Stage 1: SlimPajama-6B with basic text concatenation
- Stage 2: Long-context data (32k sequences) with no-mask collators
- Stage 3: Long Chain-of-Thought data with specialized CoT collators
Theoretical Basis
The composable data pipeline follows the builder pattern: each component (reader, aligner, template) is a configurable factory that returns a callable. These are composed via Hydra's `_target_` mechanism:
```python
# Abstract pipeline (not actual implementation)
reader = hydra.utils.instantiate(cfg.read_fn)   # e.g., jsonl_read_fn()
aligner = hydra.utils.instantiate(cfg.aligner)  # e.g., add_id_aligner()
template = recompose_template(cfg.units, cfg.compositions)
dataset = MultiMappingDataset(
    file_path=cfg.file_path,
    tokenizer=tokenizer,
    template=template,
    aligner=aligner,
    read_fn=reader,
)
```
Data flows through:
- Raw file → Reader → List[Dict]
- List[Dict] → Aligner → List[Dict] (with metadata)
- Dict → Template → Formatted string → Collator → Tensor batch
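The final step, from formatted string to batch, can be sketched with a toy collator. Everything here is a stand-in: a whitespace "tokenizer" and list-of-lists padding replace the real tokenizer and tensor batching, to show only the shape of the transformation.

```python
from typing import Dict, List

VOCAB: Dict[str, int] = {}

def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: map whitespace tokens to integer ids, growing the vocab on demand."""
    return [VOCAB.setdefault(tok, len(VOCAB) + 1) for tok in text.split()]

def collate(formatted: List[str], pad_id: int = 0) -> List[List[int]]:
    """Stand-in collator: tokenize each formatted string and right-pad to a rectangular batch."""
    ids = [toy_tokenize(s) for s in formatted]
    width = max(len(seq) for seq in ids)
    return [seq + [pad_id] * (width - len(seq)) for seq in ids]
```

A real collator would additionally build attention masks and return tensors, and the Stage 2/3 no-mask and CoT collators would differ in how they mask loss over the padded and prompt positions.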