Principle: Sail-sg LongSpec Data Preparation
| Knowledge Sources | |
|---|---|
| Domains | Data_Engineering, NLP, Training |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Principle for transforming raw training corpora into model-ready datasets using composable read functions, alignment transformations, and template-based formatting.
Description
Data Preparation in the GLIDE training pipeline involves a modular, composable system for loading and formatting training data. Raw data (typically JSONL files) is processed through three stages:
- Reading: Factory functions (jsonl_read_fn, json_read_fn) parse raw files into structured records
- Alignment: Aligner functions (add_id_aligner, GLIDEAligner) transform records by adding metadata, restructuring fields, or filtering
- Template formatting: String templates map raw fields to model-expected input format (e.g., combining instruction + context + response)
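The three stages can be sketched as plain Python factories. This is a minimal illustration, not the actual GLIDE implementation: the function names mirror jsonl_read_fn and add_id_aligner from the pipeline, but their bodies and the template string are assumptions.

```python
import json
from typing import Callable, Dict, List

def jsonl_read_fn() -> Callable[[str], List[Dict]]:
    """Stage 1 (Reading): factory returning a reader that parses one JSON object per line."""
    def read(path: str) -> List[Dict]:
        with open(path) as f:
            return [json.loads(line) for line in f if line.strip()]
    return read

def add_id_aligner() -> Callable[[List[Dict]], List[Dict]]:
    """Stage 2 (Alignment): factory returning an aligner that attaches a sequential id."""
    def align(records: List[Dict]) -> List[Dict]:
        return [{**rec, "id": i} for i, rec in enumerate(records)]
    return align

# Stage 3 (Template formatting): an illustrative template combining the fields named above.
TEMPLATE = "Instruction: {instruction}\nContext: {context}\nResponse: {response}"

def format_record(record: Dict, template: str = TEMPLATE) -> str:
    """Map a raw record's fields into the model-expected input string."""
    return template.format(**record)
```

Because each stage is a callable returned by a factory, any stage can be swapped independently of the others.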
The system is Hydra-instantiable, meaning all components (reader, aligner, template) are specified in YAML configuration files and composed at runtime. This allows swapping data sources and preprocessing strategies without code changes.
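A Hydra config for such a pipeline might look like the sketch below. Only `_target_` is real Hydra syntax; the module paths, key names, and file paths are hypothetical placeholders, not the actual GLIDE config schema.

```yaml
# Hypothetical config sketch; everything except _target_ is illustrative.
data:
  read_fn:
    _target_: glide.data.jsonl_read_fn
  aligner:
    _target_: glide.data.add_id_aligner
  file_path: data/stage1/train.jsonl
  units:
    instruction: "Instruction: {instruction}\n"
    response: "Response: {response}"
  compositions: [instruction, response]
```

Swapping the aligner or data source then means editing this file, not the training code.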
The pipeline also supports distributed data splitting (split_size/split_id) for multi-GPU training and optional data limiting (max_data_num) for debugging.
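The splitting and limiting behavior can be approximated as follows. This is an illustrative sketch: the source does not specify whether shards are strided or contiguous, so the strided slicing here is an assumption.

```python
from typing import Dict, List, Optional

def shard_and_limit(records: List[Dict], split_size: int, split_id: int,
                    max_data_num: Optional[int] = None) -> List[Dict]:
    """Illustrative sketch (not the GLIDE implementation): give each of
    split_size workers one shard, then optionally truncate for debugging.

    Strided sharding is assumed here; worker split_id takes every
    split_size-th record starting at its own index.
    """
    shard = records[split_id::split_size]
    if max_data_num is not None:
        shard = shard[:max_data_num]
    return shard
```

Each GPU rank would call this with its own split_id, so the union of all shards covers the dataset exactly once.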
Usage
Apply this principle when setting up training data loading for any GLIDE training stage. The data pipeline is defined entirely in Hydra YAML config and instantiated by the training entry point. Different training stages use different data sources, templates, and collators:
- Stage 1: SlimPajama-6B with basic text concatenation
- Stage 2: Long-context data (32k sequences) with no-mask collators
- Stage 3: Long Chain-of-Thought data with specialized CoT collators
Theoretical Basis
The composable data pipeline follows the builder pattern: each component (reader, aligner, template) is a configurable factory that returns a callable. These are composed via Hydra's `_target_` mechanism:
```python
# Abstract pipeline (not actual implementation)
reader = hydra.utils.instantiate(cfg.read_fn)   # e.g., jsonl_read_fn()
aligner = hydra.utils.instantiate(cfg.aligner)  # e.g., add_id_aligner()
template = recompose_template(cfg.units, cfg.compositions)
dataset = MultiMappingDataset(
    file_path=cfg.file_path,
    tokenizer=tokenizer,
    template=template,
    aligner=aligner,
    read_fn=reader,
)
```
Data flows through:
- Raw file → Reader → List[Dict]
- List[Dict] → Aligner → List[Dict] (with metadata)
- Dict → Template → Formatted string → Collator → Tensor batch
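The final step, from formatted string to batch, can be sketched with a toy collator. Everything here is a stand-in: a whitespace "tokenizer" and list-of-lists padding replace the real tokenizer and tensor batching, to show only the shape of the transformation.

```python
from typing import Dict, List

VOCAB: Dict[str, int] = {}

def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: map whitespace tokens to integer ids, growing the vocab on demand."""
    return [VOCAB.setdefault(tok, len(VOCAB) + 1) for tok in text.split()]

def collate(formatted: List[str], pad_id: int = 0) -> List[List[int]]:
    """Stand-in collator: tokenize each formatted string and right-pad to a rectangular batch."""
    ids = [toy_tokenize(s) for s in formatted]
    width = max(len(seq) for seq in ids)
    return [seq + [pad_id] * (width - len(seq)) for seq in ids]
```

A real collator would additionally build attention masks and return tensors, and the Stage 2/3 no-mask and CoT collators would differ in how they mask loss over the padded and prompt positions.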