Principle: Pretraining Dataset Preparation
| Knowledge Sources | LLMBook (llmbook-zh.github.io) |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A data preparation technique that tokenizes raw text, concatenates all tokens, and chunks them into fixed-length sequences for causal language model pre-training.
Description
Pretraining Dataset Preparation converts raw text corpora into training-ready tensor sequences. The process involves three stages: (1) tokenization of individual text examples using a pre-trained tokenizer, (2) concatenation of all tokenized sequences into a single stream, and (3) chunking the stream into fixed-length blocks matching the model's context window. This approach maximizes training efficiency by eliminating padding and ensuring every token contributes to learning.
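The padding-elimination claim can be made concrete with a small back-of-the-envelope calculation. The document lengths and block size below are illustrative, not taken from the source:

```python
# Why packing beats per-example padding (hypothetical numbers).
# With padding, each document is padded up to block_size and the pad tokens
# are masked out of the loss; with packing, every position is a real token.
block_size = 1024
doc_lengths = [300, 750, 120, 980, 410]  # token counts per document, illustrative

padded_slots = len(doc_lengths) * block_size   # 5 * 1024 = 5120 slots
real_tokens = sum(doc_lengths)                 # 2560 real tokens
utilization_padded = real_tokens / padded_slots  # 0.5 -> half the slots wasted

packed_blocks = real_tokens // block_size      # 2 full blocks, no padding
utilization_packed = 1.0                       # every slot carries a real token
```

Here padding would waste half the compute spent per batch, while packing uses every position, at the cost of dropping the 512-token remainder that does not fill a block.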
Usage
Use this principle when preparing data for causal language model pre-training. The concatenation-and-chunking approach is standard for GPT-style models and applies when the training data consists of independent text documents to be packed into fixed-length sequences.
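One practical detail when packing independent documents: a separator token (commonly the tokenizer's end-of-sequence token, as with GPT-2's `<|endoftext|>`) is usually appended to each document so that document boundaries survive concatenation. A minimal sketch, with `eos_id = 0` as an illustrative stand-in:

```python
# Sketch: append a separator token after each document before packing,
# so the model can learn where one document ends and the next begins.
# eos_id = 0 is a placeholder; in practice use the tokenizer's EOS id.
def pack_with_separator(tokenized_docs, eos_id):
    stream = []
    for ids in tokenized_docs:
        stream.extend(ids)
        stream.append(eos_id)  # mark the document boundary
    return stream

stream = pack_with_separator([[5, 6, 7], [8, 9]], eos_id=0)
# -> [5, 6, 7, 0, 8, 9, 0]
```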
Theoretical Basis
The preparation pipeline follows three steps:
- Tokenize: Convert each text to token IDs using the model's tokenizer.
- Concatenate: Join all token sequences into a single continuous stream.
- Chunk: Split the stream into non-overlapping segments of length block_size (typically the model's max context window).
Pseudo-code:
# Abstract algorithm (NOT a real implementation)
all_tokens = []
for text in corpus:
    tokens = tokenizer.encode(text)
    all_tokens.extend(tokens)

# Chunk into non-overlapping blocks; the trailing remainder that does not
# fill a full block is dropped
num_blocks = len(all_tokens) // block_size
blocks = [all_tokens[i * block_size:(i + 1) * block_size]
          for i in range(num_blocks)]

# For causal LM: labels = input_ids (the model shifts them internally)
dataset = [(block, block) for block in blocks]
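The pseudo-code above can be turned into a runnable end-to-end sketch. The whitespace "tokenizer" below is a self-contained stand-in for a real subword tokenizer, and `block_size=4` is deliberately tiny for illustration:

```python
# Runnable sketch of tokenize -> concatenate -> chunk for causal LM data.
def prepare_blocks(corpus, encode, block_size):
    """Pack tokenized texts into non-overlapping fixed-length blocks."""
    all_tokens = []
    for text in corpus:
        all_tokens.extend(encode(text))
    num_blocks = len(all_tokens) // block_size  # tail remainder is dropped
    return [all_tokens[i * block_size:(i + 1) * block_size]
            for i in range(num_blocks)]

# Toy tokenizer: map each whitespace-separated word to an integer ID.
vocab = {}
def encode(text):
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

corpus = ["the quick brown fox", "jumps over the lazy dog", "hello world"]
blocks = prepare_blocks(corpus, encode, block_size=4)
# 11 tokens total -> 2 full blocks of 4; the last 3 tokens are discarded.

# For causal LM training, the labels are the input IDs themselves.
dataset = [{"input_ids": b, "labels": list(b)} for b in blocks]
```

In practice the same pattern is usually expressed as a batched `map` over a Hugging Face `datasets.Dataset` with a real tokenizer, but the chunking arithmetic is identical.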