Principle: Pretraining Dataset Preparation
| Knowledge Sources | LLMBook (llmbook-zh.github.io) |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A data preparation technique that tokenizes raw text, concatenates all tokens, and chunks them into fixed-length sequences for causal language model pre-training.
Description
Pretraining Dataset Preparation converts raw text corpora into training-ready tensor sequences. The process involves three stages: (1) tokenization of individual text examples using a pre-trained tokenizer, (2) concatenation of all tokenized sequences into a single stream, and (3) chunking the stream into fixed-length blocks matching the model's context window. This approach maximizes training efficiency by eliminating padding and ensuring every token contributes to learning.
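The padding-elimination claim can be made concrete with a small back-of-the-envelope calculation. The document lengths and block size below are illustrative, not taken from the source:

```python
# Why packing beats per-example padding (hypothetical numbers).
# With padding, each document is padded up to block_size and the pad tokens
# are masked out of the loss; with packing, every position is a real token.
block_size = 1024
doc_lengths = [300, 750, 120, 980, 410]  # token counts per document, illustrative

padded_slots = len(doc_lengths) * block_size   # 5 * 1024 = 5120 slots
real_tokens = sum(doc_lengths)                 # 2560 real tokens
utilization_padded = real_tokens / padded_slots  # 0.5 -> half the slots wasted

packed_blocks = real_tokens // block_size      # 2 full blocks, no padding
utilization_packed = 1.0                       # every slot carries a real token
```

Here padding would waste half the compute spent per batch, while packing uses every position, at the cost of dropping the 512-token remainder that does not fill a block.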
Usage
Use this principle when preparing data for causal language model pre-training. The concatenation-and-chunking approach is standard for GPT-style models and applies when the training data consists of independent text documents to be packed into fixed-length sequences.
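One practical detail when packing independent documents: a separator token (commonly the tokenizer's end-of-sequence token, as with GPT-2's `<|endoftext|>`) is usually appended to each document so that document boundaries survive concatenation. A minimal sketch, with `eos_id = 0` as an illustrative stand-in:

```python
# Sketch: append a separator token after each document before packing,
# so the model can learn where one document ends and the next begins.
# eos_id = 0 is a placeholder; in practice use the tokenizer's EOS id.
def pack_with_separator(tokenized_docs, eos_id):
    stream = []
    for ids in tokenized_docs:
        stream.extend(ids)
        stream.append(eos_id)  # mark the document boundary
    return stream

stream = pack_with_separator([[5, 6, 7], [8, 9]], eos_id=0)
# -> [5, 6, 7, 0, 8, 9, 0]
```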
Theoretical Basis
The preparation pipeline follows three steps:
- Tokenize: Convert each text to token IDs using the model's tokenizer.
- Concatenate: Join all token sequences into a single continuous stream.
- Chunk: Split the stream into non-overlapping segments of length block_size (typically the model's max context window).
Pseudo-code:
# Abstract algorithm (NOT a real implementation)
all_tokens = []
for text in corpus:
    tokens = tokenizer.encode(text)
    all_tokens.extend(tokens)

# Chunk into non-overlapping blocks; the trailing remainder that does not
# fill a full block is dropped
num_blocks = len(all_tokens) // block_size
blocks = [all_tokens[i * block_size:(i + 1) * block_size]
          for i in range(num_blocks)]

# For causal LM: labels = input_ids (the model shifts them internally)
dataset = [(block, block) for block in blocks]
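The pseudo-code above can be turned into a runnable end-to-end sketch. The whitespace "tokenizer" below is a self-contained stand-in for a real subword tokenizer, and `block_size=4` is deliberately tiny for illustration:

```python
# Runnable sketch of tokenize -> concatenate -> chunk for causal LM data.
def prepare_blocks(corpus, encode, block_size):
    """Pack tokenized texts into non-overlapping fixed-length blocks."""
    all_tokens = []
    for text in corpus:
        all_tokens.extend(encode(text))
    num_blocks = len(all_tokens) // block_size  # tail remainder is dropped
    return [all_tokens[i * block_size:(i + 1) * block_size]
            for i in range(num_blocks)]

# Toy tokenizer: map each whitespace-separated word to an integer ID.
vocab = {}
def encode(text):
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

corpus = ["the quick brown fox", "jumps over the lazy dog", "hello world"]
blocks = prepare_blocks(corpus, encode, block_size=4)
# 11 tokens total -> 2 full blocks of 4; the last 3 tokens are discarded.

# For causal LM training, the labels are the input IDs themselves.
dataset = [{"input_ids": b, "labels": list(b)} for b in blocks]
```

In practice the same pattern is usually expressed as a batched `map` over a Hugging Face `datasets.Dataset` with a real tokenizer, but the chunking arithmetic is identical.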