
Principle: LLMBook-zh/LLMBook-zh.github.io Pretraining Dataset Preparation

From Leeroopedia


Knowledge Sources
Domains: NLP, Data_Engineering
Last Updated: 2026-02-08 00:00 GMT

Overview

A data preparation technique that tokenizes raw text, concatenates all tokens, and chunks them into fixed-length sequences for causal language model pre-training.

Description

Pretraining Dataset Preparation converts raw text corpora into training-ready tensor sequences. The process involves three stages: (1) tokenization of individual text examples using a pre-trained tokenizer, (2) concatenation of all tokenized sequences into a single stream, and (3) chunking the stream into fixed-length blocks matching the model's context window. This approach maximizes training efficiency by eliminating padding and ensuring every token contributes to learning.
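As a concrete end-to-end illustration of the three stages, here is a minimal sketch using a toy whitespace tokenizer (standing in for a real subword tokenizer; the vocabulary and corpus are invented for demonstration only):

```python
# Toy illustration of tokenize -> concatenate -> chunk.
# The "tokenizer" is a whitespace/vocab mapping, standing in for a
# real subword tokenizer (illustrative assumption only).

corpus = ["the cat sat", "on the mat today", "dogs bark"]

vocab = {}
def encode(text):
    # Assign each new word the next free integer ID
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]

# Stage 1 + 2: tokenize each document and concatenate into one stream
stream = []
for doc in corpus:
    stream.extend(encode(doc))

# Stage 3: chunk into non-overlapping blocks, dropping the ragged tail
block_size = 4
num_blocks = len(stream) // block_size
blocks = [stream[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

print(blocks)  # 9 tokens -> 2 full blocks of 4; the final token is dropped
```

Note that no block is padded: every position in every block carries a real token, which is the efficiency argument made above.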

Usage

Use this principle when preparing data for causal language model pre-training. The concatenation-and-chunking approach is standard for GPT-style models and applies when the training data consists of independent text documents to be packed into fixed-length sequences.

Theoretical Basis

The preparation pipeline follows three steps:

  1. Tokenize: Convert each text to token IDs using the model's tokenizer.
  2. Concatenate: Join all token sequences into a single continuous stream.
  3. Chunk: Split the stream into non-overlapping segments of length block_size (typically the model's max context window).

Pseudo-code:

# Abstract algorithm (NOT real implementation)
all_tokens = []
for text in corpus:
    tokens = tokenizer.encode(text)
    all_tokens.extend(tokens)

# Chunk into non-overlapping blocks; the ragged tail
# (fewer than block_size tokens) is dropped
num_blocks = len(all_tokens) // block_size
blocks = [all_tokens[i*block_size:(i+1)*block_size] for i in range(num_blocks)]

# For causal LM: labels = input_ids (the model shifts them internally
# when computing the next-token loss)
dataset = [(block, block) for block in blocks]
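In the Hugging Face ecosystem this chunking step is conventionally written as a batched `Dataset.map` function operating on dict-of-lists batches. Below is a self-contained sketch of that function; the name `group_texts` follows the common `run_clm.py` convention and is an assumption, not part of this page:

```python
from itertools import chain

block_size = 4  # small for demonstration; typically the model's context window

def group_texts(examples):
    """Concatenate all sequences in a batch and split into fixed-size blocks.

    `examples` is a dict of lists (e.g. {"input_ids": [[...], [...]]}),
    the batch format produced by datasets.Dataset.map(batched=True).
    """
    concatenated = {k: list(chain(*examples[k])) for k in examples}
    # Drop the ragged tail so every block has exactly block_size tokens
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    result = {
        k: [v[i:i + block_size] for i in range(0, total, block_size)]
        for k, v in concatenated.items()
    }
    # For causal LM training, labels are a copy of the inputs;
    # the model shifts them internally when computing the loss.
    result["labels"] = result["input_ids"].copy()
    return result

batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9]]}
print(group_texts(batch))
```

In practice this function would be passed to `tokenized_dataset.map(group_texts, batched=True)` after a tokenization pass; batched mapping matters because concatenation must see many documents at once, not one at a time.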

Related Pages

Implemented By
