Principle:Hpcaitech ColossalAI Sequence Packing Dataset

Knowledge Sources	ColossalAI
Domains	NLP, Data_Engineering
Last Updated	2026-02-09 00:00 GMT

Overview

A data engineering pattern that splices multiple short sequences into constant-length packed sequences to maximize GPU utilization during pretraining.

Description

Standard pretraining data contains sequences of varying lengths. Padding short sequences wastes computation, while truncating long sequences loses information. Sequence packing (splicing) concatenates multiple short sequences to fill a target constant length, with appropriate separator tokens and loss masking to prevent cross-contamination between concatenated documents.

The ClosedToConstantLengthSplicedDataset implements a greedy bin-packing algorithm that maintains a buffer of tokenized sequences and combines them to reach the target length.

Usage

Use this principle when preparing pretraining data where most sequences are shorter than the target context length. It improves training throughput by replacing most padding tokens with real tokens; only the tail of a packed sequence may still need padding.
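To make the throughput argument concrete, the following sketch compares the fraction of useful (non-padding) tokens when each document is padded on its own versus when documents are packed greedily in arrival order. The function names and the example lengths are hypothetical and chosen only for illustration; they are not taken from ColossalAI.

```python
from typing import List


def padded_utilization(lengths: List[int], max_length: int) -> float:
    """Fraction of real tokens when every document fills its own padded row."""
    if not lengths:
        return 0.0
    useful = sum(min(n, max_length) for n in lengths)
    return useful / (len(lengths) * max_length)


def packed_utilization(lengths: List[int], max_length: int) -> float:
    """Fraction of real tokens when documents are packed greedily in order."""
    if not lengths:
        return 0.0
    useful = 0   # real tokens seen so far
    rows = 0     # completed packed rows
    current = 0  # tokens already placed in the open row
    for n in lengths:
        n = min(n, max_length)  # an oversize document is truncated
        if current + n > max_length:
            rows += 1           # close the open row and start a new one
            current = 0
        current += n
        useful += n
    if current:
        rows += 1               # count the final, partially filled row
    return useful / (rows * max_length)


# Hypothetical document lengths (in tokens) and context length.
lengths = [100, 300, 200, 400, 50]
print(padded_utilization(lengths, 512))  # 1050 / (5 * 512)
print(packed_utilization(lengths, 512))  # 1050 / (3 * 512)
```

With these example lengths, padding alone uses five rows of 512 tokens, while greedy packing fits the same 1050 tokens into three rows.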

Theoretical Basis

The packing algorithm:

1. Tokenize each document independently.
2. Maintain a buffer of tokenized sequences.
3. Greedily combine sequences until reaching the target length.
4. Apply loss masks at document boundaries.
5. Yield packed sequences of constant length.
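The steps above can be sketched as a small generator. This is a minimal illustration under stated assumptions, not the ClosedToConstantLengthSplicedDataset implementation: the function name `pack_sequences`, its parameters, the truncation of oversize documents, and the use of -100 as the ignored label value are all assumptions made for the example. Documents are assumed to be tokenized already (step 1).

```python
from typing import Dict, Iterable, Iterator, List

IGNORE_INDEX = -100  # assumed label value that the loss function skips


def _pad(ids: List[int], labels: List[int], max_length: int,
         pad_id: int) -> Dict[str, List[int]]:
    """Pad a packed sample to max_length; padding positions carry no loss."""
    gap = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * gap,
        "labels": labels + [IGNORE_INDEX] * gap,
    }


def pack_sequences(tokenized_docs: Iterable[List[int]], max_length: int,
                   bos_id: int, eos_id: int,
                   pad_id: int) -> Iterator[Dict[str, List[int]]]:
    """Greedily splice tokenized documents into constant-length samples."""
    ids: List[int] = []
    labels: List[int] = []
    for doc in tokenized_docs:
        # Wrap the document in separator tokens; an oversize document is
        # truncated here (a simplification chosen for this sketch).
        seq = ([bos_id] + doc + [eos_id])[:max_length]
        if len(ids) + len(seq) > max_length:
            # The open sample cannot take this document: emit it, start anew.
            yield _pad(ids, labels, max_length, pad_id)
            ids, labels = [], []
        ids.extend(seq)
        # Mask the first token of each document so the model is not trained
        # to predict the start of one document from the end of the previous.
        labels.extend([IGNORE_INDEX] + seq[1:])
    if ids:
        yield _pad(ids, labels, max_length, pad_id)


# Hypothetical token ids: 100 = BOS, 101 = EOS, 0 = PAD.
docs = [[1, 2], [3], [4, 5, 6]]
samples = list(pack_sequences(docs, max_length=8, bos_id=100, eos_id=101, pad_id=0))
for sample in samples:
    print(sample["input_ids"], sample["labels"])
```

Here the first two documents share one sample and the third starts a new one, because adding it would exceed the target length. This sketch keeps only one open sample at a time; a larger buffer of candidate sequences would allow tighter packing at the cost of more bookkeeping.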

Related Pages

Implemented By

Implementation:Hpcaitech_ColossalAI_ClosedToConstantLengthSplicedDataset
