Implementation:Hpcaitech ColossalAI ClosedToConstantLengthSplicedDataset

Knowledge Sources	ColossalAI
Domains	NLP, Data_Engineering
Last Updated	2026-02-09 00:00 GMT

Overview

An iterable dataset, provided by Colossal-LLaMA, that splices multiple tokenized sequences into samples of close-to-constant length (at most max_length tokens) for efficient pretraining.

Description

ClosedToConstantLengthSplicedDataset is an IterableDataset that buffers tokenized samples and greedily splices them into packed sequences of up to max_length tokens. Packing short sequences together reduces the padding each batch would otherwise need.
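The greedy splicing idea can be illustrated with a minimal, standalone sketch. It is not the Colossal-LLaMA implementation: the helper name splice_greedily is hypothetical, buffering and shuffling are left out, and truncating over-long samples is an assumption made here for simplicity.

```python
from typing import Dict, Iterable, Iterator, List


def splice_greedily(
    samples: Iterable[Dict[str, List[int]]],
    max_length: int,
) -> Iterator[Dict[str, List[int]]]:
    """Concatenate consecutive samples until the next one would overflow max_length."""
    input_ids: List[int] = []
    labels: List[int] = []
    for sample in samples:
        # Assumption of this sketch: over-long samples are simply truncated.
        ids = sample["input_ids"][:max_length]
        lbs = sample["labels"][:max_length]
        # Emit the current pack if adding this sample would exceed max_length.
        if input_ids and len(input_ids) + len(ids) > max_length:
            yield {"input_ids": input_ids, "labels": labels}
            input_ids, labels = [], []
        input_ids = input_ids + ids
        labels = labels + lbs
    # Emit whatever is left at the end of the data.
    if input_ids:
        yield {"input_ids": input_ids, "labels": labels}


# Five toy samples with lengths 3, 4, 2, 6 and 1.
samples = [{"input_ids": [i] * n, "labels": [i] * n} for i, n in enumerate([3, 4, 2, 6, 1])]
packed = list(splice_greedily(samples, max_length=8))
print([len(p["input_ids"]) for p in packed])  # [7, 8, 1]
```

The toy run shows why the packed length is only close to constant: a pack is closed as soon as the next sample no longer fits, so some packs stay shorter than max_length.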

Usage

Use it as the dataset behind a DataLoader for continual pretraining, after the raw text has been tokenized.

Code Reference

Source Location

Repository: ColossalAI
File: applications/Colossal-LLaMA/colossal_llama/dataset/spliced_and_tokenized_dataset.py
Lines: 188-302

Signature

class ClosedToConstantLengthSplicedDataset(IterableDataset):
    def __init__(
        self,
        dataset: Dataset,
        tokenizer: PreTrainedTokenizer,
        max_length: int = 4096,
        num_packed_sequences: int = 8,
        fetch_sequence_func: Callable = None,
        input_ids_field: str = "input_ids",
        labels_field: str = "labels",
        infinite: bool = False,
        shuffle: bool = True,
        error_strict: bool = False,
    ) -> None:
        """
        Args:
            dataset: Source dataset with tokenized sequences
            tokenizer: Tokenizer for padding/EOS tokens
            max_length: Target constant sequence length
            num_packed_sequences: Buffer size multiplier
            infinite: Whether to repeat indefinitely
            shuffle: Whether to shuffle the buffer
        """

    def __iter__(self) -> Iterator[Dict[str, List[int]]]:
        """Yield packed sequences of constant length."""

Import

from colossal_llama.dataset.spliced_and_tokenized_dataset import (
    ClosedToConstantLengthSplicedDataset,
    supervised_tokenize_pretrain,
)

I/O Contract

Inputs

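Name	Type	Required	Description
dataset	Dataset	Yes	Tokenized Hugging Face Dataset
tokenizer	PreTrainedTokenizer	Yes	For EOS and padding tokens
max_length	int	No	Target sequence length in tokens (default: 4096)
num_packed_sequences	int	No	Buffer size multiplier (default: 8)
input_ids_field	str	No	Key of the token IDs in each sample (default: "input_ids")
labels_field	str	No	Key of the labels in each sample (default: "labels")
infinite	bool	No	Whether to repeat the dataset indefinitely (default: False)
shuffle	bool	No	Whether to shuffle the buffer (default: True)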
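
As a rough illustration of how the two size parameters interact, the sketch below assumes that "buffer multiplier" means the buffer is filled until it holds about max_length * num_packed_sequences tokens. This is an interpretation, not something this page confirms, and the helper buffer_token_budget is hypothetical.

```python
def buffer_token_budget(max_length: int, num_packed_sequences: int) -> int:
    """Approximate number of tokens buffered before splicing (assumed to be the product)."""
    return max_length * num_packed_sequences


# With the default arguments from the signature above.
default_budget = buffer_token_budget(4096, 8)
print(default_budget)  # 32768
```

Under this assumption, a larger num_packed_sequences gives the packer more samples to choose from per refill, at the cost of holding more tokens in memory.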

Outputs

Name	Type	Description
Packed sequences	Dict[str, List[int]]	Yields {"input_ids": [...], "labels": [...]} of close-to-constant length (up to max_length tokens)

Usage Examples

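from datasets import load_from_disk
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

from colossal_llama.dataset.spliced_and_tokenized_dataset import (
    ClosedToConstantLengthSplicedDataset,
)

# Load the tokenizer (placeholder path, replace with your own)
tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")

# Load tokenized dataset
tokenized_dataset = load_from_disk("/data/tokenized/pretrain")

# Create packed dataset
packed_dataset = ClosedToConstantLengthSplicedDataset(
    dataset=tokenized_dataset,
    tokenizer=tokenizer,
    max_length=8192,
    num_packed_sequences=8,
    shuffle=True,
)

# Use in DataLoader. Each item is a dict of Python lists, so a collate_fn
# that pads and converts to tensors may be needed when item lengths differ.
dataloader = DataLoader(packed_dataset, batch_size=4)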
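
Because the yielded items are plain Python lists that may differ slightly in length, batching usually needs a padding step. The following is a minimal sketch of such a step in pure Python. The helper pad_packed_batch is hypothetical and is not part of Colossal-LLaMA. The use of -100 as the label padding value assumes a loss that ignores that index, which is the default of PyTorch's cross-entropy loss.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def pad_packed_batch(
    batch: List[Dict[str, List[int]]],
    max_length: int,
    pad_token_id: int,
) -> Dict[str, List[List[int]]]:
    """Right-pad every item of a batch to exactly max_length tokens."""
    padded: Dict[str, List[List[int]]] = {"input_ids": [], "labels": [], "attention_mask": []}
    for item in batch:
        ids = item["input_ids"][:max_length]
        lbs = item["labels"][:max_length]
        n_pad = max_length - len(ids)
        padded["input_ids"].append(ids + [pad_token_id] * n_pad)
        # Padded label positions are masked out of the loss.
        padded["labels"].append(lbs + [IGNORE_INDEX] * n_pad)
        # 1 marks real tokens, 0 marks padding.
        padded["attention_mask"].append([1] * len(ids) + [0] * n_pad)
    return padded


batch = [
    {"input_ids": [5, 6, 7], "labels": [5, 6, 7]},
    {"input_ids": [8, 9], "labels": [8, 9]},
]
out = pad_packed_batch(batch, max_length=4, pad_token_id=0)
print(out["input_ids"])  # [[5, 6, 7, 0], [8, 9, 0, 0]]
```

To use something like this with a DataLoader, the returned lists would still need to be converted to tensors (for example with torch.tensor) inside a collate_fn.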

Related Pages

Implements Principle

Principle:Hpcaitech_ColossalAI_Sequence_Packing_Dataset
