Implementation:Hpcaitech ColossalAI ClosedToConstantLengthSplicedDataset

From Leeroopedia
Revision as of 15:08, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Hpcaitech_ColossalAI_ClosedToConstantLengthSplicedDataset.md)


Knowledge Sources
Domains: NLP, Data_Engineering
Last Updated: 2026-02-09 00:00 GMT

Overview

An iterable dataset, provided by Colossal-LLaMA, that packs multiple tokenized sequences into constant-length samples for efficient pretraining.

Description

ClosedToConstantLengthSplicedDataset implements a greedy bin-packing iterator: it buffers tokenized samples (the buffer size scales with num_packed_sequences) and splices short sequences together into packed sequences of length max_length, so that little of each training sample is wasted on padding.
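
The greedy packing described above can be pictured with a small first-fit sketch. This is an illustration under stated assumptions, not the Colossal-LLaMA code: `pack_buffer` is a hypothetical helper, it operates on bare token-id lists, and it truncates over-long samples, which the real class may handle differently. Labels would be concatenated in the same order as the input ids.

```python
from typing import List


def pack_buffer(buffer: List[List[int]], max_length: int) -> List[List[int]]:
    """First-fit greedy packing of one buffer of token-id lists (hypothetical helper)."""
    # Assumption: samples longer than max_length are simply truncated.
    remaining = [list(seq[:max_length]) for seq in buffer]
    packs: List[List[int]] = []
    while remaining:
        # Start a new pack with the first unused sample.
        current = remaining[0]
        leftover: List[List[int]] = []
        for seq in remaining[1:]:
            if len(current) + len(seq) <= max_length:
                current.extend(seq)   # still fits: splice it on
            else:
                leftover.append(seq)  # does not fit: retry in a later pack
        packs.append(current)
        remaining = leftover
    return packs


packs = pack_buffer([[1] * 5, [2] * 4, [3] * 3, [4] * 2], max_length=8)
print([len(p) for p in packs])
```

A larger buffer gives the packer more candidates to fill each remaining gap, which is the usual reason to expose a knob such as num_packed_sequences.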

Usage

Use it as the dataset behind the DataLoader for continual pretraining, after the raw text data has been tokenized.

Code Reference

Source Location

  • Repository: ColossalAI
  • File: applications/Colossal-LLaMA/colossal_llama/dataset/spliced_and_tokenized_dataset.py
  • Lines: 188-302

Signature

class ClosedToConstantLengthSplicedDataset(IterableDataset):
    def __init__(
        self,
        dataset: Dataset,
        tokenizer: PreTrainedTokenizer,
        max_length: int = 4096,
        num_packed_sequences: int = 8,
        fetch_sequence_func: Callable = None,
        input_ids_field: str = "input_ids",
        labels_field: str = "labels",
        infinite: bool = False,
        shuffle: bool = True,
        error_strict: bool = False,
    ) -> None:
        """
        Args:
            dataset: Source dataset with tokenized sequences
            tokenizer: Tokenizer for padding/EOS tokens
            max_length: Target constant sequence length
            num_packed_sequences: Buffer size multiplier
            infinite: Whether to repeat indefinitely
            shuffle: Whether to shuffle the buffer
        """

    def __iter__(self) -> Iterator[Dict[str, List[int]]]:
        """Yield packed sequences of constant length."""

Import

from colossal_llama.dataset.spliced_and_tokenized_dataset import (
    ClosedToConstantLengthSplicedDataset,
    supervised_tokenize_pretrain,
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| dataset | Dataset | Yes | Tokenized HuggingFace Dataset |
| tokenizer | PreTrainedTokenizer | Yes | For EOS and padding tokens |
| max_length | int | No | Target sequence length (default: 4096) |
| num_packed_sequences | int | No | Buffer multiplier (default: 8) |

Outputs

| Name | Type | Description |
|------|------|-------------|
| Packed sequences | Dict[str, List[int]] | Yields {"input_ids": [...], "labels": [...]} of constant length |
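
The output contract above can be checked on a handful of yielded samples before training starts. The sketch below is a hypothetical helper (`check_packed_samples` is not part of Colossal-LLaMA). It assumes only what the table states: each sample is a dict holding an input-id list and a label list, and neither should exceed max_length.

```python
from typing import Dict, Iterable, List


def check_packed_samples(
    samples: Iterable[Dict[str, List[int]]],
    max_length: int,
    input_ids_field: str = "input_ids",
    labels_field: str = "labels",
) -> int:
    """Validate packed samples; return how many were checked (hypothetical helper)."""
    count = 0
    for sample in samples:
        ids, labels = sample[input_ids_field], sample[labels_field]
        if len(ids) != len(labels):
            raise ValueError(f"sample {count}: {len(ids)} input ids vs {len(labels)} labels")
        if len(ids) > max_length:
            raise ValueError(f"sample {count}: length {len(ids)} exceeds {max_length}")
        count += 1
    return count


fake_samples = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4], "labels": [4]},
]
print(check_packed_samples(fake_samples, max_length=4))
```

When the dataset is created with infinite=True, bound the check with something like `itertools.islice(packed_dataset, 16)`, since iterating it directly would never finish.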

Usage Examples

from colossal_llama.dataset.spliced_and_tokenized_dataset import (
    ClosedToConstantLengthSplicedDataset,
)
from datasets import load_from_disk
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

# Load the tokenizer that was used to tokenize the data
# (placeholder path, replace with your own tokenizer directory)
tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")

# Load tokenized dataset
tokenized_dataset = load_from_disk("/data/tokenized/pretrain")

# Create packed dataset
packed_dataset = ClosedToConstantLengthSplicedDataset(
    dataset=tokenized_dataset,
    tokenizer=tokenizer,
    max_length=8192,
    num_packed_sequences=8,
    shuffle=True,
)

# Use in DataLoader
dataloader = DataLoader(packed_dataset, batch_size=4)
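
Because each yielded sample is a dict of Python lists, PyTorch's default collation will not stack a batch into `[batch_size, max_length]` tensors, so a custom `collate_fn` is often useful. The following is a minimal sketch, not the collator shipped with Colossal-LLaMA: `pad_collate` is a hypothetical name, the `attention_mask` key is an added convenience, and `-100` is assumed as the label padding value because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Dict, List


def pad_collate(
    batch: List[Dict[str, List[int]]],
    max_length: int,
    pad_token_id: int,
    ignore_index: int = -100,
) -> Dict[str, List[List[int]]]:
    """Right-pad every sample in the batch to max_length (hypothetical helper)."""
    out: Dict[str, List[List[int]]] = {"input_ids": [], "labels": [], "attention_mask": []}
    for sample in batch:
        ids = sample["input_ids"][:max_length]
        labels = sample["labels"][:max_length]
        out["input_ids"].append(ids + [pad_token_id] * (max_length - len(ids)))
        # Padded label positions are masked out of the loss.
        out["labels"].append(labels + [ignore_index] * (max_length - len(labels)))
        out["attention_mask"].append([1] * len(ids) + [0] * (max_length - len(ids)))
    return out


demo = pad_collate(
    [{"input_ids": [5, 6, 7], "labels": [5, 6, 7]}, {"input_ids": [8], "labels": [8]}],
    max_length=4,
    pad_token_id=0,
)
print(demo["input_ids"])
```

One way to wire it in is `DataLoader(packed_dataset, batch_size=4, collate_fn=functools.partial(pad_collate, max_length=8192, pad_token_id=tokenizer.pad_token_id))`, converting each returned list with `torch.tensor(...)` before the forward pass. This assumes the tokenizer defines a pad token, which is not the case for every LLaMA-style tokenizer.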

Related Pages

Implements Principle
