Implementation:Hpcaitech ColossalAI ClosedToConstantLengthSplicedDataset
| Knowledge Sources | |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
Iterable dataset that packs multiple tokenized sequences to constant length for efficient pretraining, provided by Colossal-LLaMA.
Description
ClosedToConstantLengthSplicedDataset is an iterable dataset that greedily packs short tokenized sequences into constant-length sequences. It fills a buffer with tokenized samples, then splices them together so that as little of each packed sequence as possible is spent on padding.
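The packing step described above can be sketched in a few lines. This is a minimal illustration, not the Colossal-LLaMA implementation: `greedy_pack` and `pad_row` are hypothetical helpers, and the truncation of over-long samples, the right-padding with `pad_token_id`, and the `-100` label padding are all assumptions made for the example.

```python
from typing import Dict, Iterable, Iterator, List

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_row(
    ids: List[int], labels: List[int], max_length: int, pad_token_id: int
) -> Dict[str, List[int]]:
    """Right-pad one packed row to exactly max_length (assumed behaviour)."""
    pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_token_id] * pad,
        "labels": labels + [IGNORE_INDEX] * pad,
    }


def greedy_pack(
    samples: Iterable[Dict[str, List[int]]],
    max_length: int,
    pad_token_id: int = 0,
) -> Iterator[Dict[str, List[int]]]:
    """Hypothetical helper: extend the current row until the next sample would overflow it."""
    ids: List[int] = []
    labels: List[int] = []
    for sample in samples:
        # Truncate samples longer than a whole row (an assumption of this sketch).
        seq_ids = sample["input_ids"][:max_length]
        seq_labels = sample["labels"][:max_length]
        if ids and len(ids) + len(seq_ids) > max_length:
            # The next sample does not fit: emit the current row and start a new one.
            yield pad_row(ids, labels, max_length, pad_token_id)
            ids, labels = [], []
        ids.extend(seq_ids)
        labels.extend(seq_labels)
    if ids:
        # Emit the final, partially filled row.
        yield pad_row(ids, labels, max_length, pad_token_id)


# Three toy samples of 3, 2 and 4 tokens packed into rows of 6.
samples = [{"input_ids": [1] * n, "labels": [1] * n} for n in (3, 2, 4)]
rows = list(greedy_pack(samples, max_length=6))
```

In this toy run the 9 real tokens occupy 2 rows (12 slots, 3 of them padding). Padding each sample to 6 separately would take 3 rows (18 slots, 9 of them padding).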
Usage
Use it as the dataset behind a continual-pretraining DataLoader, once the raw text data has been tokenized.
Code Reference
Source Location
- Repository: ColossalAI
- File: applications/Colossal-LLaMA/colossal_llama/dataset/spliced_and_tokenized_dataset.py
- Lines: 188-302
Signature
class ClosedToConstantLengthSplicedDataset(IterableDataset):
    def __init__(
        self,
        dataset: Dataset,
        tokenizer: PreTrainedTokenizer,
        max_length: int = 4096,
        num_packed_sequences: int = 8,
        fetch_sequence_func: Callable = None,
        input_ids_field: str = "input_ids",
        labels_field: str = "labels",
        infinite: bool = False,
        shuffle: bool = True,
        error_strict: bool = False,
    ) -> None:
        """
        Args:
            dataset: Source dataset with tokenized sequences
            tokenizer: Tokenizer for padding/EOS tokens
            max_length: Target constant sequence length
            num_packed_sequences: Buffer size multiplier
            infinite: Whether to repeat indefinitely
            shuffle: Whether to shuffle the buffer
        """

    def __iter__(self) -> Iterator[Dict[str, List[int]]]:
        """Yield packed sequences of constant length."""
Import
from colossal_llama.dataset.spliced_and_tokenized_dataset import (
    ClosedToConstantLengthSplicedDataset,
    supervised_tokenize_pretrain,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| dataset | Dataset | Yes | Tokenized HuggingFace Dataset |
| tokenizer | PreTrainedTokenizer | Yes | For EOS and padding tokens |
| max_length | int | No | Target sequence length (default: 4096) |
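| num_packed_sequences | int | No | Buffer size multiplier (default: 8) |
| infinite | bool | No | Whether to repeat the dataset indefinitely (default: False) |
| shuffle | bool | No | Whether to shuffle the buffer (default: True) |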
Outputs
| Name | Type | Description |
|---|---|---|
| Packed sequences | Dict[str, List[int]] | Yields {"input_ids": [...], "labels": [...]} of constant length |
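
One way to sanity-check this contract is to pull a bounded number of items from the iterator and verify their shape. The sketch below is only an illustration: `check_packed` is a hypothetical helper, and a plain generator (`fake_stream`) stands in for the real dataset so the snippet runs without ColossalAI installed. It checks the weaker property that no item exceeds `max_length` and that both fields have equal length. `itertools.islice` keeps the check finite even when the dataset was built with `infinite=True`.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def check_packed(
    stream: Iterable[Dict[str, List[int]]], max_length: int, limit: int = 100
) -> int:
    """Hypothetical helper: validate up to `limit` items and return how many were checked."""
    checked = 0
    for item in islice(stream, limit):
        ids, labels = item["input_ids"], item["labels"]
        if len(ids) != len(labels):
            raise ValueError(f"item {checked}: input_ids and labels differ in length")
        if len(ids) > max_length:
            raise ValueError(f"item {checked}: {len(ids)} tokens exceeds max_length={max_length}")
        checked += 1
    return checked


def fake_stream() -> Iterator[Dict[str, List[int]]]:
    """Stand-in for an infinite packed dataset (assumption for this example)."""
    while True:
        yield {"input_ids": [7] * 8, "labels": [7] * 8}


n_checked = check_packed(fake_stream(), max_length=8, limit=5)
```

With the real class, `fake_stream()` would be replaced by the dataset instance itself, since an IterableDataset can be iterated directly.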
Usage Examples
from datasets import load_from_disk
from torch.utils.data import DataLoader
from transformers import AutoTokenizer

from colossal_llama.dataset.spliced_and_tokenized_dataset import (
    ClosedToConstantLengthSplicedDataset,
)

# Load the tokenizer used to tokenize the data (placeholder path; adjust to your setup)
tokenizer = AutoTokenizer.from_pretrained("/path/to/tokenizer")
# Load tokenized dataset
tokenized_dataset = load_from_disk("/data/tokenized/pretrain")
# Create packed dataset
packed_dataset = ClosedToConstantLengthSplicedDataset(
    dataset=tokenized_dataset,
    tokenizer=tokenizer,
    max_length=8192,
    num_packed_sequences=8,
    shuffle=True,
)
# Use in DataLoader
dataloader = DataLoader(packed_dataset, batch_size=4)
Related Pages
Implements Principle