Implementation: LLMBook-zh.github.io PTDataset

From Leeroopedia


Knowledge Sources
Domains NLP, Data_Engineering
Last Updated 2026-02-08 00:00 GMT

Overview

A concrete tool for pre-training data preparation, provided by the LLMBook repository: it tokenizes text, concatenates the resulting token sequences, and chunks them into fixed-length blocks.

Description

The PTDataset class loads a text dataset, tokenizes it using a HuggingFace tokenizer, concatenates all token sequences into a continuous stream, and chunks them into fixed-length blocks. It produces (input_ids, labels) pairs where labels are copies of input_ids (the model internally handles the shift for next-token prediction).
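
The concatenate-and-chunk step can be sketched as follows (a simplified, standalone version of the group_texts logic, not the repository's exact code):

```python
from itertools import chain

def group_texts(token_lists, block_size):
    """Concatenate token ID sequences into one stream, then split the
    stream into fixed-length blocks; the trailing remainder shorter
    than block_size is dropped."""
    stream = list(chain.from_iterable(token_lists))
    usable = (len(stream) // block_size) * block_size
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]

# Two tokenized documents become blocks that can span document boundaries:
group_texts([[1, 2, 3], [4, 5, 6, 7]], block_size=3)  # → [[1, 2, 3], [4, 5, 6]]
```

Because blocks are cut from one continuous stream, a block may contain tokens from more than one source document; this is standard practice for causal LM pre-training.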

Usage

Import this class when setting up pre-training data for a causal language model using HuggingFace Trainer. Pass it as train_dataset to the Trainer.

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/6.3 预训练数据类.py (6.3 pre-training dataset class)
  • Lines: 6-52

Signature

class PTDataset:
    def __init__(self, args, tokenizer):
        """
        Args:
            args: Training arguments with 'dataset' (file path) and 'model_max_length' attributes.
            tokenizer: HuggingFace AutoTokenizer instance.
        """

    def __len__(self) -> int:
        """Returns number of training examples (blocks)."""

    def __getitem__(self, i) -> dict:
        """Returns dict(input_ids=Tensor, labels=Tensor) for block i."""

    def encode(self, examples: dict) -> dict:
        """Tokenizes text examples using the tokenizer."""

    def group_texts(self, examples: list) -> list:
        """Concatenates all token sequences and chunks into blocks of block_size."""

    def process(self) -> list:
        """Loads dataset, tokenizes, and returns list of token ID tensors."""
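
To make the (input_ids, labels) contract concrete, here is a dependency-free stand-in (the real PTDataset returns torch Tensors and builds its blocks from a tokenized dataset; plain Python lists are used here as an assumption to keep the sketch self-contained):

```python
class TinyPTDataset:
    """Minimal stand-in mirroring PTDataset's __len__/__getitem__ contract."""

    def __init__(self, blocks):
        # blocks: pre-chunked token ID sequences, each of length block_size
        self.input_ids = blocks
        # Labels are copies of input_ids; the model shifts them internally
        # for next-token prediction.
        self.labels = [list(b) for b in blocks]

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, i):
        return dict(input_ids=self.input_ids[i], labels=self.labels[i])

ds = TinyPTDataset([[1, 2, 3], [4, 5, 6]])
len(ds)          # → 2
ds[0]["labels"]  # → [1, 2, 3]
```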

Import

from dataset.pt_dataset import PTDataset

I/O Contract

Inputs

  • args (Arguments, required): Training arguments with dataset path and model_max_length
  • tokenizer (AutoTokenizer, required): HuggingFace tokenizer for encoding text

Outputs

  • __getitem__ return value (dict): dict(input_ids=Tensor[block_size], labels=Tensor[block_size])
  • input_ids (list[Tensor]): List of token ID tensors, each of length block_size
  • labels (list[Tensor]): Copy of input_ids (causal LM target)

Usage Examples

from transformers import AutoTokenizer
from dataset.pt_dataset import PTDataset

# Setup
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

class Args:
    dataset = "path/to/train.txt"
    model_max_length = 2048

args = Args()

# Create dataset
dataset = PTDataset(args, tokenizer)
print(f"Number of training blocks: {len(dataset)}")

# Access a single example
example = dataset[0]
print(f"input_ids shape: {example['input_ids'].shape}")
print(f"labels shape: {example['labels'].shape}")
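
As a rough sanity check on dataset size (num_blocks is a hypothetical helper, not part of the repository), the number of training blocks is the integer quotient of total token count by block size:

```python
def num_blocks(total_tokens, block_size):
    # PTDataset drops the trailing remainder shorter than block_size,
    # so the block count is a plain floor division.
    return total_tokens // block_size

num_blocks(10_000_000, 2048)  # → 4882
```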

Related Pages

Implements Principle

Requires Environment
