Implementation: LLMBook-zh PTDataset
| Metadata | Value |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete tool from the LLMBook repository for pre-training data preparation: it tokenizes text, concatenates the results, and chunks them into fixed-length sequences.
Description
The PTDataset class loads a text dataset, tokenizes it using a HuggingFace tokenizer, concatenates all token sequences into a continuous stream, and chunks them into fixed-length blocks. It produces (input_ids, labels) pairs where labels are copies of input_ids (the model internally handles the shift for next-token prediction).
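The concatenate-and-chunk step can be sketched in isolation. The function below is an illustrative stand-in, not the repository's actual code; it assumes tokenized examples are plain lists of token IDs and drops any trailing remainder shorter than the block size:

```python
def group_texts(token_lists, block_size):
    """Concatenate token ID lists into one stream, then split the
    stream into fixed-length blocks; a trailing remainder shorter
    than block_size is discarded."""
    stream = [tok for seq in token_lists for tok in seq]   # flatten
    total = (len(stream) // block_size) * block_size       # drop remainder
    return [stream[i:i + block_size] for i in range(0, total, block_size)]

# Example: two tokenized documents chunked into blocks of 4
blocks = group_texts([[1, 2, 3], [4, 5, 6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that document boundaries disappear in this scheme: a block may span the end of one document and the start of the next, which is standard for causal LM pre-training.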
Usage
Import this class when setting up pre-training data for a causal language model using HuggingFace Trainer. Pass it as train_dataset to the Trainer.
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/6.3 预训练数据类.py
- Lines: 6-52
Signature
```python
class PTDataset:
    def __init__(self, args, tokenizer):
        """
        Args:
            args: Training arguments with 'dataset' (file path) and
                'model_max_length' attributes.
            tokenizer: HuggingFace AutoTokenizer instance.
        """

    def __len__(self) -> int:
        """Returns the number of training examples (blocks)."""

    def __getitem__(self, i) -> dict:
        """Returns dict(input_ids=Tensor, labels=Tensor) for block i."""

    def encode(self, examples: dict) -> dict:
        """Tokenizes text examples using the tokenizer."""

    def group_texts(self, examples: list) -> list:
        """Concatenates all token sequences and chunks them into blocks of block_size."""

    def process(self) -> list:
        """Loads the dataset, tokenizes it, and returns a list of token ID tensors."""
```
Import
```python
from dataset.pt_dataset import PTDataset
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | Arguments | Yes | Training arguments with dataset path and model_max_length |
| tokenizer | AutoTokenizer | Yes | HuggingFace tokenizer for encoding text |
Outputs
| Name | Type | Description |
|---|---|---|
| __getitem__ returns | dict | dict(input_ids=Tensor[block_size], labels=Tensor[block_size]) |
| input_ids | list[Tensor] | Processed dataset: list of token ID tensors, each of length block_size |
| labels | list[Tensor] | Copy of input_ids (causal LM targets; the model shifts them internally) |
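Because labels are an unshifted copy of input_ids, the next-token alignment happens inside the model: HuggingFace causal LM implementations score the logits at positions 0..n-2 against the labels at positions 1..n-1 before computing the loss. A small list-based sketch of that alignment (illustrative, not library code):

```python
def shift_for_next_token(input_ids, labels):
    """Pair each position's input with the label one step ahead,
    mirroring the shift a causal LM applies before computing loss."""
    contexts = input_ids[:-1]  # predictions come from positions 0..n-2
    targets = labels[1:]       # and are scored against tokens 1..n-1
    return list(zip(contexts, targets))

block = [10, 11, 12, 13]
pairs = shift_for_next_token(block, list(block))  # labels == input_ids
print(pairs)  # [(10, 11), (11, 12), (12, 13)]
```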
Usage Examples
```python
from transformers import AutoTokenizer

from dataset.pt_dataset import PTDataset

# Setup
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

class Args:
    dataset = "path/to/train.txt"
    model_max_length = 2048

args = Args()

# Create dataset
dataset = PTDataset(args, tokenizer)
print(f"Number of training blocks: {len(dataset)}")

# Access a single example
example = dataset[0]
print(f"input_ids shape: {example['input_ids'].shape}")
print(f"labels shape: {example['labels'].shape}")
```