Implementation: LLMBook-zh PTDataset
| Metadata | Value |
|---|---|
| Domains | NLP, Data_Engineering |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete tool from the LLMBook repository for pre-training data preparation: it tokenizes text, concatenates the results, and chunks them into fixed-length sequences.
Description
The PTDataset class loads a text dataset, tokenizes it using a HuggingFace tokenizer, concatenates all token sequences into a continuous stream, and chunks them into fixed-length blocks. It produces (input_ids, labels) pairs where labels are copies of input_ids (the model internally handles the shift for next-token prediction).
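The concatenate-and-chunk step can be sketched in isolation. The function below is an illustrative stand-in, not the repository's actual code; it assumes tokenized examples are plain lists of token IDs and drops any trailing remainder shorter than the block size:

```python
def group_texts(token_lists, block_size):
    """Concatenate token ID lists into one stream, then split the
    stream into fixed-length blocks; a trailing remainder shorter
    than block_size is discarded."""
    stream = [tok for seq in token_lists for tok in seq]   # flatten
    total = (len(stream) // block_size) * block_size       # drop remainder
    return [stream[i:i + block_size] for i in range(0, total, block_size)]

# Example: two tokenized documents chunked into blocks of 4
blocks = group_texts([[1, 2, 3], [4, 5, 6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that document boundaries disappear in this scheme: a block may span the end of one document and the start of the next, which is standard for causal LM pre-training.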
Usage
Import this class when setting up pre-training data for a causal language model using HuggingFace Trainer. Pass it as train_dataset to the Trainer.
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/6.3 预训练数据类.py
- Lines: 6-52
Signature
```python
class PTDataset:
    def __init__(self, args, tokenizer):
        """
        Args:
            args: Training arguments with 'dataset' (file path) and
                'model_max_length' attributes.
            tokenizer: HuggingFace AutoTokenizer instance.
        """

    def __len__(self) -> int:
        """Returns the number of training examples (blocks)."""

    def __getitem__(self, i) -> dict:
        """Returns dict(input_ids=Tensor, labels=Tensor) for block i."""

    def encode(self, examples: dict) -> dict:
        """Tokenizes text examples using the tokenizer."""

    def group_texts(self, examples: list) -> list:
        """Concatenates all token sequences and chunks them into blocks of block_size."""

    def process(self) -> list:
        """Loads the dataset, tokenizes it, and returns a list of token ID tensors."""
```
Import
```python
from dataset.pt_dataset import PTDataset
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | Arguments | Yes | Training arguments with dataset path and model_max_length |
| tokenizer | AutoTokenizer | Yes | HuggingFace tokenizer for encoding text |
Outputs
| Name | Type | Description |
|---|---|---|
| __getitem__ returns | dict | dict(input_ids=Tensor[block_size], labels=Tensor[block_size]) |
| input_ids | list[Tensor] | Processed dataset: list of token ID tensors, each of length block_size |
| labels | list[Tensor] | Copy of input_ids (causal LM targets; the model shifts them internally) |
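Because labels are an unshifted copy of input_ids, the next-token alignment happens inside the model: HuggingFace causal LM implementations score the logits at positions 0..n-2 against the labels at positions 1..n-1 before computing the loss. A small list-based sketch of that alignment (illustrative, not library code):

```python
def shift_for_next_token(input_ids, labels):
    """Pair each position's input with the label one step ahead,
    mirroring the shift a causal LM applies before computing loss."""
    contexts = input_ids[:-1]  # predictions come from positions 0..n-2
    targets = labels[1:]       # and are scored against tokens 1..n-1
    return list(zip(contexts, targets))

block = [10, 11, 12, 13]
pairs = shift_for_next_token(block, list(block))  # labels == input_ids
print(pairs)  # [(10, 11), (11, 12), (12, 13)]
```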
Usage Examples
```python
from transformers import AutoTokenizer

from dataset.pt_dataset import PTDataset

# Setup
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

class Args:
    dataset = "path/to/train.txt"
    model_max_length = 2048

args = Args()

# Create dataset
dataset = PTDataset(args, tokenizer)
print(f"Number of training blocks: {len(dataset)}")

# Access a single example
example = dataset[0]
print(f"input_ids shape: {example['input_ids'].shape}")
print(f"labels shape: {example['labels'].shape}")
```