Implementation:LLMBook_zh_LLMBook_zh_github_io_Trainer_Train_Pretraining
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Training |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete tool for executing the pre-training loop with the HuggingFace Trainer API provided by the Transformers library.
Description
Trainer.train() from HuggingFace Transformers executes the complete training loop for pre-training. In this repository, it is configured with PTDataset, AutoModelForCausalLM, and a custom Arguments dataclass extending TrainingArguments. Key settings include bf16 mixed precision, save_only_model mode, and a 2048-token context window.
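The repository's custom Arguments dataclass is not reproduced in this doc. As a minimal sketch of what such a class carries, here is a plain standard-library dataclass stand-in; in the repository the real class extends transformers.TrainingArguments, and the field names below (other than bf16 and save_only_model, which the description mentions) are illustrative assumptions, not the repository's actual definitions:

```python
from dataclasses import dataclass

# Minimal stand-in for the repository's Arguments(TrainingArguments) class.
# A plain dataclass is used so the sketch runs without transformers installed;
# train_files and max_seq_length are hypothetical field names.
@dataclass
class Arguments:
    model_name_or_path: str = "gpt2"   # base model to continue pre-training
    train_files: str = "data/*.jsonl"  # raw corpus location (hypothetical)
    max_seq_length: int = 2048         # 2048-token context window
    bf16: bool = True                  # bf16 mixed precision
    save_only_model: bool = True       # checkpoints omit optimizer/scheduler state

args = Arguments(model_name_or_path="my-model")
```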
This is a Wrapper Doc — it documents how the LLMBook repository uses the external HuggingFace Trainer API.
Usage
Use this after loading the model and preparing the dataset. Pass model, args, tokenizer, and train_dataset to Trainer, then call train().
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/6.2 预训练实践.py
- Lines: 43-67
Signature
# Repository-specific usage pattern
trainer = Trainer(
    model=model,                               # AutoModelForCausalLM
    args=args,                                 # Arguments(TrainingArguments)
    tokenizer=tokenizer,                       # AutoTokenizer
    train_dataset=PTDataset(args, tokenizer),  # Pre-training dataset
)
trainer.train()  # Returns TrainOutput
Import
from transformers import Trainer, TrainingArguments, HfArgumentParser
External Reference
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | PreTrainedModel | Yes | The causal LM model to train |
| args | TrainingArguments | Yes | Training hyperparameters (bf16, lr, epochs, etc.) |
| tokenizer | AutoTokenizer | Yes | Tokenizer for the model |
| train_dataset | PTDataset | Yes | Pre-training dataset producing (input_ids, labels) |
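PTDataset itself lives in the repository; the contract in the table above (each item yields matching input_ids and labels for causal-LM pre-training) can be sketched with a plain-Python stand-in that packs a token stream into fixed-length blocks. The class name and chunking strategy here are assumptions for illustration, not the repository's implementation:

```python
# Sketch of a pre-training dataset: concatenate token ids, split them into
# fixed-length blocks, and return labels as a copy of input_ids (the model
# applies the causal shift internally). Illustrative stand-in only.
class ToyPTDataset:
    def __init__(self, token_ids, block_size=2048):
        self.block_size = block_size
        n_blocks = len(token_ids) // block_size  # drop the ragged tail
        self.blocks = [
            token_ids[i * block_size:(i + 1) * block_size]
            for i in range(n_blocks)
        ]

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        ids = self.blocks[idx]
        return {"input_ids": ids, "labels": list(ids)}

ds = ToyPTDataset(list(range(10)), block_size=4)
```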
Outputs
| Name | Type | Description |
|---|---|---|
| train() returns | TrainOutput | Contains global_step, training_loss, metrics |
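TrainOutput is a NamedTuple defined in transformers.trainer_utils. A standard-library stand-in showing the shape of the return value (the field names match the real class; the values below are made up):

```python
from typing import Dict, NamedTuple

# Stand-in mirroring the fields of transformers.trainer_utils.TrainOutput,
# so the return contract can be shown without importing transformers.
class TrainOutput(NamedTuple):
    global_step: int
    training_loss: float
    metrics: Dict[str, float]

# Hypothetical result of trainer.train()
result = TrainOutput(global_step=1000, training_loss=2.31,
                     metrics={"train_runtime": 3600.0})
```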
Usage Examples
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, HfArgumentParser
from dataset.pt_dataset import PTDataset  # repository-local module
# Parse command-line arguments into the repository's custom Arguments
# dataclass (a subclass of TrainingArguments, defined elsewhere in the script)
parser = HfArgumentParser(Arguments)
args = parser.parse_args_into_dataclasses()[0]
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    attn_implementation="flash_attention_2",
)
# Train
trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=PTDataset(args, tokenizer),
)
trainer.train()
Related Pages
Implements Principle
Requires Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_PyTorch_CUDA_GPU_Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_HuggingFace_Transformers_Stack