
Implementation: LLMBook-zh/LLMBook-zh.github.io Trainer.train() Pre-training

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Training
Last Updated 2026-02-08 00:00 GMT

Overview

A concrete tool for executing the pre-training loop with the HuggingFace Trainer API provided by the Transformers library.

Description

Trainer.train() from HuggingFace Transformers executes the complete training loop for pre-training. In this repository, it is configured with PTDataset, AutoModelForCausalLM, and a custom Arguments dataclass extending TrainingArguments. Key settings include bf16 mixed precision, save_only_model mode, and a 2048-token context window.
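The custom Arguments dataclass itself is defined in the repository script and is not reproduced on this page. A minimal, self-contained sketch of what such a dataclass might look like follows; the real class extends TrainingArguments, and every field name except the settings listed above (bf16, save_only_model, the 2048-token window) is an assumption:

```python
from dataclasses import dataclass

# Hypothetical sketch of the repository's Arguments dataclass. The real class
# inherits from transformers.TrainingArguments; a plain dataclass is used here
# so the sketch runs without transformers installed.
@dataclass
class Arguments:
    model_name_or_path: str = "gpt2"   # base model to continue pre-training
    train_file: str = "corpus.jsonl"   # raw pre-training corpus (assumed name)
    max_seq_length: int = 2048         # the 2048-token context window
    bf16: bool = True                  # bf16 mixed precision
    save_only_model: bool = True       # checkpoints contain weights only

args = Arguments()
```

Because the real class inherits from TrainingArguments, it also carries all of the standard training hyperparameters (learning rate, batch size, number of epochs, output directory, and so on).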

This is a Wrapper Doc — it documents how the LLMBook repository uses the external HuggingFace Trainer API.

Usage

Use this after loading the model and preparing the dataset. Pass model, args, tokenizer, and train_dataset to Trainer, then call train().

Code Reference

Source Location

  • Repository: LLMBook-zh
  • File: code/6.2 预训练实践.py
  • Lines: 43-67

Signature

# Repository-specific usage pattern
trainer = Trainer(
    model=model,                          # AutoModelForCausalLM
    args=args,                            # Arguments(TrainingArguments)
    tokenizer=tokenizer,                  # AutoTokenizer
    train_dataset=PTDataset(args, tokenizer),  # Pre-training dataset
)
trainer.train()  # Returns TrainOutput

Import

from transformers import Trainer, TrainingArguments, HfArgumentParser

External Reference

I/O Contract

Inputs

Name           Type               Required  Description
model          PreTrainedModel    Yes       The causal LM model to train
args           TrainingArguments  Yes       Training hyperparameters (bf16, lr, epochs, etc.)
tokenizer      AutoTokenizer      Yes       Tokenizer for the model
train_dataset  PTDataset          Yes       Pre-training dataset producing (input_ids, labels)
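PTDataset is defined elsewhere in the repository. A minimal sketch of a dataset satisfying this contract, chunking a token stream into fixed-length windows with labels equal to input_ids as is standard for causal LM pre-training, might look like the following (this is an illustration, not the repository's implementation):

```python
# Hypothetical sketch of a pre-training dataset with the (input_ids, labels)
# contract above. Real code would return torch tensors and subclass
# torch.utils.data.Dataset; plain lists keep the sketch self-contained.
class PTDataset:
    def __init__(self, token_ids, max_length=2048):
        # Drop the trailing partial chunk so every example is full length.
        n_chunks = len(token_ids) // max_length
        self.chunks = [
            token_ids[i * max_length:(i + 1) * max_length]
            for i in range(n_chunks)
        ]

    def __len__(self):
        return len(self.chunks)

    def __getitem__(self, idx):
        ids = self.chunks[idx]
        # For causal LM pre-training, labels are a copy of input_ids;
        # the model shifts them internally when computing the loss.
        return {"input_ids": ids, "labels": list(ids)}

ds = PTDataset(list(range(5000)), max_length=2048)
```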

Outputs

Name             Type         Description
train() returns  TrainOutput  Contains global_step, training_loss, metrics
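transformers.trainer_utils.TrainOutput is a NamedTuple with exactly these three fields, so the result of train() can be read by attribute. A stand-in is constructed below for illustration (the numbers are invented):

```python
from collections import namedtuple

# Stand-in mirroring transformers.trainer_utils.TrainOutput, which is a
# NamedTuple with the fields global_step, training_loss, and metrics.
TrainOutput = namedtuple("TrainOutput", ["global_step", "training_loss", "metrics"])

# In real code this would be: result = trainer.train()
result = TrainOutput(global_step=1000, training_loss=2.73,
                     metrics={"train_runtime": 3600.0})

print(result.global_step, result.training_loss)
```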

Usage Examples

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, HfArgumentParser
from dataset.pt_dataset import PTDataset

# Parse arguments from the command line; Arguments is the repository's
# custom dataclass extending TrainingArguments (defined in the same script)
parser = HfArgumentParser(Arguments)
args = parser.parse_args_into_dataclasses()[0]

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    attn_implementation="flash_attention_2"
)

# Train
trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=PTDataset(args, tokenizer),
)
trainer.train()
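HfArgumentParser generates one command-line flag per dataclass field, which is how parse_args_into_dataclasses() in the example above turns CLI flags into an Arguments instance. A rough stdlib approximation of that behavior, shown only to illustrate the idea (this is not the transformers implementation):

```python
import argparse
from dataclasses import dataclass, fields

# Minimal three-field stand-in for the repository's Arguments dataclass.
@dataclass
class Arguments:
    model_name_or_path: str = "gpt2"
    max_seq_length: int = 2048
    bf16: bool = True

# Build one CLI flag per dataclass field, mimicking HfArgumentParser.
parser = argparse.ArgumentParser()
for f in fields(Arguments):
    if f.type is bool:
        parser.add_argument(f"--{f.name}", default=f.default,
                            type=lambda s: s.lower() == "true")
    else:
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)

ns = parser.parse_args(["--model_name_or_path", "my-model",
                        "--max_seq_length", "1024"])
args = Arguments(**vars(ns))
```

Unset flags fall back to the dataclass defaults, so a script like the one above can be launched with only the flags that differ from the defaults.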

Related Pages

Implements Principle

Requires Environment

Uses Heuristic
