Implementation:LLMBook_zh_LLMBook_zh_github_io_Trainer_Train_Pretraining
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Training |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
A concrete tool for executing the pre-training loop with the HuggingFace Trainer API provided by the Transformers library.
Description
Trainer.train() from HuggingFace Transformers executes the complete training loop for pre-training. In this repository, it is configured with PTDataset, AutoModelForCausalLM, and a custom Arguments dataclass extending TrainingArguments. Key settings include bf16 mixed precision, save_only_model mode, and a 2048-token context window.
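The repository's custom Arguments dataclass is not reproduced in this doc. As a minimal sketch of what such a class carries, here is a plain standard-library dataclass stand-in; in the repository the real class extends transformers.TrainingArguments, and the field names below (other than bf16 and save_only_model, which the description mentions) are illustrative assumptions, not the repository's actual definitions:

```python
from dataclasses import dataclass

# Minimal stand-in for the repository's Arguments(TrainingArguments) class.
# A plain dataclass is used so the sketch runs without transformers installed;
# train_files and max_seq_length are hypothetical field names.
@dataclass
class Arguments:
    model_name_or_path: str = "gpt2"   # base model to continue pre-training
    train_files: str = "data/*.jsonl"  # raw corpus location (hypothetical)
    max_seq_length: int = 2048         # 2048-token context window
    bf16: bool = True                  # bf16 mixed precision
    save_only_model: bool = True       # checkpoints omit optimizer/scheduler state

args = Arguments(model_name_or_path="my-model")
```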
This is a Wrapper Doc — it documents how the LLMBook repository uses the external HuggingFace Trainer API.
Usage
Use this after loading the model and preparing the dataset. Pass model, args, tokenizer, and train_dataset to Trainer, then call train().
Code Reference
Source Location
- Repository: LLMBook-zh
- File: code/6.2 预训练实践.py
- Lines: 43-67
Signature
# Repository-specific usage pattern
trainer = Trainer(
    model=model,                               # AutoModelForCausalLM
    args=args,                                 # Arguments(TrainingArguments)
    tokenizer=tokenizer,                       # AutoTokenizer
    train_dataset=PTDataset(args, tokenizer),  # Pre-training dataset
)
trainer.train()  # Returns TrainOutput
Import
from transformers import Trainer, TrainingArguments, HfArgumentParser
External Reference
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | PreTrainedModel | Yes | The causal LM model to train |
| args | TrainingArguments | Yes | Training hyperparameters (bf16, lr, epochs, etc.) |
| tokenizer | AutoTokenizer | Yes | Tokenizer for the model |
| train_dataset | PTDataset | Yes | Pre-training dataset producing (input_ids, labels) |
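PTDataset itself lives in the repository; the contract in the table above (each item yields matching input_ids and labels for causal-LM pre-training) can be sketched with a plain-Python stand-in that packs a token stream into fixed-length blocks. The class name and chunking strategy here are assumptions for illustration, not the repository's implementation:

```python
# Sketch of a pre-training dataset: concatenate token ids, split them into
# fixed-length blocks, and return labels as a copy of input_ids (the model
# applies the causal shift internally). Illustrative stand-in only.
class ToyPTDataset:
    def __init__(self, token_ids, block_size=2048):
        self.block_size = block_size
        n_blocks = len(token_ids) // block_size  # drop the ragged tail
        self.blocks = [
            token_ids[i * block_size:(i + 1) * block_size]
            for i in range(n_blocks)
        ]

    def __len__(self):
        return len(self.blocks)

    def __getitem__(self, idx):
        ids = self.blocks[idx]
        return {"input_ids": ids, "labels": list(ids)}

ds = ToyPTDataset(list(range(10)), block_size=4)
```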
Outputs
| Name | Type | Description |
|---|---|---|
| train() returns | TrainOutput | Contains global_step, training_loss, metrics |
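TrainOutput is a NamedTuple defined in transformers.trainer_utils. A standard-library stand-in showing the shape of the return value (the field names match the real class; the values below are made up):

```python
from typing import Dict, NamedTuple

# Stand-in mirroring the fields of transformers.trainer_utils.TrainOutput,
# so the return contract can be shown without importing transformers.
class TrainOutput(NamedTuple):
    global_step: int
    training_loss: float
    metrics: Dict[str, float]

# Hypothetical result of trainer.train()
result = TrainOutput(global_step=1000, training_loss=2.31,
                     metrics={"train_runtime": 3600.0})
```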
Usage Examples
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, HfArgumentParser
from dataset.pt_dataset import PTDataset  # repository-local module
# Parse command-line arguments into the repository's custom Arguments
# dataclass (a subclass of TrainingArguments, defined elsewhere in the script)
parser = HfArgumentParser(Arguments)
args = parser.parse_args_into_dataclasses()[0]
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
model = AutoModelForCausalLM.from_pretrained(
    args.model_name_or_path,
    attn_implementation="flash_attention_2",
)
# Train
trainer = Trainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=PTDataset(args, tokenizer),
)
trainer.train()
Related Pages
Implements Principle
Requires Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_PyTorch_CUDA_GPU_Environment
- Environment:LLMBook_zh_LLMBook_zh_github_io_HuggingFace_Transformers_Stack