Workflow:LLMBook zh LLMBook zh github io LLM Pretraining
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Pre_Training, Deep_Learning |
| Last Updated | 2026-02-08 04:30 GMT |
Overview
End-to-end pre-training workflow for causal language models using the HuggingFace Trainer with LLaMA-style architecture, from dataset preparation through next-token prediction training to checkpoint saving.
Description
This workflow covers the complete process of pre-training a large language model from scratch (or continuing pre-training from an existing checkpoint). It begins with preparing a text dataset by tokenizing, concatenating, and chunking sequences into fixed-length blocks. The model uses a LLaMA-style decoder-only architecture with RMSNorm, Rotary Position Embeddings (RoPE), and a causal language modeling head. Training uses the standard next-token prediction objective with cross-entropy loss, leveraging the HuggingFace Trainer for distributed training orchestration, mixed-precision (BF16), and FlashAttention acceleration.
Usage
Execute this workflow when you need to pre-train a language model on a large text corpus, either from random initialization or by continuing training from an existing base model checkpoint. This is the foundational training stage that produces a base model capable of text generation, which can subsequently be fine-tuned for specific tasks.
Execution Steps
Step 1: Dataset Preparation
Load raw text data and convert it into training-ready token sequences. The text is tokenized using the model's tokenizer, then all token sequences are concatenated end-to-end. The concatenated stream is chunked into fixed-length blocks matching the model's context window size (e.g., 2048 tokens). Labels are set equal to input IDs since pre-training uses a self-supervised next-token prediction objective.
Key considerations:
- All documents are concatenated regardless of length, so short sequences are packed together instead of being padded, which avoids wasting compute on padding tokens
- The chunking operation discards any trailing tokens that do not fill a complete block
- Each training example contains both input_ids and labels (identical for causal LM training)
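
The concatenate-and-chunk step can be sketched in plain Python as follows. This is a minimal illustration, not the workflow's actual preprocessing code: the function name `group_texts` and the toy token IDs are assumptions, and a real pipeline would apply the same logic to tokenizer output in batches.

```python
from itertools import chain

def group_texts(tokenized_docs, block_size):
    """Concatenate token-ID lists and split them into fixed-length blocks."""
    # Join every document's token IDs into one continuous stream.
    stream = list(chain.from_iterable(tokenized_docs))
    # Drop the trailing remainder that cannot fill a complete block.
    usable = (len(stream) // block_size) * block_size
    blocks = [stream[i:i + block_size] for i in range(0, usable, block_size)]
    # Labels are a copy of input_ids; the one-position shift happens in the loss.
    return [{"input_ids": b, "labels": list(b)} for b in blocks]

# Toy token IDs (hypothetical, not real tokenizer output).
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
examples = group_texts(docs, block_size=4)
for ex in examples:
    print(ex)
```

With ten tokens and a block size of 4, two blocks are produced and the last two tokens are discarded, matching the trailing-token behavior described above.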
Step 2: Model Initialization
Load or initialize the causal language model with the target architecture. The model follows the LLaMA design pattern: a stack of decoder layers each containing RMSNorm pre-normalization, multi-head self-attention with RoPE, residual connections, and a SwiGLU-based feedforward network. A linear language modeling head maps final hidden states to vocabulary logits. FlashAttention is enabled for memory-efficient attention computation.
Key considerations:
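- RMSNorm replaces LayerNorm: it skips mean-centering and rescales activations by their root mean square only, which is cheaper to compute and is used here for stable training at scale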
- RoPE provides relative position encoding without learned position embeddings
- FlashAttention reduces attention memory usage from O(n²) to O(n) in the sequence length n by never materializing the full attention matrix; the amount of computation remains O(n²)
- The model can be loaded from an existing checkpoint or initialized randomly
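
To make two of these components concrete, here is a small pure-Python sketch of RMSNorm and RoPE on single vectors. It is an illustration under stated assumptions, not the model's implementation: the function names are hypothetical, the `eps` and `base` defaults are common choices rather than values taken from this workflow, and the rotation pairs adjacent dimensions, which is one of several equivalent layouts used in practice.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Divide x by its root mean square, then apply a learned per-dimension gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def apply_rope(vec, position, base=10000.0):
    """Rotate each adjacent (even, odd) pair of vec by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])

# Hypothetical query and key vectors.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# Both pairs of positions are 2 apart, so the attention scores should agree.
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))
score_b = dot(apply_rope(q, 12), apply_rope(k, 10))
print(score_a, score_b)
```

The two scores agree up to floating-point error because the dot product of two rotated vectors depends only on the difference between their positions, which is why RoPE encodes relative position without a learned embedding table.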
Step 3: Loss Computation
Compute the causal language modeling loss using next-token prediction. For each position t in the sequence, the model predicts the token at position t+1. To align predictions with targets, the logits at the final position are dropped and the labels are shifted left by one, so that predictions at positions 1 through T-1 are compared against ground truth tokens at positions 2 through T. Cross-entropy loss is computed over the flattened token predictions.
Key considerations:
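- The shift operation aligns each prediction with the token that follows it; the causal attention mask is what prevents the model from attending to future tokens
- Every predicted position contributes equally to the averaged loss, except positions whose label is set to the ignore index (-100 by convention), which are excluded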
- The loss function operates on flattened tensors for computational efficiency
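
The shift-and-compare logic can be sketched without any framework. This is a minimal illustration, assuming one sequence whose logits are given as a list of per-position score lists; the function name `causal_lm_loss` is hypothetical, and a real implementation would run the same computation on batched tensors.

```python
import math

def causal_lm_loss(logits, labels, ignore_index=-100):
    """Mean cross-entropy of the logits at position t against the token at t+1."""
    shift_logits = logits[:-1]  # predictions made at positions 1..T-1
    shift_labels = labels[1:]   # ground-truth tokens at positions 2..T
    total, count = 0.0, 0
    for row, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue  # masked positions do not contribute to the loss
        # Numerically stable log-sum-exp: subtract the max before exponentiating.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]  # negative log-probability of the target
        count += 1
    return total / count if count else float("nan")

# Uniform logits over a 4-token vocabulary (toy values).
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
labels = [0, 1, 2]
loss = causal_lm_loss(logits, labels)
print(loss)
```

With uniform logits every token has probability 1/4, so the loss equals the natural log of the vocabulary size, which is a useful sanity check for a randomly initialized model.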
Step 4: Training Loop
Configure and execute the training loop using HuggingFace Trainer. Training arguments specify learning rate scheduling, batch size, gradient accumulation, BF16 mixed precision, and checkpointing strategy. The Trainer handles distributed training across multiple GPUs, gradient accumulation, and optimizer state management automatically. Loss scaling is only needed when FP16 is used; BF16 has the same exponent range as FP32 and generally does not require it.
Key considerations:
- BF16 mixed precision reduces memory usage and accelerates training on modern GPUs
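- Save only model parameters (not optimizer state) to reduce checkpoint size, for example via the `save_only_model` training argument; the trade-off is that such checkpoints cannot be used to resume training exactly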
- The final checkpoint is saved explicitly after training completes
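
Two quantities are worth working out before launching a run: how many tokens each optimizer step consumes, and how the learning rate evolves. The sketch below is illustrative only. All numeric values are hypothetical, and the linear-warmup-plus-cosine-decay schedule is an assumption, since this workflow does not state which scheduler it uses.

```python
import math

def tokens_per_optimizer_step(per_device_batch, grad_accum_steps, num_gpus, block_size):
    """Tokens consumed by one optimizer update across all devices."""
    return per_device_batch * grad_accum_steps * num_gpus * block_size

def lr_at_step(step, max_lr, warmup_steps, total_steps):
    """Linear warmup to max_lr, then cosine decay to zero (assumed schedule)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical configuration: 8 GPUs, 4 sequences per GPU, 8 accumulation steps.
tokens = tokens_per_optimizer_step(
    per_device_batch=4, grad_accum_steps=8, num_gpus=8, block_size=2048
)
print(tokens)

for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, max_lr=3e-4, warmup_steps=100, total_steps=1000))
```

Gradient accumulation multiplies the effective batch size without increasing per-device memory, so the same token budget per step can be reached on fewer or smaller GPUs by raising the accumulation count.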
Step 5: Checkpoint Saving
Save the trained model weights and training state to disk. The model checkpoint contains all learned parameters and can be loaded for inference or further fine-tuning. The tokenizer is saved alongside the model to ensure consistent tokenization when the model is reused.
Key considerations:
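- Both the model and the trainer state (step count and log history) are saved; exact training resumption additionally requires the optimizer and scheduler state, which is not available if only model parameters were checkpointed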
- The output directory structure follows HuggingFace conventions for easy model sharing
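
The final save can be wrapped in a small helper like the one below. This is a sketch, assuming `trainer` is a HuggingFace `Trainer` and `tokenizer` is a HuggingFace tokenizer; the helper name `save_final_checkpoint` is hypothetical, and the function itself only relies on the three methods it calls.

```python
def save_final_checkpoint(trainer, tokenizer, output_dir):
    """Write model weights, tokenizer files, and trainer state to disk."""
    # Model weights and config, in the layout expected by from_pretrained.
    trainer.save_model(output_dir)
    # Vocabulary and tokenizer config, so the same tokenization is used on reload.
    tokenizer.save_pretrained(output_dir)
    # trainer_state.json (step count, log history) in the trainer's own output directory.
    trainer.save_state()
```

Saving the tokenizer into the same directory as the weights keeps the checkpoint self-describing, so a later `from_pretrained` call on that directory can load both.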