Workflow:LLMBook zh LLMBook zh github io LLM Pretraining

From Leeroopedia


Knowledge Sources
Domains: LLMs, Pre_Training, Deep_Learning
Last Updated: 2026-02-08 04:30 GMT

Overview

End-to-end pre-training workflow for causal language models using the HuggingFace Trainer with LLaMA-style architecture, from dataset preparation through next-token prediction training to checkpoint saving.

Description

This workflow covers the complete process of pre-training a large language model from scratch (or continuing pre-training from an existing checkpoint). It begins with preparing a text dataset by tokenizing, concatenating, and chunking sequences into fixed-length blocks. The model uses a LLaMA-style decoder-only architecture with RMSNorm, Rotary Position Embeddings (RoPE), and a causal language modeling head. Training uses the standard next-token prediction objective with cross-entropy loss. The HuggingFace Trainer orchestrates distributed training and BF16 mixed precision, while FlashAttention is enabled on the model to accelerate attention.

Usage

Execute this workflow when you need to pre-train a language model on a large text corpus, either from random initialization or by continuing training from an existing base model checkpoint. This is the foundational training stage that produces a base model capable of text generation, which can subsequently be fine-tuned for specific tasks.

Execution Steps

Step 1: Dataset Preparation

Load raw text data and convert it into training-ready token sequences. The text is tokenized using the model's tokenizer, then all token sequences are concatenated end-to-end. The concatenated stream is chunked into fixed-length blocks matching the model's context window size (e.g., 2048 tokens). Labels are set equal to input IDs since pre-training uses a self-supervised next-token prediction objective.

Key considerations:

  • Sequences shorter than the block size are concatenated to avoid wasting compute on padding
  • The chunking operation discards any trailing tokens that do not fill a complete block
  • Each training example contains both input_ids and labels (identical for causal LM training)
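
The concatenate-and-chunk step can be sketched in plain Python. This is a minimal illustration, not the workflow's actual dataset code: the function name `group_texts` and the toy token IDs are assumptions, and the documents are assumed to be already tokenized into lists of integers.

```python
from itertools import chain


def group_texts(tokenized_docs, block_size):
    """Concatenate tokenized documents and cut them into fixed-length blocks.

    tokenized_docs: iterable of lists of token IDs (hypothetical tokenizer output).
    block_size: number of tokens per training example (e.g., 2048).
    """
    # Join all documents end-to-end into one token stream.
    stream = list(chain.from_iterable(tokenized_docs))
    # Keep only as many tokens as fill complete blocks; the tail is discarded.
    usable = (len(stream) // block_size) * block_size
    examples = []
    for start in range(0, usable, block_size):
        block = stream[start:start + block_size]
        # Labels are a copy of the inputs; the one-position shift is applied
        # later, when the loss is computed.
        examples.append({"input_ids": block, "labels": list(block)})
    return examples


# Toy usage: 10 tokens with block_size=4 give two blocks; tokens 9 and 10 are dropped.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
blocks = group_texts(docs, block_size=4)
print([b["input_ids"] for b in blocks])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a block may span a document boundary, which is the price of avoiding padding.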

Step 2: Model Initialization

Load or initialize the causal language model with the target architecture. The model follows the LLaMA design pattern: a stack of decoder layers each containing RMSNorm pre-normalization, multi-head self-attention with RoPE, residual connections, and a SwiGLU-based feedforward network. A linear language modeling head maps final hidden states to vocabulary logits. FlashAttention is enabled for memory-efficient attention computation.

Key considerations:

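  • RMSNorm replaces LayerNorm: it rescales by the root mean square only (no mean-centering, no bias), which is cheaper to compute, and applying it before each sub-layer (pre-normalization) helps keep training stable at scale
  • RoPE encodes relative position by rotating query and key vectors, without learned position embeddings
  • FlashAttention reduces attention memory from O(n squared) to O(n) in sequence length n by never materializing the full attention matrix; the amount of computation remains O(n squared)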
  • The model can be loaded from an existing checkpoint or initialized randomly
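
Two of the components above are small enough to sketch in plain Python. This is an illustrative sketch under stated assumptions, not the model's actual implementation: the function names are hypothetical, vectors are plain lists, and RoPE uses the interleaved pairing (dimensions 0-1, 2-3, ...) with a base of 10000; real implementations may lay out the rotated pairs differently.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the inverse of its root mean square, then by a learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [w * v * scale for v, w in zip(x, weight)]


def rope(x, position, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles (len(x) must be even)."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        # Pair index k = i / 2 uses frequency base ** (-2k / d) = base ** (-i / d).
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# The query-key score depends only on the position offset (here 2 in both cases).
q, k = [1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.1]
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
print(abs(score_a - score_b) < 1e-9)  # True
```

The final check is what "relative position encoding" means in practice: shifting both positions by the same amount leaves the attention score unchanged.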

Step 3: Loss Computation

Compute the causal language modeling loss using next-token prediction. For each position t in the sequence, the model predicts the token at position t+1. Logits are shifted by one position so that predictions at positions 1 through T-1 are compared against ground truth tokens at positions 2 through T. Cross-entropy loss is computed over the flattened token predictions.

Key considerations:

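  • The shift operation aligns the prediction at each position with the next token as its target; causality itself (left-to-right visibility) is enforced by the attention mask
  • Every predicted position contributes equally to the loss, which is averaged over the T-1 shifted positions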
  • The loss function operates on flattened tensors for computational efficiency
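
The shift-and-compare step can be written out for a single sequence. This is a dependency-free sketch for illustration, not the library's implementation: the function name `causal_lm_loss` is hypothetical, logits are given as a list of T rows with one score per vocabulary entry, and the `ignore_index` default of -100 mirrors the PyTorch `CrossEntropyLoss` default.

```python
import math


def causal_lm_loss(logits, labels, ignore_index=-100):
    """Mean next-token cross-entropy for one sequence.

    logits: T rows, each a list of V unnormalized scores.
    labels: T token IDs (identical to the input IDs in pre-training).
    """
    # Drop the last prediction and the first label so that the prediction
    # at position t is scored against the token at position t + 1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]
    total, count = 0.0, 0
    for row, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue  # masked positions do not contribute
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]
        count += 1
    if count == 0:
        raise ValueError("no target positions to score")
    return total / count


# With uniform logits over 4 tokens, the loss equals log(4) regardless of labels.
uniform = [[0.0] * 4 for _ in range(3)]
loss = causal_lm_loss(uniform, [1, 2, 3])
print(abs(loss - math.log(4)) < 1e-12)  # True
```

A batched tensor implementation does the same thing after flattening the batch and sequence dimensions into one.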

Step 4: Training Loop

Configure and execute the training loop using HuggingFace Trainer. Training arguments specify learning rate scheduling, batch size, gradient accumulation, BF16 mixed precision, and checkpointing strategy. The Trainer handles distributed training across multiple GPUs, gradient accumulation, and optimizer state management automatically. Loss scaling, which FP16 training requires, is not needed with BF16.

Key considerations:

  • BF16 mixed precision reduces memory usage and accelerates training on modern GPUs
  • Saving only model parameters (not optimizer state) reduces checkpoint size, at the cost that such checkpoints cannot restore the optimizer when training is resumed
  • The final checkpoint is saved explicitly after training completes
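
The settings described in this step could be collected as keyword arguments for `transformers.TrainingArguments`. The sketch below builds them as a plain dictionary so it runs without the library installed. The keys are real `TrainingArguments` parameters (`save_only_model` requires a reasonably recent release), but every value, the helper name `build_training_kwargs`, and the GPU count are illustrative assumptions, not the workflow's actual configuration.

```python
def build_training_kwargs(output_dir, block_size=2048, num_gpus=8):
    """Return illustrative TrainingArguments kwargs and the tokens per optimizer step."""
    kwargs = {
        "output_dir": output_dir,
        "per_device_train_batch_size": 4,   # sequences per GPU per forward pass (assumed)
        "gradient_accumulation_steps": 8,   # forward passes per optimizer step (assumed)
        "learning_rate": 3e-4,              # assumed value
        "lr_scheduler_type": "cosine",      # assumed schedule
        "warmup_steps": 2000,               # assumed value
        "bf16": True,                       # BF16 mixed precision
        "save_strategy": "steps",
        "save_steps": 1000,                 # assumed checkpoint interval
        "save_only_model": True,            # skip optimizer and scheduler state
        "logging_steps": 10,
    }
    sequences_per_step = (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_gpus
    )
    return kwargs, sequences_per_step * block_size


kwargs, tokens_per_step = build_training_kwargs("output/pretrain")
print(tokens_per_step)  # 4 * 8 * 8 * 2048 = 524288
```

With the library installed, the dictionary could be unpacked as `TrainingArguments(**kwargs)` and passed to `Trainer` together with the model and the chunked dataset. The tokens-per-step figure is the quantity that per-device batch size and gradient accumulation jointly control.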

Step 5: Checkpoint Saving

Save the trained model weights and training state to disk. The model checkpoint contains all learned parameters and can be loaded for inference or further fine-tuning. The tokenizer is saved alongside the model to ensure consistent tokenization when the model is reused.

Key considerations:

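  • Both the model and the trainer state are saved; the trainer state records progress such as the global step and log history, but fully resuming training also requires the optimizer and scheduler state, which is absent when only model parameters are saved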
  • The output directory structure follows HuggingFace conventions for easy model sharing
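
The final save could look like the helper below. It is a sketch, assuming `trainer` behaves like a `transformers.Trainer` (with `save_model` and `save_state`) and `tokenizer` like a HuggingFace tokenizer (with `save_pretrained`); the helper name and the `checkpoint-final` subdirectory are hypothetical.

```python
import os


def save_final_checkpoint(trainer, tokenizer, output_dir):
    """Write the final weights, tokenizer files, and trainer state.

    trainer: object exposing save_model(path) and save_state(), as Trainer does.
    tokenizer: object exposing save_pretrained(path).
    """
    # Hypothetical directory name for the final checkpoint.
    final_dir = os.path.join(output_dir, "checkpoint-final")
    # Model weights and config.
    trainer.save_model(final_dir)
    # Tokenizer files next to the weights, so the directory loads on its own.
    tokenizer.save_pretrained(final_dir)
    # Trainer state; Trainer writes this under its configured output directory.
    trainer.save_state()
    return final_dir
```

Saving the tokenizer into the same directory as the weights means both can later be loaded from that one path with `from_pretrained`.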

GitHub URL

Workflow Repository