Principle:Hiyouga LLaMA Factory Causal Language Model Pretraining

From Leeroopedia


Knowledge Sources
Domains: Natural Language Processing, Deep Learning, Language Modeling
Last Updated: 2026-02-06 19:00 GMT

Overview

A foundational training paradigm that teaches a language model to predict the next token in a sequence by training on large corpora of unstructured text using the causal (autoregressive) language modeling objective.

Description

Causal Language Model Pretraining (often abbreviated as PT) is the fundamental training stage that produces a base language model. The model learns to predict each token conditioned on all preceding tokens, building a rich internal representation of language structure, world knowledge, and reasoning patterns. This is the first stage in the typical LLM development pipeline, preceding supervised fine-tuning and alignment.

In the context of continued pretraining (also called domain-adaptive pretraining), this technique is used to further train an existing pretrained model on domain-specific data (e.g., medical texts, legal documents, code) to improve its knowledge and performance in that domain without starting from scratch.

Key characteristics include:

  • Autoregressive objective: Every token in the sequence contributes to the loss, unlike SFT where prompt tokens are masked.
  • No label masking: The standard language modeling data collator keeps a label for every real token, so the loss is applied uniformly across all non-padding positions.
  • Perplexity evaluation: Model quality is measured by perplexity, the exponential of the average per-token cross-entropy loss.
  • Data efficiency: Continued pretraining enables efficient domain adaptation by leveraging the general knowledge already encoded in the base model.
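
The difference between the pretraining and SFT label conventions can be sketched in a few lines. This is an illustration only: `build_labels` and `count_supervised` are hypothetical helpers, the token IDs are made up, and `IGNORE_INDEX = -100` is assumed to be the label value that the cross-entropy loss skips.

```python
IGNORE_INDEX = -100  # assumed label value that the cross-entropy loss skips


def build_labels(prompt_ids, response_ids, stage):
    """Return (input_ids, labels) for one example; stage is "pt" or "sft"."""
    input_ids = list(prompt_ids) + list(response_ids)
    if stage == "pt":
        # Pretraining: the whole text is the target, so labels mirror the inputs.
        labels = list(input_ids)
    elif stage == "sft":
        # SFT: prompt tokens are masked out, only the response is supervised.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        raise ValueError(f"unknown stage: {stage}")
    return input_ids, labels


def count_supervised(labels):
    """Number of positions whose label is not masked."""
    return sum(1 for label in labels if label != IGNORE_INDEX)


prompt = [11, 12, 13]   # made-up token IDs
response = [21, 22]
_, pt_labels = build_labels(prompt, response, "pt")
_, sft_labels = build_labels(prompt, response, "sft")
print(pt_labels)   # [11, 12, 13, 21, 22]
print(sft_labels)  # [-100, -100, -100, 21, 22]
```

In pretraining there is no real prompt/response split; the two lists are concatenated here only so the same example can be shown under both conventions.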

Usage

Use causal language model pretraining when you want to:

  • Perform continued pretraining to inject domain-specific knowledge into an existing model.
  • Adapt a model to a new language or specialized vocabulary.
  • Train a model from scratch on a custom corpus (less common with LLaMA-Factory).
  • Improve a model's fluency and factual knowledge on domain-specific text before fine-tuning.

Pretraining is appropriate when you have large volumes of unlabeled text and want to build or enhance the model's general language capabilities before applying task-specific training.

Theoretical Basis

Autoregressive Language Modeling

The causal language model maximizes the log-likelihood of each token given its left context, which is equivalent to minimizing the negative log-likelihood loss:

$$\mathcal{L}_{PT}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, x_2, \ldots, x_{t-1})$$

where $x_1, x_2, \ldots, x_T$ is a sequence of tokens and $\theta$ denotes the model parameters. Unlike SFT, all tokens contribute to the loss since there is no distinction between prompt and response: the entire text is treated as the training target.
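
As a rough numerical illustration of this objective, the sketch below computes the summed negative log-likelihood of a token sequence from per-position logits. The names `softmax` and `causal_lm_loss` are hypothetical, the logits are toy values rather than model outputs, and the first token is skipped because it has no left context in this setup (an assumption of the sketch). Training frameworks typically average this quantity over tokens rather than summing it.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_lm_loss(logits, token_ids):
    """Summed negative log-likelihood of each token given its left context.

    logits[t] is the score vector over the vocabulary produced at position t
    (0-based); it is used to predict token_ids[t + 1].
    """
    nll = 0.0
    for t in range(len(token_ids) - 1):
        probs = softmax(logits[t])
        nll -= math.log(probs[token_ids[t + 1]])
    return nll


tokens = [0, 2, 1, 1]

# A model with no preference: every next token has probability 1/3.
flat_logits = [[0.0, 0.0, 0.0] for _ in tokens]
flat_loss = causal_lm_loss(flat_logits, tokens)

# A model that strongly favours the correct next token at each position.
sharp_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
sharp_loss = causal_lm_loss(sharp_logits, tokens)

print(flat_loss, sharp_loss)
```

The uninformed model pays log(3) per predicted token, while the model that concentrates probability on the correct next token has a much smaller loss.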

Causal Attention Mask

The autoregressive property is enforced through a causal (lower-triangular) attention mask in the transformer's self-attention mechanism. For a sequence of length T, the attention mask M is:

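$$M_{ij} = \begin{cases} 0 & \text{if } j \le i \\ -\infty & \text{if } j > i \end{cases}$$

The mask is added to the attention scores before the softmax, so entries of $-\infty$ receive zero attention weight. This ensures that the prediction of token $x_t$ can only attend to tokens $x_1, \ldots, x_{t-1}$, preventing information leakage from future positions.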
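
A minimal sketch of the mask in plain Python, assuming the mask is additive (0 where attention is allowed, negative infinity where it is blocked) and applied to the scores before the softmax. `causal_mask` and `masked_softmax` are hypothetical names, and the all-zero scores stand in for real query-key products.

```python
import math


def causal_mask(T):
    """T x T additive mask: 0.0 where j <= i, -inf where j > i."""
    return [[0.0 if j <= i else -math.inf for j in range(T)] for i in range(T)]


def masked_softmax(scores, mask):
    """Row-wise softmax of (scores + mask)."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        shifted = [s + m for s, m in zip(row_scores, row_mask)]
        peak = max(shifted)  # finite, because the diagonal is never masked
        exps = [math.exp(v - peak) for v in shifted]  # exp(-inf) == 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


T = 3
mask = causal_mask(T)
scores = [[0.0] * T for _ in range(T)]  # toy scores: all positions equally similar
attn = masked_softmax(scores, mask)
for row in attn:
    print(row)
```

With equal scores, row i spreads its weight evenly over positions 0..i and assigns exactly zero weight to every later position.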

Perplexity

The primary evaluation metric for language model pretraining is perplexity, defined as:

$$\mathrm{PPL} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})\right)$$

Perplexity measures how "surprised" the model is by the evaluation data. Lower perplexity indicates better prediction of the held-out text. A perplexity of k can be interpreted as the model being, on average, as uncertain as if it were choosing uniformly among k tokens at each position.
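
The uniform-choice interpretation can be checked numerically. The sketch below assumes the per-token log-probabilities (natural log) are already available; `perplexity` is a hypothetical helper, not a library function.

```python
import math


def perplexity(token_log_probs):
    """exp of the average negative log-probability per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


k = 8
# A model that assigns probability 1/k to every observed token.
uniform_log_probs = [math.log(1.0 / k)] * 5
ppl_uniform = perplexity(uniform_log_probs)

# A model that assigns probability 0.9 to every observed token.
ppl_confident = perplexity([math.log(0.9)] * 5)

print(ppl_uniform, ppl_confident)
```

The first value comes out at approximately k, and the second at approximately 1/0.9, which is close to the minimum possible perplexity of 1.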

Data Collation

Pretraining uses a standard DataCollatorForLanguageModeling with masked language modeling disabled (mlm=False). This collator:

  • Pads sequences to equal length within a batch.
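  • Copies the input IDs into the labels; the one-position shift that makes each position predict the next token is applied inside the model's loss computation, not by the collator.
  • Applies no prompt masking: only padding positions are set to IGNORE_INDEX, so every real token contributes to the loss.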
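
The collation step can be approximated in plain Python as follows. This is a simplified stand-in, not the library implementation: `collate_for_causal_lm` is a hypothetical function, right padding is assumed, and `IGNORE_INDEX = -100` is assumed to be the value the loss skips. In this sketch every real token keeps its label; only the padding added during collation is excluded, and the next-token shift is left to the model's loss computation.

```python
IGNORE_INDEX = -100  # assumed label value that the loss skips


def collate_for_causal_lm(sequences, pad_token_id):
    """Pad token-ID lists to equal length and build labels for causal LM.

    Labels are a copy of the inputs; only the padded tail is masked.
    """
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(list(seq) + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        batch["labels"].append(list(seq) + [IGNORE_INDEX] * n_pad)
    return batch


batch = collate_for_causal_lm([[5, 6, 7], [8]], pad_token_id=0)
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
print(batch["labels"])          # [[5, 6, 7], [8, -100, -100]]
```

One simplification to be aware of: this sketch masks by position, whereas a collator that masks by value would also exclude any real token whose ID happens to equal the padding ID.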
