# Principle: Causal LM Loss Computation

| Knowledge Sources | LLMBook-zh (llmbook-zh.github.io) |
|---|---|
| Domains | NLP, Deep_Learning, Optimization |
| Last Updated | 2026-02-08 00:00 GMT |
## Overview

The next-token prediction objective that trains causal language models by computing the cross-entropy between each position's predicted next-token distribution and the token that actually follows.
## Description

Causal LM Loss Computation implements the standard language modeling objective: predict each token given all preceding tokens. The model produces logits (unnormalized scores over the vocabulary) for each position, and the loss is the cross-entropy between the softmax of these logits and the actual next tokens. Internally, the logits and labels are shifted so that position t predicts token t+1.

This is the fundamental training signal for autoregressive language models, including GPT, LLaMA, and other decoder-only architectures.
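The shift convention can be illustrated with a toy sequence (the token IDs below are hypothetical, chosen only to make the alignment visible):

```python
# Toy illustration of the causal shift convention (hypothetical token IDs).
tokens = [10, 11, 12, 13]           # a 4-token sequence

# The model emits one prediction per position; position t is trained to
# predict token t+1, so the final position has no training target.
inputs_per_position = tokens[:-1]   # positions 0..2
targets_per_position = tokens[1:]   # tokens 1..3

pairs = list(zip(inputs_per_position, targets_per_position))
print(pairs)  # [(10, 11), (11, 12), (12, 13)]
```

Note that a sequence of length T yields only T-1 supervised prediction pairs.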
## Usage

Use this principle whenever training an autoregressive language model. The loss computation is embedded in the model's forward pass: when labels are provided, the model automatically computes and returns the loss.
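A minimal sketch of that forward-pass convention, using a hypothetical stand-in class (a real model would run a Transformer; this stub emits uniform logits so the expected loss is exactly log of the vocabulary size):

```python
import math

def _cross_entropy(logit_row, target):
    # -log softmax(logit_row)[target], numerically stabilized.
    m = max(logit_row)
    log_z = m + math.log(sum(math.exp(x - m) for x in logit_row))
    return log_z - logit_row[target]

class DummyCausalLM:
    """Hypothetical stand-in illustrating the convention that a causal LM's
    forward pass returns the loss only when labels are supplied."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def forward(self, input_ids, labels=None):
        # A real model would compute logits from hidden states; this stub
        # emits a uniform (all-zero) logit row at every position.
        logits = [[0.0] * self.vocab_size for _ in input_ids]
        if labels is None:
            return {"logits": logits}
        # Causal shift: position t is scored against labels[t + 1].
        nll = [_cross_entropy(row, tgt)
               for row, tgt in zip(logits[:-1], labels[1:])]
        return {"logits": logits, "loss": sum(nll) / len(nll)}

model = DummyCausalLM(vocab_size=100)
out = model.forward([3, 7, 7, 1], labels=[3, 7, 7, 1])
# Uniform logits give loss = log(vocab_size) ≈ 4.605 for vocab_size = 100.
```

Passing `input_ids` as `labels`, as above, is the usual pattern: the shift inside the loss computation handles the next-token offset, so no manual relabeling is needed.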
## Theoretical Basis

Given a sequence of tokens $x_1, x_2, \dots, x_T$, the causal language modeling loss is the average negative log-likelihood of each token under the model's prediction from its prefix:

$$
\mathcal{L}(\theta) = -\frac{1}{T-1} \sum_{t=1}^{T-1} \log p_\theta(x_{t+1} \mid x_1, \dots, x_t)
$$
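For concreteness, writing the average negative log-likelihood out for a three-token sequence ($T = 3$) gives two prediction terms, one per supervised position:

```latex
\mathcal{L}(\theta) = -\frac{1}{2}\Big[ \log p_\theta(x_2 \mid x_1)
                                      + \log p_\theta(x_3 \mid x_1, x_2) \Big]
```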
Implementation steps:
- Pass input through the Transformer to get hidden states.
- Project hidden states to vocabulary logits via a linear head.
- Shift logits and labels: `logits[:-1]` predicts `labels[1:]`.
- Compute cross-entropy loss on the flattened tensors.
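The projection, shift, and cross-entropy steps above can be sketched in plain Python (a minimal single-sequence sketch with no batching or padding mask; `causal_lm_loss`, the identity head `W`, and the hidden-state values are all hypothetical):

```python
import math

def log_softmax_at(logit_row, target):
    # log softmax(logit_row)[target], numerically stabilized.
    m = max(logit_row)
    log_z = m + math.log(sum(math.exp(x - m) for x in logit_row))
    return logit_row[target] - log_z

def causal_lm_loss(hidden_states, lm_head_weight, labels):
    """Projection + shift + mean cross-entropy for one unbatched sequence.

    hidden_states: [seq_len][hidden_dim] outputs of the Transformer.
    lm_head_weight: [vocab_size][hidden_dim] linear head.
    labels: [seq_len] token IDs.
    """
    # Project each hidden state to vocabulary logits via the linear head.
    logits = [[sum(h * w for h, w in zip(row, weight_row))
               for weight_row in lm_head_weight]
              for row in hidden_states]
    # Shift: logits[:-1] is scored against labels[1:], then average the
    # per-position negative log-likelihoods (mean cross-entropy).
    nll = [-log_softmax_at(row, tgt)
           for row, tgt in zip(logits[:-1], labels[1:])]
    return sum(nll) / len(nll)

# Identity head over a 3-token vocabulary; hidden states strongly favor
# the true next token at each position, so the loss is near zero.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden = [[0.0, 9.0, 0.0],   # position 0 -> should predict token 1
          [0.0, 0.0, 9.0],   # position 1 -> should predict token 2
          [0.0, 0.0, 0.0]]   # final position: no target after the shift
loss = causal_lm_loss(hidden, W, labels=[0, 1, 2])
```

Framework implementations do the same thing on batched tensors, flattening the shifted logits to `(batch * (seq_len - 1), vocab_size)` before the cross-entropy call.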