# Principle: Causal LM Loss Computation

| Knowledge Sources | LLMBook-zh (llmbook-zh.github.io) |
|---|---|
| Domains | NLP, Deep_Learning, Optimization |
| Last Updated | 2026-02-08 00:00 GMT |
## Overview

The next-token prediction objective that trains causal language models by computing the cross-entropy between each position's predicted next-token distribution and the token that actually follows.
## Description

Causal LM Loss Computation implements the standard language modeling objective: predict each token given all preceding tokens. The model produces logits (unnormalized scores over the vocabulary) for each position, and the loss is the cross-entropy between the softmax of these logits and the actual next tokens. Internally, the logits and labels are shifted so that position t predicts token t+1.

This is the fundamental training signal for autoregressive language models, including GPT, LLaMA, and other decoder-only architectures.
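The shift convention can be illustrated with a toy sequence (the token IDs below are hypothetical, chosen only to make the alignment visible):

```python
# Toy illustration of the causal shift convention (hypothetical token IDs).
tokens = [10, 11, 12, 13]           # a 4-token sequence

# The model emits one prediction per position; position t is trained to
# predict token t+1, so the final position has no training target.
inputs_per_position = tokens[:-1]   # positions 0..2
targets_per_position = tokens[1:]   # tokens 1..3

pairs = list(zip(inputs_per_position, targets_per_position))
print(pairs)  # [(10, 11), (11, 12), (12, 13)]
```

Note that a sequence of length T yields only T-1 supervised prediction pairs.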
## Usage

Use this principle whenever training an autoregressive language model. The loss computation is embedded in the model's forward pass: when labels are provided, the model automatically computes and returns the loss.
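A minimal sketch of that forward-pass convention, using a hypothetical stand-in class (a real model would run a Transformer; this stub emits uniform logits so the expected loss is exactly log of the vocabulary size):

```python
import math

def _cross_entropy(logit_row, target):
    # -log softmax(logit_row)[target], numerically stabilized.
    m = max(logit_row)
    log_z = m + math.log(sum(math.exp(x - m) for x in logit_row))
    return log_z - logit_row[target]

class DummyCausalLM:
    """Hypothetical stand-in illustrating the convention that a causal LM's
    forward pass returns the loss only when labels are supplied."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def forward(self, input_ids, labels=None):
        # A real model would compute logits from hidden states; this stub
        # emits a uniform (all-zero) logit row at every position.
        logits = [[0.0] * self.vocab_size for _ in input_ids]
        if labels is None:
            return {"logits": logits}
        # Causal shift: position t is scored against labels[t + 1].
        nll = [_cross_entropy(row, tgt)
               for row, tgt in zip(logits[:-1], labels[1:])]
        return {"logits": logits, "loss": sum(nll) / len(nll)}

model = DummyCausalLM(vocab_size=100)
out = model.forward([3, 7, 7, 1], labels=[3, 7, 7, 1])
# Uniform logits give loss = log(vocab_size) ≈ 4.605 for vocab_size = 100.
```

Passing `input_ids` as `labels`, as above, is the usual pattern: the shift inside the loss computation handles the next-token offset, so no manual relabeling is needed.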
## Theoretical Basis

Given a sequence of tokens $x_1, x_2, \dots, x_T$, the causal language modeling loss is the average negative log-likelihood of each token under the model's prediction from its prefix:

$$
\mathcal{L}(\theta) = -\frac{1}{T-1} \sum_{t=1}^{T-1} \log p_\theta(x_{t+1} \mid x_1, \dots, x_t)
$$
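For concreteness, writing the average negative log-likelihood out for a three-token sequence ($T = 3$) gives two prediction terms, one per supervised position:

```latex
\mathcal{L}(\theta) = -\frac{1}{2}\Big[ \log p_\theta(x_2 \mid x_1)
                                      + \log p_\theta(x_3 \mid x_1, x_2) \Big]
```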
Implementation steps:
- Pass input through the Transformer to get hidden states.
- Project hidden states to vocabulary logits via a linear head.
- Shift logits and labels: `logits[:-1]` predicts `labels[1:]`.
- Compute cross-entropy loss on the flattened tensors.
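The projection, shift, and cross-entropy steps above can be sketched in plain Python (a minimal single-sequence sketch with no batching or padding mask; `causal_lm_loss`, the identity head `W`, and the hidden-state values are all hypothetical):

```python
import math

def log_softmax_at(logit_row, target):
    # log softmax(logit_row)[target], numerically stabilized.
    m = max(logit_row)
    log_z = m + math.log(sum(math.exp(x - m) for x in logit_row))
    return logit_row[target] - log_z

def causal_lm_loss(hidden_states, lm_head_weight, labels):
    """Projection + shift + mean cross-entropy for one unbatched sequence.

    hidden_states: [seq_len][hidden_dim] outputs of the Transformer.
    lm_head_weight: [vocab_size][hidden_dim] linear head.
    labels: [seq_len] token IDs.
    """
    # Project each hidden state to vocabulary logits via the linear head.
    logits = [[sum(h * w for h, w in zip(row, weight_row))
               for weight_row in lm_head_weight]
              for row in hidden_states]
    # Shift: logits[:-1] is scored against labels[1:], then average the
    # per-position negative log-likelihoods (mean cross-entropy).
    nll = [-log_softmax_at(row, tgt)
           for row, tgt in zip(logits[:-1], labels[1:])]
    return sum(nll) / len(nll)

# Identity head over a 3-token vocabulary; hidden states strongly favor
# the true next token at each position, so the loss is near zero.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hidden = [[0.0, 9.0, 0.0],   # position 0 -> should predict token 1
          [0.0, 0.0, 9.0],   # position 1 -> should predict token 2
          [0.0, 0.0, 0.0]]   # final position: no target after the shift
loss = causal_lm_loss(hidden, W, labels=[0, 1, 2])
```

Framework implementations do the same thing on batched tensors, flattening the shifted logits to `(batch * (seq_len - 1), vocab_size)` before the cross-entropy call.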