Principle:LLMBook zh LLMBook zh github io ALiBi Position Encoding
| Knowledge Sources | |
|---|---|
| Domains | Deep_Learning, Model_Architecture |
| Last Updated | 2026-02-08 04:29 GMT |
Overview
Position encoding method that adds a linear bias proportional to token distance directly to attention scores, enabling length extrapolation without positional embeddings.
Description
Attention with Linear Biases (ALiBi) replaces explicit positional embeddings with a linear penalty added to the attention scores. Each attention head receives a fixed, head-specific slope, and the penalty for attending to a token is that slope multiplied by the token's distance from the query. The slopes form a geometric sequence: for $n$ heads, the slopes are $2^{-8/n}, 2^{-16/n}, \ldots, 2^{-8}$. ALiBi requires no learned parameters for position encoding and supports length extrapolation (training on short sequences, testing on longer ones). It is used in models such as BLOOM and MPT as an alternative to RoPE.
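To make the slope and the penalty concrete, the following minimal sketch computes the slopes for a power-of-two head count and the bias each head assigns at a fixed distance. The names `alibi_slopes_pow2` and `penalty` are illustrative and do not come from any library; the sketch assumes the head count is a power of two.

```python
def alibi_slopes_pow2(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads (n a power of two)."""
    base = 2.0 ** (-8.0 / num_heads)
    return [base ** h for h in range(1, num_heads + 1)]


def penalty(slope, distance):
    """Bias added to the score of a key that lies `distance` tokens behind the query."""
    return -slope * distance


slopes = alibi_slopes_pow2(8)
# With 8 heads the slopes are 1/2, 1/4, ..., 1/256.
for head, slope in enumerate(slopes, start=1):
    print(f"head {head}: slope={slope:.6f}, penalty at distance 16 = {penalty(slope, 16):.4f}")
```

Heads with steep slopes penalize distant tokens heavily and so concentrate on nearby context, while heads with shallow slopes keep longer-range tokens in play.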
Usage
Use this principle when studying alternative position encoding strategies for Transformers, particularly for architectures that need strong length extrapolation. ALiBi is applied by constructing a bias tensor that is added to the attention score matrix before softmax. It is a direct alternative to RoPE and sinusoidal embeddings.
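A minimal pure-Python sketch of that step is shown below: it builds a bias tensor of shape `[num_heads][seq_len][seq_len]` and adds one head's slice to a score matrix before softmax. The helper names are hypothetical, the head count is assumed to be a power of two, and the causal mask is folded into the bias as `-inf` purely for brevity; real implementations usually keep the mask separate and operate on framework tensors.

```python
import math


def alibi_bias(num_heads, seq_len):
    """Entry (h, i, j) is -slope_h * (i - j) for j <= i, and -inf for future keys."""
    base = 2.0 ** (-8.0 / num_heads)  # assumes num_heads is a power of two
    slopes = [base ** h for h in range(1, num_heads + 1)]
    return [
        [[-m * (i - j) if j <= i else -math.inf for j in range(seq_len)]
         for i in range(seq_len)]
        for m in slopes
    ]


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(scores, bias):
    """Add one head's bias to its score matrix, then apply softmax row by row."""
    return [softmax([s + b for s, b in zip(s_row, b_row)])
            for s_row, b_row in zip(scores, bias)]


seq_len = 4
bias = alibi_bias(num_heads=8, seq_len=seq_len)
zero_scores = [[0.0] * seq_len for _ in range(seq_len)]  # stand-in for QK^T / sqrt(d_k)
weights = biased_attention_weights(zero_scores, bias[0])  # head with slope 1/2
print(weights[-1])  # attention of the last query over keys 0..3
```

With all content scores set to zero, the resulting weights reflect the bias alone, so each query attends most strongly to itself and progressively less to older tokens.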
Theoretical Basis
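ALiBi modifies the attention computation by adding a linear bias to the scaled scores of each head $h$:
$$\text{Attention}_h(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + m_h \cdot D\right) V$$
Where:
- $m_h$ is the head-specific slope
- $D$ is a matrix of relative distances, $D_{ij} = -(i - j)$ for query position $i$ and key position $j \le i$ (negative so closer tokens get higher scores)
- Slopes are computed as: $m_h = 2^{-8h/n}$ for $h = 1, \ldots, n$, where $n$ is the number of heads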
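When the number of heads $n$ is not a power of 2, the first $2^{\lfloor \log_2 n \rfloor}$ slopes follow the power-of-2 formula, and the extra slopes are interpolated from the slope sequence for twice that many heads.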
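The slope rule, including the non-power-of-2 case, can be sketched as follows. The source does not spell out the interpolation, so the scheme here is an assumption based on one common convention: use the largest power of two not exceeding the head count, then fill the remaining heads with every other slope from the sequence for twice that many heads. The function name `alibi_slopes` is illustrative.

```python
import math


def alibi_slopes(num_heads):
    """Slopes for any head count, under the interpolation convention assumed above."""
    def pow2_slopes(n):
        base = 2.0 ** (-8.0 / n)
        return [base ** h for h in range(1, n + 1)]

    closest = 2 ** math.floor(math.log2(num_heads))  # largest power of two <= num_heads
    slopes = pow2_slopes(closest)
    if closest != num_heads:
        # Take every other slope from the sequence for 2 * closest heads.
        extra = pow2_slopes(2 * closest)[0::2]
        slopes += extra[: num_heads - closest]
    return slopes


print(alibi_slopes(12))  # 8 power-of-two slopes followed by 4 interpolated ones
```

For 12 heads, the first eight slopes are $2^{-1}, \ldots, 2^{-8}$ and the four extra slopes fall between them at $2^{-0.5}, 2^{-1.5}, 2^{-2.5}, 2^{-3.5}$.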
Pseudo-code Logic:
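# Abstract algorithm description (NOT real implementation)
base = 2 ** (-8 / num_heads)                      # ratio of the geometric sequence (num_heads a power of 2)
slopes = pow(base, range(1, num_heads + 1))       # one slope per head
positions = cumsum(attention_mask) - 1            # key positions 0..T-1, skipping padding
alibi_bias = slopes * positions                   # per-head linear bias over key positions
# Within one query row, slope * j differs from slope * (j - i) only by a constant,
# so softmax gives the same result as the relative-distance form.
attention_scores = QK^T / sqrt(d_k) + alibi_bias  # added before masking and softmax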
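The pseudo-code biases scores by key position rather than by explicit query-key distance. The sketch below checks, for a single query row, that both forms give the same attention weights after softmax. The score values are made up for illustration, the helper names are hypothetical, and multiplying the cumulative sum by the mask (to zero out padding positions) is an assumption borrowed from BLOOM-style implementations rather than something the pseudo-code states.

```python
import math
from itertools import accumulate


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def key_positions(attention_mask):
    """cumsum(mask) - 1, zeroed at padding: the position index of each real token."""
    return [(c - 1) * m for c, m in zip(accumulate(attention_mask), attention_mask)]


slope = 0.5
attention_mask = [1, 1, 1, 1]
positions = key_positions(attention_mask)  # [0, 1, 2, 3]
query = 3                                  # the last token attends to keys 0..3
scores = [0.2, -0.1, 0.4, 0.0]             # made-up stand-in for QK^T / sqrt(d_k)

# Key-position form, as in the pseudo-code.
absolute = softmax([s + slope * p for s, p in zip(scores, positions)])
# Relative-distance form, as in the attention formula.
relative = softmax([s - slope * (query - p) for s, p in zip(scores, positions)])

print(absolute)
print(relative)
```

The two rows agree because softmax is unchanged when a constant (here `slope * query`) is added to every entry of a row, which is why an implementation can store a bias that depends only on the key position.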