Principle:LLMBook zh LLMBook zh github io ALiBi Position Encoding

Knowledge Sources	Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation; LLMBook-zh
Domains	Deep_Learning, Model_Architecture
Last Updated	2026-02-08 04:29 GMT

Overview

A position encoding method that adds to each attention score a negative bias proportional to the query-key distance, enabling length extrapolation without positional embeddings.

Description

Attention with Linear Biases (ALiBi) replaces explicit positional embeddings with a simple linear penalty added to attention scores. Each attention head receives a head-specific slope, and the penalty for attending to a distant token is proportional to the distance multiplied by that slope. The slopes are geometrically spaced: for $n$ heads (with $n$ a power of 2), the slopes are $2^{-8/n}, 2^{-16/n}, \dots, 2^{-8}$. Heads with large slopes focus on nearby tokens, while heads with small slopes can still attend far back. ALiBi requires no learned parameters for position encoding and supports length extrapolation (training on short sequences, testing on longer ones). It is used in models like BLOOM and MPT as an alternative to RoPE.
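As a small illustration of the geometric spacing, the sketch below computes the slopes for a power-of-two head count and the penalty each head would add for a key ten tokens back. The function name `alibi_slopes_pow2` and the chosen distance are assumptions made for this example, not names from the source.

```python
def alibi_slopes_pow2(num_heads: int) -> list[float]:
    """Geometric slopes m_k = 2^(-8k/n) for k = 1..n, with n a power of two."""
    if num_heads < 1 or (num_heads & (num_heads - 1)) != 0:
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * k / num_heads) for k in range(1, num_heads + 1)]


slopes = alibi_slopes_pow2(8)

# Penalty that each head adds when a query attends to a key `distance` tokens back.
distance = 10
penalties = [-m * distance for m in slopes]

for head, (m, p) in enumerate(zip(slopes, penalties), start=1):
    print(f"head {head}: slope={m:.6f} penalty={p:.4f}")
```

With 8 heads the first slope is 1/2 and the last is 1/256, so the first head penalizes distance heavily while the last head is close to position-agnostic.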

Usage

Use this principle when studying alternative position encoding strategies for Transformers, particularly for architectures that need strong length extrapolation. ALiBi is applied by constructing a bias tensor that is added to the attention score matrix before softmax. It is a direct alternative to RoPE and sinusoidal embeddings.
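To make the "bias added before softmax" step concrete, here is a minimal NumPy sketch of causal attention with an ALiBi bias. It assumes a decoder-style causal mask, tensors shaped `(heads, seq, d_k)`, and a power-of-two head count; the function name `alibi_attention` is hypothetical, and this is an illustration rather than any particular library's implementation.

```python
import numpy as np


def alibi_attention(q, k, v, slopes):
    """Causal attention with an ALiBi bias (illustrative sketch).

    q, k, v: arrays of shape (heads, seq, d_k); slopes: array of shape (heads,).
    Returns the attention output and the attention weights.
    """
    heads, seq, d_k = q.shape
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)       # (heads, seq, seq)

    pos = np.arange(seq)
    distance = pos[:, None] - pos[None, :]                  # i - j for query i, key j
    bias = -slopes[:, None, None] * distance[None, :, :]    # penalty grows with distance

    # Causal mask: a query may not attend to future keys (j > i).
    scores = np.where(distance[None, :, :] >= 0, scores + bias, -np.inf)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
heads, seq, d_k = 4, 6, 8
q = np.zeros((heads, seq, d_k))          # zero queries isolate the effect of the bias
k = rng.normal(size=(heads, seq, d_k))
v = rng.normal(size=(heads, seq, d_k))
slopes = 2.0 ** (-8.0 * np.arange(1, heads + 1) / heads)
out, weights = alibi_attention(q, k, v, slopes)
```

Because the queries are zero here, the content scores vanish and the weights depend only on the bias, so each row of `weights` increases toward the query's own position.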

Theoretical Basis

ALiBi modifies the attention computation by adding a linear bias:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}} + m \cdot [-(i - j)]_{i, j}\right) V$

Where:

$m$ is the head-specific slope
$[-(i - j)]_{i, j}$ is the matrix of negated distances between query position $i$ and key position $j \le i$ under causal masking (negative so closer tokens get higher scores)
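Slopes are computed as: $m_{k} = 2^{-8k/n}$ for $k = 1, \dots, n$, where $n$ is the number of heads (assumed to be a power of 2)

When $n$ is not a power of 2, let $n'$ be the largest power of 2 below $n$. The first $n'$ heads use $2^{-8k/n'}$ for $k = 1, \dots, n'$, and the remaining $n - n'$ heads take the extra slopes $2^{-4k/n'}$ for odd $k = 1, 3, 5, \dots$, which interleave with the first set.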
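The slope rule for arbitrary head counts can be sketched as follows. This follows the commonly used construction (power-of-two slopes first, then extra slopes at odd exponents from the next finer sequence); the helper name `alibi_slopes` is an assumption for this example.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Head slopes for any head count (hypothetical helper, illustrative only)."""
    if num_heads < 1:
        raise ValueError("num_heads must be positive")

    # n' = largest power of two that does not exceed num_heads.
    closest_pow2 = 1 << (num_heads.bit_length() - 1)

    # First n' heads: 2^(-8k/n') for k = 1..n'.
    slopes = [2.0 ** (-8.0 * k / closest_pow2) for k in range(1, closest_pow2 + 1)]

    if closest_pow2 != num_heads:
        # Remaining heads: 2^(-4k/n') for odd k = 1, 3, 5, ...
        num_extra = num_heads - closest_pow2
        slopes += [2.0 ** (-4.0 * k / closest_pow2) for k in range(1, 2 * num_extra, 2)]
    return slopes


slopes_8 = alibi_slopes(8)     # power of two: only the first branch is used
slopes_12 = alibi_slopes(12)   # 8 regular slopes plus 4 extra ones
```

For 12 heads, the four extra slopes are $2^{-0.5}, 2^{-1.5}, 2^{-2.5}, 2^{-3.5}$, which fall between the first few slopes of the 8-head sequence.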

Pseudo-code Logic:

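# Abstract algorithm description (NOT real implementation)
base = 2 ** (-8 / num_heads)                  # assumes num_heads is a power of 2
slopes = pow(base, range(1, num_heads + 1))   # geometric slopes m_1..m_n
positions = cumsum(attention_mask) - 1        # key positions, skipping padding
alibi_bias = slopes * positions               # equals the relative-distance penalty up to a per-query constant, which softmax cancels
attention_scores = QK^T / sqrt(d_k) + alibi_bias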
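The pseudo-code above multiplies slopes by positions derived from the attention mask rather than by explicit query-key distances. The NumPy sketch below, under the assumption of causal masking and a power-of-two head count, builds such a bias tensor and checks that it yields the same attention weights as the relative-distance form. The names `build_alibi_bias` and `softmax` are defined here for illustration and are not taken from a specific library.

```python
import numpy as np


def build_alibi_bias(attention_mask, slopes):
    """Bias of shape (batch, heads, 1, seq): slope times key position.

    attention_mask: (batch, seq) with 1 for real tokens and 0 for padding.
    """
    positions = (np.cumsum(attention_mask, axis=-1) - 1) * attention_mask
    return slopes[None, :, None, None] * positions[:, None, None, :]


def softmax(x):
    """Numerically stable softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


heads, seq = 4, 5
slopes = 2.0 ** (-8.0 * np.arange(1, heads + 1) / heads)
mask = np.ones((1, seq))
key_position_bias = build_alibi_bias(mask, slopes)                  # m * j

pos = np.arange(seq)
distance = pos[:, None] - pos[None, :]                              # i - j
relative_bias = -slopes[None, :, None, None] * distance             # -m * (i - j)

causal = pos[None, :] <= pos[:, None]                               # key j visible to query i
scores = np.random.default_rng(0).normal(size=(1, heads, seq, seq))
weights_a = softmax(np.where(causal, scores + key_position_bias, -np.inf))
weights_b = softmax(np.where(causal, scores + relative_bias, -np.inf))
print(np.allclose(weights_a, weights_b))
```

The two forms differ only by the term $m \cdot i$, which is constant across each query row, so the softmax output is unchanged; the key-position form needs only a `(batch, heads, 1, seq)` tensor instead of a full `seq x seq` matrix per head.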

Related Pages

Implementation:LLMBook_zh_LLMBook_zh_github_io_Build_Alibi_Tensor
