Principle:LLMBook zh LLMBook zh github io ALiBi Position Encoding

From Leeroopedia


Knowledge Sources
Domains Deep_Learning, Model_Architecture
Last Updated 2026-02-08 04:29 GMT

Overview

Position encoding method that adds a linear bias proportional to token distance directly to attention scores, enabling length extrapolation without positional embeddings.

Description

Attention with Linear Biases (ALiBi) replaces explicit positional embeddings with a simple linear penalty added to attention scores. Each attention head receives a head-specific slope, and the penalty for attending to a distant token is proportional to the distance multiplied by that slope. The slopes are geometrically spaced: for n heads, the slopes are $2^{-8/n}, 2^{-16/n}, \ldots$. ALiBi requires no learned parameters for position encoding and naturally supports length extrapolation (training on short sequences, testing on long ones). It is used in models like BLOOM and MPT as an alternative to RoPE.
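
The geometric spacing of the slopes can be illustrated with a minimal sketch. The function name `alibi_slopes` is hypothetical, and the sketch assumes the number of heads is a power of two.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Geometric ALiBi slopes 2^(-8k/n) for k = 1..n (hypothetical helper).

    Assumption: num_heads is a power of two; other head counts are rejected.
    """
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    return [2.0 ** (-8.0 * k / num_heads) for k in range(1, num_heads + 1)]


slopes = alibi_slopes(8)
# For 8 heads the slopes are 1/2, 1/4, ..., 1/256: each head penalizes
# distance half as strongly as the previous one.
print(slopes)
```

Heads with large slopes focus on nearby tokens, while heads with small slopes can still attend far back in the sequence.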

Usage

Use this principle when studying alternative position encoding strategies for Transformers, particularly for architectures that need strong length extrapolation. ALiBi is applied by constructing a bias tensor that is added to the attention score matrix before softmax. It is a direct alternative to RoPE and sinusoidal embeddings.
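
As a rough illustration of adding the bias before softmax, the sketch below builds the bias matrix for a single head and applies it to placeholder scores. The names `alibi_bias` and `causal_softmax` are hypothetical, and the all-zero `raw_scores` stand in for $QK^T/\sqrt{d_k}$ so that the effect of the bias alone is visible.

```python
import math


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """bias[i][j] = slope * (j - i): zero on the diagonal, more negative further back."""
    return [[slope * (j - i) for j in range(seq_len)] for i in range(seq_len)]


def causal_softmax(scores: list[list[float]]) -> list[list[float]]:
    """Row-wise softmax over keys j <= i; future positions get weight 0."""
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        top = max(visible)
        exps = [math.exp(s - top) for s in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


seq_len, slope = 4, 0.5
raw_scores = [[0.0] * seq_len for _ in range(seq_len)]  # placeholder for QK^T / sqrt(d_k)
bias = alibi_bias(seq_len, slope)
biased = [[s + b for s, b in zip(s_row, b_row)] for s_row, b_row in zip(raw_scores, bias)]
weights = causal_softmax(biased)  # bias is added BEFORE the softmax
```

With identical content scores, each row of `weights` increases toward the query's own position, which is the recency preference the bias introduces.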

Theoretical Basis

ALiBi modifies the attention computation by adding a linear bias:

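$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}+m\cdot[(j-i)]_{i,j}\right)V$$

Where:

  • $m$ is the head-specific slope
  • $[(j-i)]_{i,j}$ is a matrix of relative distances between query position $i$ and key position $j$ (negative for $j < i$, so closer tokens get higher scores)
  • Slopes are computed as: $m_k = 2^{-8k/n}$ for $k = 1, \ldots, n$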

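When $n$ is not a power of 2, slopes are first computed for the nearest lower power of two $n'$, and the extra slopes are interpolated from the sequence $2^{-4k/n'}$ (the slopes for $2n'$ heads), taking odd $k$.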
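
The non-power-of-two case can be sketched as follows. The function name `alibi_slopes_any` is hypothetical. The sketch assumes that the power of two in question is the largest one not exceeding the head count, and that the extra slopes are taken at odd exponents; treat both as assumptions rather than a definitive specification.

```python
def alibi_slopes_any(num_heads: int) -> list[float]:
    """ALiBi slopes for any positive head count (hypothetical helper).

    Assumptions: the base set uses the largest power of two <= num_heads,
    and extra slopes are 2^(-4k/closest) for odd k = 1, 3, 5, ...
    """
    if num_heads <= 0:
        raise ValueError("num_heads must be positive")
    closest = 1 << (num_heads.bit_length() - 1)  # largest power of two <= num_heads
    slopes = [2.0 ** (-8.0 * k / closest) for k in range(1, closest + 1)]
    remaining = num_heads - closest
    # Interleave values from the finer sequence built for 2 * closest heads.
    slopes += [2.0 ** (-4.0 * k / closest) for k in range(1, 2 * remaining, 2)]
    return slopes


slopes_12 = alibi_slopes_any(12)
# First 8 entries: 2^-1 ... 2^-8; last 4 entries: 2^-0.5, 2^-1.5, 2^-2.5, 2^-3.5
```

For a power-of-two head count, `remaining` is zero and the function reduces to the plain formula $m_k = 2^{-8k/n}$.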

Pseudo-code Logic:

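# Abstract algorithm description (NOT real implementation)
base = 2 ** (-8 / num_heads)                  # assumes num_heads is a power of 2
slopes = pow(base, range(1, num_heads + 1))   # geometric slopes m_k = base^k, one per head
distances = cumsum(attention_mask) - 1        # key positions 0..T-1 along the sequence
alibi_bias = slopes[:, None] * distances[None, :]   # linear penalty, broadcast over query rows
# Key positions differ from (j - i) only by a per-row constant, which softmax cancels.
attention_scores = QK^T / sqrt(d_k) + alibi_bias    # then apply the causal mask and softmax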
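
The pseudo-code derives `distances` from `cumsum(attention_mask) - 1`, which gives absolute key positions rather than the relative offsets $j - i$. The sketch below checks, for a single query row, that both choices produce the same attention weights, because softmax is unchanged by adding a constant to every entry of a row. The mask, slope, and content scores are made-up illustrative values.

```python
import math
from itertools import accumulate


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over one row of scores."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


attention_mask = [1, 1, 1, 1, 1]  # hypothetical: one unpadded sequence of 5 tokens
positions = [p - 1 for p in accumulate(attention_mask)]  # cumsum(mask) - 1
slope = 0.25                           # slope of a single head (illustrative)
content = [0.3, -1.2, 0.8, 0.1, 0.5]   # made-up QK^T / sqrt(d_k) row for the last query
i = len(positions) - 1                 # index of the last query position

# Bias from absolute key positions (as in the pseudo-code) ...
absolute = softmax([c + slope * p for c, p in zip(content, positions)])
# ... versus bias from relative offsets j - i (as in the formula).
relative = softmax([c + slope * (p - i) for c, p in zip(content, positions)])

max_diff = max(abs(a - r) for a, r in zip(absolute, relative))
print(max_diff)  # expected to be at floating-point rounding level
```

The two score rows differ by the constant `slope * i`, so the normalized weights agree up to rounding error.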
