Implementation:Turboderp org Exllamav2 Attn Params

From Leeroopedia
Knowledge Sources
Domains: Attention, Inference Parameters
Last Updated: 2026-02-15 00:00 GMT

Overview

Params and PagedParams manage attention mask construction, positional encoding offsets, and sequence metadata required by every attention layer during forward passes through ExLlamaV2 models.

Description

The Params class is the standard attention-parameter container for non-paged inference. It tracks the current batch_size, seq_len, and past_len (or a list of past lengths when multiple caches are used via multi_cache), along with an optional input_mask for non-causal attention patterns, position_offsets for batched sequences of differing lengths, and alt_rope_embed / rope_offsets for alternative rotary positional embedding schemes. Causal attention masks are built lazily: get_attn_mask() constructs the mask on first request using build_single_attn_mask() and caches it for reuse across layers. When tensor parallelism is active, prep_tp() replicates the position-offset and past-length tensors to every device in the KV split.
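To make the mask shape concrete, the following framework-free sketch builds the causal pattern for one sequence that already has past_len cached tokens. It is an illustration only, not the library's implementation: causal_mask_rows is a hypothetical helper, nested lists stand in for tensors, and the value used for masked positions is an assumption.

```python
NEG = -65504.0  # assumption: a large negative value marks a masked position


def causal_mask_rows(seq_len: int, past_len: int) -> list[list[float]]:
    """Return a (seq_len, past_len + seq_len) additive mask as nested lists.

    Query token q sits at absolute position past_len + q and may attend to
    every key at a position <= past_len + q; later keys receive NEG.
    """
    total = past_len + seq_len
    return [
        [0.0 if k <= past_len + q else NEG for k in range(total)]
        for q in range(seq_len)
    ]


mask = causal_mask_rows(seq_len=3, past_len=2)
for row in mask:
    # "." = visible key, "x" = masked key
    print("".join("." if v == 0.0 else "x" for v in row))
```

All cached positions stay visible to every query, and only the trailing seq_len columns form the triangular part.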

PagedParams extends Params with paged-attention metadata: a block_index tensor mapping virtual pages to physical cache blocks, cache_seqlens per batch element, max_cache_seqlen, and page_size. It also detects whether the block layout is sequential (contiguous physical pages) to enable optimised sequential-write paths. PagedParams overrides get_attn_mask() to raise NotImplementedError because paged attention kernels compute masking internally. It also provides its own prep_tp() that distributes block indices and cache sequence lengths across GPU devices.
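The idea of a sequential block layout can be sketched without tensors. The helpers below are hypothetical and assume that what matters is whether the physical blocks covered by the newly written tokens are consecutive; the library's actual check may cover a different range of pages. The sketch also assumes q_len > 0.

```python
def touched_pages(cached_len: int, q_len: int, page_size: int) -> range:
    """Virtual page indices covered by tokens [cached_len, cached_len + q_len)."""
    first = cached_len // page_size
    last = (cached_len + q_len - 1) // page_size
    return range(first, last + 1)


def is_contiguous(physical: list[int]) -> bool:
    """True when each physical block id is exactly one more than the previous."""
    return all(b == a + 1 for a, b in zip(physical, physical[1:]))


# One batch element: virtual pages 0..3 map to these physical blocks.
block_row = [7, 8, 9, 3]

# Appending 60 tokens after 100 cached ones touches virtual pages 1 and 2.
pages = touched_pages(cached_len=100, q_len=60, page_size=64)
print(list(pages))
print(is_contiguous([block_row[p] for p in pages]))  # blocks 8, 9
print(is_contiguous(block_row))                      # 9 -> 3 breaks the run
```

When the touched blocks are consecutive, the new keys and values can be written as one contiguous slice rather than page by page.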

Both classes also support block_diag_mask and cu_seqlens attributes used for flash-attention style block-diagonal masking (e.g. packing multiple short sequences into a single batch dimension).
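As a rough illustration of block-diagonal masking for packed sequences, the sketch below derives an "allowed" matrix from cumulative sequence lengths. It assumes the common convention that cu_seqlens starts at 0 and ends at the total packed length; block_diag_allowed is a hypothetical helper, and causal masking inside each block is deliberately left out.

```python
from bisect import bisect_right


def block_diag_allowed(cu_seqlens: list[int]) -> list[list[bool]]:
    """Return a square matrix where entry [q][k] is True only if packed
    positions q and k belong to the same sequence."""
    total = cu_seqlens[-1]
    # Segment id of every packed position, found from the boundaries.
    seg = [bisect_right(cu_seqlens, i) - 1 for i in range(total)]
    return [[seg[q] == seg[k] for k in range(total)] for q in range(total)]


cu = [0, 2, 5]  # two packed sequences, lengths 2 and 3
allowed = block_diag_allowed(cu)
for row in allowed:
    print("".join("." if ok else "x" for ok in row))
```

Positions 0 and 1 see only each other, and positions 2 to 4 see only each other, so the packed sequences cannot attend across their boundary.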

Usage

Use Params when running standard (non-paged) inference or prefill, passing it to each layer's forward() call as the attn_params argument. Use PagedParams when the cache is managed with a paged-attention allocator (e.g. inside ExLlamaV2DynamicGenerator). Both classes are typically constructed once per forward pass and shared across all transformer layers.
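The "construct once, share across layers" pattern pairs naturally with lazy mask caching. The toy class below is a hypothetical stand-in, not the real Params: it builds a boolean causal pattern on first request, reuses it afterwards, and returns None for single-token decoding unless forced.

```python
class LazyMask:
    """Hypothetical per-forward-pass parameter object with a cached mask."""

    def __init__(self, seq_len: int, past_len: int):
        self.seq_len = seq_len
        self.past_len = past_len
        self._mask = None   # built on first request
        self.builds = 0     # how many times the mask was constructed

    def get_mask(self, force: bool = False):
        # Single-token causal decoding needs no mask unless forced.
        if self.seq_len == 1 and not force:
            return None
        if self._mask is None:
            self.builds += 1
            total = self.past_len + self.seq_len
            self._mask = [
                [k <= self.past_len + q for k in range(total)]
                for q in range(self.seq_len)
            ]
        return self._mask


params = LazyMask(seq_len=4, past_len=0)
# Every "layer" asks the same object; only the first call does any work.
masks = [params.get_mask() for _layer in range(32)]
print(params.builds)

decode = LazyMask(seq_len=1, past_len=512)
print(decode.get_mask())
```

Sharing one object means the cost of mask construction is paid at most once per forward pass, regardless of model depth.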

Code Reference

Source Location

Signature

class Params:
    def __init__(
        self,
        batch_size: int = 1,
        seq_len: int | None = None,
        past_len: int | list[int] | None = None,
        input_mask: torch.Tensor | None = None,
        position_offsets: torch.Tensor | None = None,
        paged: bool = False,
        alt_rope_embed: tuple[torch.Tensor, torch.Tensor] | dict | None = None,
        non_causal_attn: bool = False,
        rope_offsets: torch.Tensor | None = None
    ): ...

class PagedParams(Params):
    def __init__(
        self,
        batch_size: int,
        block_index: torch.Tensor,
        cache_seqlens: torch.Tensor,
        max_cache_seqlen: int,
        page_size: int,
        q_len: int = 0,
        alt_rope_embed: dict | None = None,
        rope_offsets: torch.Tensor | None = None
    ): ...

Import

from exllamav2.attn_params import Params, PagedParams

I/O Contract

Inputs (Params.__init__)

| Name | Type | Required | Description |
|---|---|---|---|
| batch_size | int | No (default 1) | Number of sequences in the batch |
| seq_len | int or None | No | Length of the current input segment being processed |
| past_len | int, list[int], or None | No | Number of previously cached tokens; pass a list when using multi-cache mode |
| input_mask | torch.Tensor or None | No | Custom attention mask tensor; when set, is_causal() returns False |
| position_offsets | torch.Tensor or None | No | Per-batch offsets added to position indices (used when sequences have different lengths) |
| paged | bool | No (default False) | Set to True for paged attention mode (constructor returns early) |
| alt_rope_embed | tuple or dict or None | No | Alternative RoPE sin/cos embeddings, either as a (sin, cos) tuple or a device-keyed dict |
| non_causal_attn | bool | No (default False) | If True, disables causal masking entirely |
| rope_offsets | torch.Tensor or None | No | Additional offsets applied to RoPE positions |

Inputs (PagedParams.__init__)

| Name | Type | Required | Description |
|---|---|---|---|
| batch_size | int | Yes | Number of sequences in the batch |
| block_index | torch.Tensor | Yes | 2-D tensor (batch, max_pages) mapping virtual page indices to physical cache block indices; must be on CPU |
| cache_seqlens | torch.Tensor | Yes | 1-D tensor (batch,) with the current cached sequence length per batch element; must be on CPU |
| max_cache_seqlen | int | Yes | Maximum cache sequence length across the batch |
| page_size | int | Yes | Number of tokens per page/block in the paged cache |
| q_len | int | Effectively yes (signature default is 0) | Length of the query segment being appended; must be > 0, so the default of 0 is not a usable value and callers need to pass it explicitly |
| alt_rope_embed | dict or None | No | Device-keyed dict of alternative RoPE embeddings |
| rope_offsets | torch.Tensor or None | No | Additional offsets applied to RoPE positions; must be on CPU |

Outputs

| Name | Type | Description |
|---|---|---|
| is_causal() | bool | True when no custom input_mask is set (Params only) |
| get_position_offsets(device) | torch.Tensor | Position offsets tensor moved to the specified device |
| get_past_lens(device) | torch.Tensor | Past lengths as a tensor on the specified device (multi-cache mode) |
| get_attn_mask(device, force) | torch.Tensor or None | Lazily built causal + input mask; returns None when seq_len == 1 and no input_mask |
| build_single_attn_mask(...) | torch.Tensor | A (batch, 1, seq_len, past_len+seq_len) float16 upper-triangular mask |
| get_block_index(device) | torch.Tensor | Block index tensor moved to device (PagedParams only) |
| get_cache_seqlens(device_idx) | torch.Tensor | Cache sequence lengths on the specified device (PagedParams only) |
| is_sequential | bool | True if the pages for a single-batch element are physically contiguous (PagedParams only) |

Usage Examples

Basic Usage (Standard Params)

from exllamav2.attn_params import Params

# Single-sequence inference with 512 cached tokens and 1 new token
attn_params = Params(
    batch_size=1,
    seq_len=1,
    past_len=512
)

# The mask is None for single-token causal decoding (optimised path)
mask = attn_params.get_attn_mask(device="cuda:0")
assert mask is None

# Check causality
assert attn_params.is_causal() is True

Batched Prefill with Input Mask

import torch
from exllamav2.attn_params import Params

# Batch of 4 sequences, each 128 tokens, no prior cache
input_mask = torch.zeros((4, 128), dtype=torch.float16)
attn_params = Params(
    batch_size=4,
    seq_len=128,
    past_len=0,
    input_mask=input_mask
)

# Causality is False because an input_mask was provided
assert attn_params.is_causal() is False
mask = attn_params.get_attn_mask(device="cuda:0")
# mask shape: (4, 1, 128, 128)
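One plausible way an input mask overlays the causal mask is an element-wise minimum, so a key is visible only if both masks allow it. The sketch below is a framework-free illustration for a single sequence; combine is a hypothetical helper, and both the additive convention (0.0 keeps a position, a large negative value hides it) and the combination rule are assumptions rather than confirmed library behaviour.

```python
NEG = -65504.0  # assumption: value marking a masked position


def combine(causal: list[list[float]], key_mask: list[float]) -> list[list[float]]:
    """Element-wise minimum of a causal mask and a per-key mask."""
    return [[min(c, m) for c, m in zip(row, key_mask)] for row in causal]


seq_len = 4
causal = [[0.0 if k <= q else NEG for k in range(seq_len)] for q in range(seq_len)]
key_mask = [NEG, 0.0, 0.0, 0.0]  # first key is left padding

combined = combine(causal, key_mask)
for row in combined:
    # "." = visible key, "x" = masked key
    print("".join("." if v == 0.0 else "x" for v in row))
# xxxx
# x.xx
# x..x
# x...
```

Under this convention an all-zero input_mask, as in the example above, hides nothing, and the result is just the causal mask.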

Paged Attention (PagedParams)

import torch
from exllamav2.attn_params import PagedParams

block_index = torch.tensor([[0, 1, 2, 3]], dtype=torch.int)   # 1 batch, 4 pages
cache_seqlens = torch.tensor([250], dtype=torch.int)           # 250 tokens cached

paged_params = PagedParams(
    batch_size=1,
    block_index=block_index,
    cache_seqlens=cache_seqlens,
    max_cache_seqlen=256,
    page_size=64,
    q_len=1
)

# Sequential detection for contiguous pages
print(paged_params.is_sequential)  # True (pages 0,1,2,3 are contiguous)
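The role of block_index and page_size can be illustrated with plain arithmetic. The helpers below are hypothetical and only show how a virtual token position could be translated into a physical block and an offset within it, and how many pages a given length occupies.

```python
def locate_token(block_row: list[int], pos: int, page_size: int) -> tuple[int, int]:
    """Map a token position to (physical block id, offset inside the block)."""
    virtual_page, offset = divmod(pos, page_size)
    return block_row[virtual_page], offset


def pages_needed(total_len: int, page_size: int) -> int:
    """Number of pages required to hold total_len tokens (ceiling division)."""
    return -(-total_len // page_size)


block_row = [5, 2, 9, 4]  # virtual pages 0..3 -> physical blocks
page_size = 64

block, offset = locate_token(block_row, pos=250, page_size=page_size)
print(block, offset)                 # 4 58
print(pages_needed(251, page_size))  # 4
```

Token 250 falls in virtual page 3 (250 // 64), which this illustrative table maps to physical block 4, at offset 58 within that block.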

Key Methods

Params

| Method | Description |
|---|---|
| is_causal() | Returns True when no custom input_mask is provided |
| get_position_offsets(device) | Returns position_offsets tensor, lazily moving it to the target device |
| get_past_lens(device) | Converts the past_lens list to a tensor on the target device (multi-cache mode) |
| get_attn_mask(device, force=False) | Lazily builds and caches the attention mask; returns None for single-token causal decoding unless force=True |
| build_single_attn_mask(batch_size, seq_len, past_len, device, input_mask) | Constructs a (batch, 1, seq_len, past_len+seq_len) causal mask with optional input_mask overlay |
| prep_tp(model) | Replicates position_offsets and past_len tensors across all tensor-parallel devices |
| get_alt_rope_embed(device) | Returns alternative RoPE (sin, cos) tuple for the given device, lazily copying from CPU |
| get_cu_seqlens(device) | Returns cumulative sequence length tensor for flash-attention block-diagonal masking |
| get_block_diag_mask(device) | Builds a block-diagonal attention mask from cu_seqlens for packed-sequence attention |

PagedParams

| Method | Description |
|---|---|
| get_block_index(device) | Returns block_index tensor on the specified device |
| get_cache_seqlens(device_idx) | Returns cache_seqlens on the specified device |
| get_cache_seqlens_after(device_idx) | Returns cache_seqlens + q_len (only valid when is_sequential is True) |
| prep_tp(model) | Distributes block_index, cache_seqlens, and cache_seqlens_after across TP devices |
| get_attn_mask(device, force) | Raises NotImplementedError; paged attention handles masking internally |
