Implementation:Turboderp org Exllamav2 Attn Params

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Attention, Inference Parameters
Last Updated	2026-02-15 00:00 GMT

Overview

Params and PagedParams manage attention mask construction, positional encoding offsets, and sequence metadata required by every attention layer during forward passes through ExLlamaV2 models.

Description

The Params class is the standard attention-parameter container used during non-paged inference. It tracks the current batch_size, seq_len, past_len (or a list of past lengths when using multiple caches via multi_cache), an optional input_mask for non-causal attention patterns, position_offsets for batched sequences of differing lengths, and alt_rope_embed / rope_offsets for alternative rotary positional embedding schemes. The class lazily builds causal attention masks via build_single_attn_mask() and caches them for reuse across layers. When tensor parallelism is active, prep_tp() replicates position offsets and past-length tensors to every device in the KV split.
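The layout of the causal mask can be sketched without PyTorch. The snippet below is a minimal illustration, not the library's code: plain Python lists stand in for the float16 tensor, `causal_mask_rows` is a hypothetical helper, and the choice of `-65504.0` as the masked value is an assumption (it is the most negative finite float16 value).

```python
# Illustrative only: nested lists stand in for the (seq_len, past_len + seq_len)
# slice of the mask tensor that belongs to one sequence.
MASKED = -65504.0  # assumed "masked" additive bias (most negative finite float16)


def causal_mask_rows(seq_len: int, past_len: int) -> list[list[float]]:
    """Return an additive causal mask for one sequence.

    Query i sits at absolute position past_len + i and may attend to every
    key at a position <= past_len + i; later keys receive MASKED.
    """
    total = past_len + seq_len
    return [
        [0.0 if key <= past_len + query else MASKED for key in range(total)]
        for query in range(seq_len)
    ]


mask = causal_mask_rows(seq_len=3, past_len=2)
for row in mask:
    # "." marks a visible key, "x" a masked one
    print("".join("." if value == 0.0 else "x" for value in row))
```

With `seq_len=1` the single row contains no masked entries, which is consistent with get_attn_mask() returning None for single-token causal decoding.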

PagedParams extends Params with paged-attention metadata: a block_index tensor mapping virtual pages to physical cache blocks, cache_seqlens per batch element, max_cache_seqlen, and page_size. It also detects whether the block layout is sequential (contiguous physical pages) to enable optimised sequential-write paths. PagedParams overrides get_attn_mask() to raise NotImplementedError because paged attention kernels compute masking internally. It also provides its own prep_tp() that distributes block indices and cache sequence lengths across GPU devices.
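To make the page mapping concrete, here is a small sketch of how a virtual token position could be resolved through one row of a block index, and how contiguity of the touched pages could be checked. Both helpers (`locate_token`, `pages_are_contiguous`) are hypothetical names introduced for illustration; the library's own sequential-layout test may inspect a different range of pages.

```python
def locate_token(block_row: list[int], position: int, page_size: int) -> tuple[int, int]:
    """Map a virtual token position to (physical block, offset inside that block)."""
    virtual_page = position // page_size
    return block_row[virtual_page], position % page_size


def pages_are_contiguous(
    block_row: list[int], first_pos: int, last_pos: int, page_size: int
) -> bool:
    """True if the pages covering [first_pos, last_pos] are consecutive physical blocks."""
    first_page = first_pos // page_size
    last_page = last_pos // page_size
    touched = block_row[first_page : last_page + 1]
    return all(nxt == cur + 1 for cur, nxt in zip(touched, touched[1:]))


# One sequence whose four virtual pages map to physical blocks 7, 8, 9 and 3
row = [7, 8, 9, 3]
print(locate_token(row, 130, 64))              # virtual page 2, offset 2
print(pages_are_contiguous(row, 0, 191, 64))   # pages 0..2 only
print(pages_are_contiguous(row, 0, 255, 64))   # pages 0..3
```

A write that stays within blocks 7, 8, 9 could use a single contiguous copy, whereas one that spills into the fourth page (block 3) could not.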

Both classes also support block_diag_mask and cu_seqlens attributes used for flash-attention style block-diagonal masking (e.g. packing multiple short sequences into a single batch dimension).
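The relationship between cu_seqlens and a block-diagonal mask can be illustrated as follows. This is a sketch under assumptions: `block_diag_allowed` is a hypothetical helper, booleans stand in for the tensor values, and the sketch does not apply causal masking inside each block (the source does not say whether the library combines the two).

```python
def block_diag_allowed(cu_seqlens: list[int]) -> list[list[bool]]:
    """Build a boolean matrix where entry [q][k] is True when packed tokens
    q and k belong to the same sequence.

    cu_seqlens holds cumulative boundaries, e.g. [0, 2, 5] describes two
    packed sequences of lengths 2 and 3.
    """
    total = cu_seqlens[-1]
    segment_of = [0] * total
    for segment, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        for token in range(start, end):
            segment_of[token] = segment
    return [
        [segment_of[q] == segment_of[k] for k in range(total)]
        for q in range(total)
    ]


allowed = block_diag_allowed([0, 2, 5])
for row in allowed:
    print("".join("#" if visible else "." for visible in row))
```

Tokens from the first packed sequence never see tokens from the second, which is what lets several short sequences share one batch row.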

Usage

Use Params when running standard (non-paged) inference or prefill, passing it to each layer's forward() call as the attn_params argument. Use PagedParams when the cache is managed with a paged-attention allocator (e.g. inside ExLlamaV2DynamicGenerator). Both classes are typically constructed once per forward pass and shared across all transformer layers.
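The benefit of sharing one parameter object across layers is that derived data is built once rather than once per layer. The class below is a hypothetical stand-in written for illustration (`SharedParamsSketch` is not part of ExLlamaV2), and the per-device dictionary is an assumption about how such caching could be done, not a description of the library's internals.

```python
class SharedParamsSketch:
    """Hypothetical per-forward-pass parameter object with a lazily built mask."""

    def __init__(self, seq_len: int, past_len: int):
        self.seq_len = seq_len
        self.past_len = past_len
        self._masks = {}   # device name -> cached boolean mask
        self.builds = 0    # counts how often a mask was actually constructed

    def get_attn_mask(self, device: str):
        # Single-token decoding needs no mask at all
        if self.seq_len == 1:
            return None
        if device not in self._masks:
            self.builds += 1
            total = self.past_len + self.seq_len
            self._masks[device] = [
                [key <= self.past_len + query for key in range(total)]
                for query in range(self.seq_len)
            ]
        return self._masks[device]


params = SharedParamsSketch(seq_len=4, past_len=0)
for _layer in range(32):
    # Every layer asks the same object; only the first call builds anything
    mask = params.get_attn_mask("cuda:0")
print(params.builds)
```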

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: exllamav2/attn_params.py
Lines: Params 6-201, PagedParams 203-317

Signature

class Params:
    def __init__(
        self,
        batch_size: int = 1,
        seq_len: int | None = None,
        past_len: int | list[int] | None = None,
        input_mask: torch.Tensor | None = None,
        position_offsets: torch.Tensor | None = None,
        paged: bool = False,
        alt_rope_embed: tuple[torch.Tensor, torch.Tensor] | dict | None = None,
        non_causal_attn: bool = False,
        rope_offsets: torch.Tensor | None = None
    ): ...

class PagedParams(Params):
    def __init__(
        self,
        batch_size: int,
        block_index: torch.Tensor,
        cache_seqlens: torch.Tensor,
        max_cache_seqlen: int,
        page_size: int,
        q_len: int = 0,
        alt_rope_embed: dict | None = None,
        rope_offsets: torch.Tensor | None = None
    ): ...

Import

from exllamav2.attn_params import Params, PagedParams

I/O Contract

Inputs (Params.__init__)

Name	Type	Required	Description
batch_size	int	No (default 1)	Number of sequences in the batch
seq_len	int or None	No	Length of the current input segment being processed
past_len	int, list[int], or None	No	Number of previously cached tokens; pass a list when using multi-cache mode
input_mask	torch.Tensor or None	No	Custom attention mask tensor; when set, is_causal() returns False
position_offsets	torch.Tensor or None	No	Per-batch offsets added to position indices (used when sequences have different lengths)
paged	bool	No (default False)	Set to True for paged attention mode (constructor returns early)
alt_rope_embed	tuple or dict or None	No	Alternative RoPE sin/cos embeddings, either as a (sin, cos) tuple or a device-keyed dict
non_causal_attn	bool	No (default False)	If True, disables causal masking entirely
rope_offsets	torch.Tensor or None	No	Additional offsets applied to RoPE positions
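As a rough illustration of what position_offsets accomplish, the sketch below computes per-sequence position indices for a batch. It assumes a left-padded batch in which each offset is the negated padding length, so that the first real token of every sequence lands on position 0; that convention and the helper name `batch_position_ids` are assumptions made for this example.

```python
def batch_position_ids(
    seq_len: int, past_len: int, position_offsets: list[int]
) -> list[list[int]]:
    """Return position indices per sequence: past_len + token index + offset."""
    return [
        [past_len + index + offset for index in range(seq_len)]
        for offset in position_offsets
    ]


# Two sequences of padded length 5; the second has 3 padding tokens on the left
ids = batch_position_ids(seq_len=5, past_len=0, position_offsets=[0, -3])
for row in ids:
    print(row)
```

In this sketch the padding tokens receive negative positions; they are expected to be hidden by the input_mask, so their positions do not matter.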

Inputs (PagedParams.__init__)

Name	Type	Required	Description
batch_size	int	Yes	Number of sequences in the batch
block_index	torch.Tensor	Yes	2-D tensor (batch, max_pages) mapping virtual page indices to physical cache block indices; must be on CPU
cache_seqlens	torch.Tensor	Yes	1-D tensor (batch,) with the current cached sequence length per batch element; must be on CPU
max_cache_seqlen	int	Yes	Maximum cache sequence length across the batch
page_size	int	Yes	Number of tokens per page/block in the paged cache
q_len	int	No (default 0)	Length of the query segment being appended; must be > 0, so the default of 0 has to be overridden in practice
alt_rope_embed	dict or None	No	Device-keyed dict of alternative RoPE embeddings
rope_offsets	torch.Tensor or None	No	Additional offsets applied to RoPE positions; must be on CPU
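The constraints in the table above can be checked before constructing a PagedParams object. The helper below is a hypothetical pre-flight check (`check_paged_inputs` is not a library function) that works on plain lists; it assumes each row of the block index must cover the cached tokens plus the new query tokens, and it returns the value that would be passed as max_cache_seqlen.

```python
def check_paged_inputs(
    block_index: list[list[int]],
    cache_seqlens: list[int],
    page_size: int,
    q_len: int,
) -> int:
    """Validate paged-attention inputs and return the maximum cached length."""
    if q_len <= 0:
        raise ValueError("q_len must be > 0")
    if len(block_index) != len(cache_seqlens):
        raise ValueError("block_index and cache_seqlens disagree on batch size")
    for row, cached in zip(block_index, cache_seqlens):
        # Ceiling division: pages required to hold cached + appended tokens
        needed = -(-(cached + q_len) // page_size)
        if len(row) < needed:
            raise ValueError(f"need {needed} pages, block_index row has {len(row)}")
    return max(cache_seqlens)


max_cache_seqlen = check_paged_inputs([[0, 1, 2, 3]], [250], page_size=64, q_len=1)
print(max_cache_seqlen)
```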

Outputs

Name	Type	Description
is_causal()	bool	True when no custom input_mask is set (Params only)
get_position_offsets(device)	torch.Tensor	Position offsets tensor moved to the specified device
get_past_lens(device)	torch.Tensor	Past lengths as a tensor on the specified device (multi-cache mode)
get_attn_mask(device, force)	torch.Tensor or None	Lazily built causal + input mask; returns None when seq_len == 1 and no input_mask
build_single_attn_mask(...)	torch.Tensor	A (batch, 1, seq_len, past_len+seq_len) float16 upper-triangular mask
get_block_index(device)	torch.Tensor	Block index tensor moved to device (PagedParams only)
get_cache_seqlens(device_idx)	torch.Tensor	Cache sequence lengths on the specified device (PagedParams only)
is_sequential	bool	True if the pages for a single-batch element are physically contiguous (PagedParams only)

Usage Examples

Basic Usage (Standard Params)

from exllamav2.attn_params import Params

# Single-sequence inference with 512 cached tokens and 1 new token
attn_params = Params(
    batch_size=1,
    seq_len=1,
    past_len=512
)

# The mask is None for single-token causal decoding (optimised path)
mask = attn_params.get_attn_mask(device="cuda:0")
assert mask is None

# Check causality
assert attn_params.is_causal() is True

Batched Prefill with Input Mask

import torch
from exllamav2.attn_params import Params

# Batch of 4 sequences, each 128 tokens, no prior cache
input_mask = torch.zeros((4, 128), dtype=torch.float16)
attn_params = Params(
    batch_size=4,
    seq_len=128,
    past_len=0,
    input_mask=input_mask
)

# Causality is False because an input_mask was provided
assert attn_params.is_causal() is False
mask = attn_params.get_attn_mask(device="cuda:0")
# mask shape: (4, 1, 128, 128)
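One plausible way to picture the input_mask overlay is as an elementwise minimum between the causal mask and a per-key mask broadcast over all query rows. The sketch below assumes additive masks (0.0 for visible, a large negative value for hidden) and uses plain lists for a single sequence; `overlay_input_mask` is a hypothetical helper and the combination rule is an assumption, not a quotation of the library's code.

```python
MASKED = -65504.0  # assumed "masked" additive bias (most negative finite float16)


def overlay_input_mask(
    seq_len: int, past_len: int, key_mask: list[float]
) -> list[list[float]]:
    """Combine a causal mask with a per-key additive mask for one sequence."""
    total = past_len + seq_len
    rows = []
    for query in range(seq_len):
        row = []
        for key in range(total):
            causal = 0.0 if key <= past_len + query else MASKED
            # The more restrictive of the two masks wins
            row.append(min(causal, key_mask[key]))
        rows.append(row)
    return rows


# First key is padding and must be hidden from every query
combined = overlay_input_mask(seq_len=4, past_len=0, key_mask=[MASKED, 0.0, 0.0, 0.0])
for row in combined:
    print("".join("." if value == 0.0 else "x" for value in row))
```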

Paged Attention (PagedParams)

import torch
from exllamav2.attn_params import PagedParams

block_index = torch.tensor([[0, 1, 2, 3]], dtype=torch.int)   # 1 batch, 4 pages
cache_seqlens = torch.tensor([250], dtype=torch.int)           # 250 tokens cached

paged_params = PagedParams(
    batch_size=1,
    block_index=block_index,
    cache_seqlens=cache_seqlens,
    max_cache_seqlen=250,   # maximum of cache_seqlens across the batch
    page_size=64,
    q_len=1
)

# Sequential detection for contiguous pages
print(paged_params.is_sequential)  # True (pages 0,1,2,3 are contiguous)

Key Methods

Params

Method	Description
is_causal()	Returns True when no custom input_mask is provided
get_position_offsets(device)	Returns position_offsets tensor, lazily moving it to the target device
get_past_lens(device)	Converts the past_lens list to a tensor on the target device (multi-cache mode)
get_attn_mask(device, force=False)	Lazily builds and caches the attention mask; returns None for single-token causal decoding unless force=True
build_single_attn_mask(batch_size, seq_len, past_len, device, input_mask)	Constructs a (batch, 1, seq_len, past_len+seq_len) causal mask with optional input_mask overlay
prep_tp(model)	Replicates position_offsets and past_len tensors across all tensor-parallel devices
get_alt_rope_embed(device)	Returns alternative RoPE (sin, cos) tuple for the given device, lazily copying from CPU
get_cu_seqlens(device)	Returns cumulative sequence length tensor for flash-attention block-diagonal masking
get_block_diag_mask(device)	Builds a block-diagonal attention mask from cu_seqlens for packed-sequence attention

PagedParams

Method	Description
get_block_index(device)	Returns block_index tensor on the specified device
get_cache_seqlens(device_idx)	Returns cache_seqlens on the specified device
get_cache_seqlens_after(device_idx)	Returns cache_seqlens + q_len (only valid when is_sequential is True)
prep_tp(model)	Distributes block_index, cache_seqlens, and cache_seqlens_after across TP devices
get_attn_mask(device, force)	Raises NotImplementedError; paged attention handles masking internally

Related Pages

Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
