Overview
Params and PagedParams manage attention mask construction, positional encoding offsets, and sequence metadata required by every attention layer during forward passes through ExLlamaV2 models.
Description
The Params class is the standard attention-parameter container used during non-paged inference. It tracks the current batch_size, seq_len, past_len (or a list of past lengths when using multiple caches via multi_cache), an optional input_mask for non-causal attention patterns, position_offsets for batched sequences of differing lengths, and alt_rope_embed / rope_offsets for alternative rotary positional embedding schemes. The class lazily builds causal attention masks via build_single_attn_mask() and caches them for reuse across layers. When tensor parallelism is active, prep_tp() replicates position offsets and past-length tensors to every device in the KV split.
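To make the mask shape concrete, here is a minimal sketch in plain Python. It is not the library's implementation: `causal_mask_2d` and `NEG` are hypothetical names, and the additive convention (0.0 for "may attend", the most negative finite float16 for "blocked") is an assumption. The function builds one (seq_len, past_len + seq_len) slice of the (batch, 1, seq_len, past_len+seq_len) mask.

```python
# Hypothetical helper for illustration only; not part of exllamav2.
NEG = -65504.0  # assumed "blocked" value: most negative finite float16


def causal_mask_2d(seq_len: int, past_len: int) -> list[list[float]]:
    """Return a (seq_len, past_len + seq_len) additive causal mask.

    Row i is the i-th new token. It may attend to every cached token
    (columns 0 .. past_len - 1) and to new tokens 0 .. i.
    """
    width = past_len + seq_len
    return [
        [0.0 if col <= past_len + row else NEG for col in range(width)]
        for row in range(seq_len)
    ]


mask = causal_mask_2d(seq_len=3, past_len=2)
for row in mask:
    print(row)
```

With seq_len=1 the single row contains no blocked entries, which is consistent with the mask being skipped entirely for single-token causal decoding.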
PagedParams extends Params with paged-attention metadata: a block_index tensor mapping virtual pages to physical cache blocks, cache_seqlens per batch element, max_cache_seqlen, and page_size. It detects whether the block layout is sequential (contiguous physical pages), which enables optimised sequential-write paths. PagedParams overrides get_attn_mask() to raise NotImplementedError because paged attention kernels compute masking internally, and it provides its own prep_tp() that distributes block indices and cache sequence lengths across GPU devices.
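The virtual-to-physical mapping can be illustrated with a small sketch. It assumes each physical block holds page_size consecutive token slots, which the text above does not spell out. `physical_slot` is a hypothetical helper, and a plain list stands in for one row of block_index.

```python
# Hypothetical helper for illustration only; not part of exllamav2.
def physical_slot(block_row: list[int], position: int, page_size: int) -> int:
    """Map a token position in one sequence to a slot in the paged cache.

    block_row[v] is the physical block backing virtual page v
    (one row of block_index, shown here as a plain list).
    """
    virtual_page = position // page_size   # which page the token falls in
    offset = position % page_size          # where it sits inside that page
    physical_block = block_row[virtual_page]
    return physical_block * page_size + offset


block_row = [7, 2, 9]  # virtual pages 0, 1, 2 live in physical blocks 7, 2, 9
print(physical_slot(block_row, 0, 64))    # first token of virtual page 0
print(physical_slot(block_row, 70, 64))   # virtual page 1, offset 6
print(physical_slot(block_row, 130, 64))  # virtual page 2, offset 2
```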
Both classes also support block_diag_mask and cu_seqlens attributes used for flash-attention style block-diagonal masking (e.g. packing multiple short sequences into a single batch dimension).
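As a rough illustration of block-diagonal masking, the sketch below derives an "allowed" matrix from cumulative sequence lengths. The helper names are hypothetical and the boolean representation is an assumption, since the library's actual mask dtype and layout are not stated here. Any causal masking inside each block is left out.

```python
from bisect import bisect_right


# Hypothetical helpers for illustration only; not part of exllamav2.
def segment_ids(cu_seqlens: list[int]) -> list[int]:
    """Label every packed position with the index of its sequence.

    cu_seqlens = [0, 3, 5] describes two sequences occupying
    positions [0, 3) and [3, 5) of the packed dimension.
    """
    total = cu_seqlens[-1]
    return [bisect_right(cu_seqlens, pos) - 1 for pos in range(total)]


def block_diag_allowed(cu_seqlens: list[int]) -> list[list[bool]]:
    """True where query i and key j belong to the same packed sequence."""
    ids = segment_ids(cu_seqlens)
    return [[a == b for b in ids] for a in ids]


allowed = block_diag_allowed([0, 3, 5])
for row in allowed:
    print(["x" if ok else "." for ok in row])
```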
Usage
Use Params when running standard (non-paged) inference or prefill, passing it to each layer's forward() call as the attn_params argument. Use PagedParams when the cache is managed with a paged-attention allocator (e.g. inside ExLlamaV2DynamicGenerator). Both classes are typically constructed once per forward pass and shared across all transformer layers.
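The "construct once, share across layers" pattern can be sketched without the library. `LazyMaskParams` and `toy_layer` below are hypothetical stand-ins, not exllamav2 classes. They only show why a lazily built, cached mask is computed once even though every layer asks for it.

```python
# Hypothetical stand-ins for illustration only; not part of exllamav2.
class LazyMaskParams:
    """Builds its causal "allowed" matrix on first request and then reuses it."""

    def __init__(self, seq_len: int, past_len: int):
        self.seq_len = seq_len
        self.past_len = past_len
        self._mask = None
        self.build_count = 0  # instrumentation for this example only

    def get_attn_mask(self):
        if self._mask is None:
            self.build_count += 1
            width = self.past_len + self.seq_len
            self._mask = [
                [col <= self.past_len + row for col in range(width)]
                for row in range(self.seq_len)
            ]
        return self._mask


def toy_layer(hidden: int, attn_params: LazyMaskParams) -> int:
    """Placeholder layer: it only fetches the mask, as an attention layer would."""
    mask = attn_params.get_attn_mask()
    return hidden + len(mask)


params = LazyMaskParams(seq_len=4, past_len=0)
hidden = 0
for _ in range(3):  # three "layers" sharing one params object
    hidden = toy_layer(hidden, params)
print(params.build_count)
```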
Code Reference
Signature
class Params:
    def __init__(
        self,
        batch_size: int = 1,
        seq_len: int | None = None,
        past_len: int | list[int] | None = None,
        input_mask: torch.Tensor | None = None,
        position_offsets: torch.Tensor | None = None,
        paged: bool = False,
        alt_rope_embed: tuple[torch.Tensor, torch.Tensor] | dict | None = None,
        non_causal_attn: bool = False,
        rope_offsets: torch.Tensor | None = None
    ): ...

class PagedParams(Params):
    def __init__(
        self,
        batch_size: int,
        block_index: torch.Tensor,
        cache_seqlens: torch.Tensor,
        max_cache_seqlen: int,
        page_size: int,
        q_len: int = 0,
        alt_rope_embed: dict | None = None,
        rope_offsets: torch.Tensor | None = None
    ): ...
Import
from exllamav2.attn_params import Params, PagedParams
I/O Contract
Inputs (Params.__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| batch_size | int | No (default 1) | Number of sequences in the batch |
| seq_len | int or None | No | Length of the current input segment being processed |
| past_len | int, list[int], or None | No | Number of previously cached tokens; pass a list when using multi-cache mode |
| input_mask | torch.Tensor or None | No | Custom attention mask tensor; when set, is_causal() returns False |
| position_offsets | torch.Tensor or None | No | Per-batch offsets added to position indices (used when sequences have different lengths) |
| paged | bool | No (default False) | Set to True for paged attention mode (constructor returns early) |
| alt_rope_embed | tuple or dict or None | No | Alternative RoPE sin/cos embeddings, either as a (sin, cos) tuple or a device-keyed dict |
| non_causal_attn | bool | No (default False) | If True, disables causal masking entirely |
| rope_offsets | torch.Tensor or None | No | Additional offsets applied to RoPE positions |
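The effect of position_offsets can be pictured with a small sketch. It assumes a left-padded batch where each row's offset is the negated number of padding tokens, so the first real token of every row lands on position 0. That convention and the helper name `position_ids` are assumptions made for this example.

```python
# Hypothetical helper for illustration only; not part of exllamav2.
def position_ids(past_len: int, seq_len: int,
                 position_offsets: list[int]) -> list[list[int]]:
    """Per-row position indices: base positions shifted by a per-batch offset."""
    return [
        [past_len + t + offset for t in range(seq_len)]
        for offset in position_offsets
    ]


# Two prompts padded to length 5; the second carries 2 left-padding tokens.
offsets = [0, -2]
ids = position_ids(past_len=0, seq_len=5, position_offsets=offsets)
for row in ids:
    print(row)
```

In this sketch the padded row's real tokens start at position 0, while the padding tokens receive negative positions and would be hidden by the input mask.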
Inputs (PagedParams.__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| batch_size | int | Yes | Number of sequences in the batch |
| block_index | torch.Tensor | Yes | 2-D tensor (batch, max_pages) mapping virtual page indices to physical cache block indices; must be on CPU |
| cache_seqlens | torch.Tensor | Yes | 1-D tensor (batch,) with the current cached sequence length per batch element; must be on CPU |
| max_cache_seqlen | int | Yes | Maximum cache sequence length across the batch |
| page_size | int | Yes | Number of tokens per page/block in the paged cache |
| q_len | int | No (default 0) | Length of the query segment being appended; must be > 0, so in practice the default has to be overridden |
| alt_rope_embed | dict or None | No | Device-keyed dict of alternative RoPE embeddings |
| rope_offsets | torch.Tensor or None | No | Additional offsets applied to RoPE positions; must be on CPU |
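The relationship between cache_seqlens, q_len, page_size, and the width of block_index comes down to a ceiling division. The sketch below uses hypothetical helpers and plain lists in place of tensors. It is a sanity check a caller might run before building the parameters, not something the library is known to do.

```python
# Hypothetical helpers for illustration only; not part of exllamav2.
def pages_needed(cache_seqlen: int, q_len: int, page_size: int) -> int:
    """Pages required to hold the cached tokens plus the new query tokens."""
    if q_len <= 0:
        raise ValueError("q_len must be > 0")
    total_tokens = cache_seqlen + q_len
    return -(-total_tokens // page_size)  # ceiling division


def row_has_capacity(block_row: list[int], cache_seqlen: int,
                     q_len: int, page_size: int) -> bool:
    """True if one block_index row lists enough pages for the write."""
    return len(block_row) >= pages_needed(cache_seqlen, q_len, page_size)


print(pages_needed(250, 1, 64))                      # 251 tokens over 64-token pages
print(row_has_capacity([0, 1, 2, 3], 250, 1, 64))    # 4 pages available
print(row_has_capacity([0, 1, 2, 3], 256, 1, 64))    # a 5th page would be needed
```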
Outputs
| Name | Type | Description |
|---|---|---|
| is_causal() | bool | True when no custom input_mask is set (Params only) |
| get_position_offsets(device) | torch.Tensor | Position offsets tensor moved to the specified device |
| get_past_lens(device) | torch.Tensor | Past lengths as a tensor on the specified device (multi-cache mode) |
| get_attn_mask(device, force) | torch.Tensor or None | Lazily built causal + input mask; returns None when seq_len == 1 and no input_mask is set, unless force=True |
| build_single_attn_mask(...) | torch.Tensor | A (batch, 1, seq_len, past_len+seq_len) float16 upper-triangular mask |
| get_block_index(device) | torch.Tensor | Block index tensor moved to device (PagedParams only) |
| get_cache_seqlens(device_idx) | torch.Tensor | Cache sequence lengths on the specified device (PagedParams only) |
| is_sequential | bool | True if the pages for a single-batch element are physically contiguous (PagedParams only) |
Usage Examples
Basic Usage (Standard Params)
from exllamav2.attn_params import Params

# Single-sequence inference with 512 cached tokens and 1 new token
attn_params = Params(
    batch_size=1,
    seq_len=1,
    past_len=512
)

# The mask is None for single-token causal decoding (optimised path)
mask = attn_params.get_attn_mask(device="cuda:0")
assert mask is None

# Check causality
assert attn_params.is_causal() is True
Batched Prefill with Input Mask
import torch
from exllamav2.attn_params import Params

# Batch of 4 sequences, each 128 tokens, no prior cache
input_mask = torch.zeros((4, 128), dtype=torch.float16)
attn_params = Params(
    batch_size=4,
    seq_len=128,
    past_len=0,
    input_mask=input_mask
)

# Causality is False because an input_mask was provided
assert attn_params.is_causal() is False

mask = attn_params.get_attn_mask(device="cuda:0")
# mask shape: (4, 1, 128, 128)
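How an input mask might overlay the causal mask can be sketched in plain Python. Two things here are assumptions rather than documented behaviour: that the input mask is additive (0.0 for visible, a large negative value for hidden), and that the two masks are combined with an elementwise minimum. `combine_masks` is a hypothetical helper working on a single batch row.

```python
# Hypothetical helper for illustration only; not part of exllamav2.
NEG = -65504.0  # assumed "blocked" value: most negative finite float16


def combine_masks(seq_len: int, past_len: int,
                  input_mask_row: list[float]) -> list[list[float]]:
    """Overlay one row of an additive input mask on a causal mask.

    input_mask_row must have past_len + seq_len entries, one per key position.
    """
    width = past_len + seq_len
    combined = []
    for row in range(seq_len):
        causal = [0.0 if col <= past_len + row else NEG for col in range(width)]
        # A key stays visible only if both masks allow it.
        combined.append([min(c, m) for c, m in zip(causal, input_mask_row)])
    return combined


# Three tokens, the first of which is padding and should be hidden everywhere.
combined = combine_masks(seq_len=3, past_len=0, input_mask_row=[NEG, 0.0, 0.0])
for row in combined:
    print(row)
```

An all-zero input mask, as in the example above, would leave the causal pattern unchanged under these assumptions.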
Paged Attention (PagedParams)
import torch
from exllamav2.attn_params import PagedParams

block_index = torch.tensor([[0, 1, 2, 3]], dtype=torch.int)  # 1 batch, 4 pages
cache_seqlens = torch.tensor([250], dtype=torch.int)         # 250 tokens cached

paged_params = PagedParams(
    batch_size=1,
    block_index=block_index,
    cache_seqlens=cache_seqlens,
    max_cache_seqlen=256,
    page_size=64,
    q_len=1
)

# Sequential detection for contiguous pages
print(paged_params.is_sequential)  # True (pages 0,1,2,3 are contiguous)
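One plausible way to test for a sequential layout is sketched below. It is an assumption, not the library's code: the hypothetical `is_contiguous_write` only inspects the pages touched by the pending write of q_len tokens, whereas the real check may cover more or fewer pages.

```python
# Hypothetical helper for illustration only; not part of exllamav2.
def is_contiguous_write(block_row: list[int], cache_seqlen: int,
                        q_len: int, page_size: int) -> bool:
    """True if the pages receiving the next q_len tokens are consecutive blocks."""
    first_page = cache_seqlen // page_size
    last_page = (cache_seqlen + q_len - 1) // page_size
    return all(
        block_row[p] == block_row[p - 1] + 1
        for p in range(first_page + 1, last_page + 1)
    )


# Mirrors the example above: the single new token stays inside page 3.
print(is_contiguous_write([0, 1, 2, 3], 250, 1, 64))
# A 10-token write crossing from block 4 into block 5 is contiguous...
print(is_contiguous_write([4, 5, 9], 60, 10, 64))
# ...but crossing from block 4 into block 9 is not.
print(is_contiguous_write([4, 9], 60, 10, 64))
```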
Key Methods
Params
| Method | Description |
|---|---|
| is_causal() | Returns True when no custom input_mask is provided |
| get_position_offsets(device) | Returns position_offsets tensor, lazily moving it to the target device |
| get_past_lens(device) | Converts the past_lens list to a tensor on the target device (multi-cache mode) |
| get_attn_mask(device, force=False) | Lazily builds and caches the attention mask; returns None for single-token causal decoding unless force=True |
| build_single_attn_mask(batch_size, seq_len, past_len, device, input_mask) | Constructs a (batch, 1, seq_len, past_len+seq_len) causal mask with optional input_mask overlay |
| prep_tp(model) | Replicates position_offsets and past_len tensors across all tensor-parallel devices |
| get_alt_rope_embed(device) | Returns alternative RoPE (sin, cos) tuple for the given device, lazily copying from CPU |
| get_cu_seqlens(device) | Returns cumulative sequence length tensor for flash-attention block-diagonal masking |
| get_block_diag_mask(device) | Builds a block-diagonal attention mask from cu_seqlens for packed-sequence attention |
PagedParams
| Method | Description |
|---|---|
| get_block_index(device) | Returns block_index tensor on the specified device |
| get_cache_seqlens(device_idx) | Returns cache_seqlens on the specified device |
| get_cache_seqlens_after(device_idx) | Returns cache_seqlens + q_len (only valid when is_sequential is True) |
| prep_tp(model) | Distributes block_index, cache_seqlens, and cache_seqlens_after across TP devices |
| get_attn_mask(device, force) | Raises NotImplementedError; paged attention handles masking internally |