Implementation:Turboderp_org_Exllamav2_ExLlamaV2DeviceContext
| Knowledge Sources | |
|---|---|
| Domains | Device_Management, CUDA |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Per-device CUDA context manager that owns the scratch memory buffer, precomputed RoPE sine/cosine tables, and CUDA stream for a single GPU within an ExLlamaV2 multi-device deployment.
Description
ExLlamaV2DeviceContext encapsulates all device-specific resources needed by a single GPU during ExLlamaV2 inference. Each GPU in a model split receives its own context instance.
Key components:
- __init__(model, device_idx, scratch_bytes, archparams) -- Initialises the context for a given CUDA device index. Creates a CUDA stream (with priority -100) if one does not already exist for that device, storing it in the module-level global_streams dictionary.
- prepare(scratch) -- Calls prepare_sincos() to precompute RoPE tables and, if scratch is True, allocates the scratch buffer as a half-precision tensor of scratch_bytes // 2 elements.
- drop() -- Releases scratch memory, RoPE tables, and marks the context as not ready.
- free() -- Calls drop() and resets scratch_bytes to 1, effectively disabling future allocations.
- begin_scratch_alloc() -- Resets the scratch allocation pointer (scratch_idx) to 0, preparing for a new round of scratch slice requests.
- get_scratch_slice(size_bytes) -- Returns a narrow view into the scratch buffer of the requested size (64-byte aligned). Advances the allocation pointer. Lazily calls prepare(True) if scratch has not yet been allocated.
- prepare_sincos() -- Precomputes sine and cosine embedding tables for Rotary Position Embeddings (RoPE). Supports both NeoX-style (concatenated) and GPT-J-style (interleaved) RoPE layouts. Handles dual-theta configurations (rotary_embedding_base_alt) and applies position scaling factors.
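The bump-allocation pattern behind begin_scratch_alloc() and get_scratch_slice() can be sketched as follows. This is a hypothetical plain-Python model, not the library's code: (offset, length) pairs stand in for torch tensor views, and the ScratchAllocator class name is illustrative.

```python
# Hypothetical model of the scratch bump-allocator pattern described above.
# (offset, length) pairs stand in for torch tensor views into the buffer,
# so the pointer arithmetic can be shown without a GPU.

class ScratchAllocator:
    ALIGN = 64  # each request is rounded up to a 64-byte boundary

    def __init__(self, scratch_bytes: int):
        self.scratch_bytes = scratch_bytes
        self.scratch_idx = 0  # allocation pointer into the buffer

    def begin_scratch_alloc(self) -> None:
        # Reset the pointer; slices handed out earlier are implicitly recycled
        self.scratch_idx = 0

    def get_scratch_slice(self, size_bytes: int):
        # Round the request up to the alignment boundary
        aligned = -(-size_bytes // self.ALIGN) * self.ALIGN
        assert self.scratch_idx + aligned <= self.scratch_bytes, "scratch exhausted"
        view = (self.scratch_idx, size_bytes)  # stands in for a tensor narrow()
        self.scratch_idx += aligned  # advance by the aligned size
        return view

alloc = ScratchAllocator(1 << 20)
alloc.begin_scratch_alloc()
a = alloc.get_scratch_slice(100)  # starts at offset 0
b = alloc.get_scratch_slice(100)  # starts at the next 64-byte boundary, 128
```

Resetting a single pointer per forward pass avoids per-kernel allocations: every kernel in a layer draws its temporary buffers from the same preallocated region.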
Global functions:
- set_device_streams() -- Iterates over global_streams and sets each as the active CUDA stream for its device.
- get_device_stream(index) -- Returns the CUDA stream for a given device index, or None if not registered.
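The module-level stream registry can be sketched like this. It is a hedged model, not the library's code: strings stand in for torch.cuda.Stream objects, and register_stream is a hypothetical helper standing in for the registration done inside __init__.

```python
# Hypothetical model of the global_streams registry pattern described above.
# Strings stand in for torch.cuda.Stream objects so the lookup logic can be
# shown without CUDA.

global_streams: dict = {}

def register_stream(index: int, stream) -> None:
    # __init__ stores a newly created stream only if none exists for the device
    global_streams.setdefault(index, stream)

def get_device_stream(index: int):
    # Return the stream for a device index, or None if not registered
    return global_streams.get(index, None)

register_stream(0, "stream-gpu0")
register_stream(0, "stream-duplicate")  # ignored: device 0 already registered
```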
Usage
ExLlamaV2DeviceContext manages per-GPU resources during model loading and inference. It is created internally by ExLlamaV2 (the model class) for each GPU in the device map and should not normally be instantiated directly by users.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/device.py
- Lines: L1-170
Signature
global_streams: dict = {}

def set_device_streams() -> None:
    ...

def get_device_stream(index: int) -> torch.cuda.Stream | None:
    ...

class ExLlamaV2DeviceContext:

    model: ExLlamaV2
    device_idx: int
    ready: bool
    scratch_bytes: int
    scratch_idx: int
    sin: list[torch.Tensor] | None
    cos: list[torch.Tensor] | None
    scratch: torch.Tensor | None
    stream: torch.cuda.Stream

    def __init__(
        self,
        model: ExLlamaV2,
        device_idx: int,
        scratch_bytes: int,
        archparams = None
    ):
        ...

    def prepare(self, scratch: bool) -> None:
        ...

    def drop(self) -> None:
        ...

    def free(self) -> None:
        ...

    def begin_scratch_alloc(self) -> None:
        ...

    def get_scratch_slice(self, size_bytes: int) -> torch.Tensor:
        ...

    def prepare_sincos(self) -> None:
        ...
Import
from exllamav2.device import ExLlamaV2DeviceContext, get_device_stream
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | The parent model instance providing config and architecture parameters |
| device_idx | int | Yes | CUDA device index (e.g., 0, 1) or -1 for CPU |
| scratch_bytes | int | Yes | Number of bytes to allocate for the scratch buffer |
| archparams | object | No | Architecture parameters; defaults to model.config.arch.lm |
| scratch | bool | Yes (prepare) | Whether to allocate the scratch buffer during preparation |
| size_bytes | int | Yes (get_scratch_slice) | Number of bytes requested for a scratch slice (64-byte aligned internally) |
| index | int | Yes (get_device_stream) | CUDA device index to look up the associated stream |
Outputs
| Name | Type | Description |
|---|---|---|
| context instance | ExLlamaV2DeviceContext | Fully initialised device context with stream, scratch buffer, and RoPE tables |
| scratch_slice | torch.Tensor | From get_scratch_slice: a narrow view into the scratch buffer of the requested size |
| stream | torch.cuda.Stream or None | From get_device_stream: the CUDA stream for the given device, or None |
| sin, cos | list[torch.Tensor] | Precomputed RoPE sine and cosine tensors of shape (1, 1, max_seq_len, head_dim) |
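As an illustration of how such tables are typically built, here is a hedged sketch of NeoX-style (concatenated) RoPE precomputation. It is not the library's exact code: plain lists stand in for half-precision torch tensors, dual-theta and position-scaling handling are omitted, and build_sincos is a hypothetical name.

```python
import math

# Hypothetical sketch of NeoX-style RoPE table precomputation. Each table is
# max_seq_len x head_dim; in the library these are held as torch tensors of
# shape (1, 1, max_seq_len, head_dim). `base` corresponds to the model's
# rotary embedding base (commonly 10000.0).

def build_sincos(max_seq_len: int, head_dim: int, base: float = 10000.0):
    # One inverse frequency per rotary pair (half the head dimension)
    inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    sin_t, cos_t = [], []
    for pos in range(max_seq_len):
        angles = [pos * f for f in inv_freq]
        # NeoX layout: the two halves of the head dimension are concatenated,
        # so each angle appears once in each half
        row = angles + angles
        sin_t.append([math.sin(a) for a in row])
        cos_t.append([math.cos(a) for a in row])
    return sin_t, cos_t

sin_t, cos_t = build_sincos(max_seq_len=8, head_dim=4)
```

Precomputing these tables once per device lets every attention layer reuse them instead of recomputing trigonometric values on each forward pass.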
Usage Examples
Accessing Device Context Resources
from exllamav2.device import ExLlamaV2DeviceContext, get_device_stream
# Typically created internally by the model during load
# Accessing scratch memory for a kernel
ctx = model.device_context[0] # Context for GPU 0
ctx.begin_scratch_alloc()
scratch = ctx.get_scratch_slice(1024 * 1024) # 1 MB scratch slice
# Accessing precomputed RoPE tables
sin_table = ctx.sin[0] # Primary RoPE sine table
cos_table = ctx.cos[0] # Primary RoPE cosine table
Using Device Streams
from exllamav2.device import get_device_stream
import torch

stream = get_device_stream(0)
if stream is not None:
    with torch.cuda.stream(stream):
        # Perform operations on the device-specific stream
        result = tensor_a @ tensor_b
Related Pages
Implements Principle
Requires Environment
Used By
- Implementation:Turboderp_org_Exllamav2_Load_Autosplit
- Implementation:Turboderp_org_Exllamav2_Model_Init