Implementation:Turboderp_org_Exllamav2_ExLlamaV2DeviceContext
| Knowledge Sources | |
|---|---|
| Domains | Device_Management, CUDA |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Per-device CUDA context manager that owns the scratch memory buffer, precomputed RoPE sine/cosine tables, and CUDA stream for a single GPU within an ExLlamaV2 multi-device deployment.
Description
ExLlamaV2DeviceContext encapsulates all device-specific resources needed by a single GPU during ExLlamaV2 inference. Each GPU in a model split receives its own context instance.
Key components:
- __init__(model, device_idx, scratch_bytes, archparams) -- Initialises the context for a given CUDA device index. Creates a CUDA stream (with priority -100) if one does not already exist for that device, storing it in the module-level global_streams dictionary.
- prepare(scratch) -- Calls prepare_sincos() to precompute RoPE tables and, if scratch is True, allocates the scratch buffer as a half-precision tensor of scratch_bytes // 2 elements.
- drop() -- Releases scratch memory, RoPE tables, and marks the context as not ready.
- free() -- Calls drop() and resets scratch_bytes to 1, effectively disabling future allocations.
- begin_scratch_alloc() -- Resets the scratch allocation pointer (scratch_idx) to 0, preparing for a new round of scratch slice requests.
- get_scratch_slice(size_bytes) -- Returns a narrow view into the scratch buffer of the requested size (64-byte aligned). Advances the allocation pointer. Lazily calls prepare(True) if scratch has not yet been allocated.
- prepare_sincos() -- Precomputes sine and cosine embedding tables for Rotary Position Embeddings (RoPE). Supports both NeoX-style (concatenated) and GPT-J-style (interleaved) RoPE layouts. Handles dual-theta configurations (rotary_embedding_base_alt) and applies position scaling factors.
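The bump-allocation pattern behind begin_scratch_alloc() and get_scratch_slice() can be sketched as follows. This is a hypothetical plain-Python model, not the library's code: (offset, length) pairs stand in for torch tensor views, and the ScratchAllocator class name is illustrative.

```python
# Hypothetical model of the scratch bump-allocator pattern described above.
# (offset, length) pairs stand in for torch tensor views into the buffer,
# so the pointer arithmetic can be shown without a GPU.

class ScratchAllocator:
    ALIGN = 64  # each request is rounded up to a 64-byte boundary

    def __init__(self, scratch_bytes: int):
        self.scratch_bytes = scratch_bytes
        self.scratch_idx = 0  # allocation pointer into the buffer

    def begin_scratch_alloc(self) -> None:
        # Reset the pointer; slices handed out earlier are implicitly recycled
        self.scratch_idx = 0

    def get_scratch_slice(self, size_bytes: int):
        # Round the request up to the alignment boundary
        aligned = -(-size_bytes // self.ALIGN) * self.ALIGN
        assert self.scratch_idx + aligned <= self.scratch_bytes, "scratch exhausted"
        view = (self.scratch_idx, size_bytes)  # stands in for a tensor narrow()
        self.scratch_idx += aligned  # advance by the aligned size
        return view

alloc = ScratchAllocator(1 << 20)
alloc.begin_scratch_alloc()
a = alloc.get_scratch_slice(100)  # starts at offset 0
b = alloc.get_scratch_slice(100)  # starts at the next 64-byte boundary, 128
```

Resetting a single pointer per forward pass avoids per-kernel allocations: every kernel in a layer draws its temporary buffers from the same preallocated region.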
Global functions:
- set_device_streams() -- Iterates over global_streams and sets each as the active CUDA stream for its device.
- get_device_stream(index) -- Returns the CUDA stream for a given device index, or None if not registered.
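The module-level stream registry can be sketched like this. It is a hedged model, not the library's code: strings stand in for torch.cuda.Stream objects, and register_stream is a hypothetical helper standing in for the registration done inside __init__.

```python
# Hypothetical model of the global_streams registry pattern described above.
# Strings stand in for torch.cuda.Stream objects so the lookup logic can be
# shown without CUDA.

global_streams: dict = {}

def register_stream(index: int, stream) -> None:
    # __init__ stores a newly created stream only if none exists for the device
    global_streams.setdefault(index, stream)

def get_device_stream(index: int):
    # Return the stream for a device index, or None if not registered
    return global_streams.get(index, None)

register_stream(0, "stream-gpu0")
register_stream(0, "stream-duplicate")  # ignored: device 0 already registered
```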
Usage
ExLlamaV2DeviceContext manages per-GPU resources during model loading and inference. It is created internally by ExLlamaV2 (the model class) for each GPU in the device map and should not normally be instantiated directly by users.
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/device.py
- Lines: L1-170
Signature
global_streams: dict = {}

def set_device_streams() -> None:
    ...

def get_device_stream(index: int) -> torch.cuda.Stream | None:
    ...

class ExLlamaV2DeviceContext:

    model: ExLlamaV2
    device_idx: int
    ready: bool
    scratch_bytes: int
    scratch_idx: int
    sin: list[torch.Tensor] | None
    cos: list[torch.Tensor] | None
    scratch: torch.Tensor | None
    stream: torch.cuda.Stream

    def __init__(
        self,
        model: ExLlamaV2,
        device_idx: int,
        scratch_bytes: int,
        archparams = None
    ):
        ...

    def prepare(self, scratch: bool) -> None:
        ...

    def drop(self) -> None:
        ...

    def free(self) -> None:
        ...

    def begin_scratch_alloc(self) -> None:
        ...

    def get_scratch_slice(self, size_bytes: int) -> torch.Tensor:
        ...

    def prepare_sincos(self) -> None:
        ...
Import
from exllamav2.device import ExLlamaV2DeviceContext, get_device_stream
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | ExLlamaV2 | Yes | The parent model instance providing config and architecture parameters |
| device_idx | int | Yes | CUDA device index (e.g., 0, 1) or -1 for CPU |
| scratch_bytes | int | Yes | Number of bytes to allocate for the scratch buffer |
| archparams | object | No | Architecture parameters; defaults to model.config.arch.lm |
| scratch | bool | Yes (prepare) | Whether to allocate the scratch buffer during preparation |
| size_bytes | int | Yes (get_scratch_slice) | Number of bytes requested for a scratch slice (64-byte aligned internally) |
| index | int | Yes (get_device_stream) | CUDA device index to look up the associated stream |
Outputs
| Name | Type | Description |
|---|---|---|
| context instance | ExLlamaV2DeviceContext | Fully initialised device context with stream, scratch buffer, and RoPE tables |
| scratch_slice | torch.Tensor | From get_scratch_slice: a narrow view into the scratch buffer of the requested size |
| stream | torch.cuda.Stream or None | From get_device_stream: the CUDA stream for the given device, or None |
| sin, cos | list[torch.Tensor] | Precomputed RoPE sine and cosine tensors of shape (1, 1, max_seq_len, head_dim) |
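As an illustration of how such tables are typically built, here is a hedged sketch of NeoX-style (concatenated) RoPE precomputation. It is not the library's exact code: plain lists stand in for half-precision torch tensors, dual-theta and position-scaling handling are omitted, and build_sincos is a hypothetical name.

```python
import math

# Hypothetical sketch of NeoX-style RoPE table precomputation. Each table is
# max_seq_len x head_dim; in the library these are held as torch tensors of
# shape (1, 1, max_seq_len, head_dim). `base` corresponds to the model's
# rotary embedding base (commonly 10000.0).

def build_sincos(max_seq_len: int, head_dim: int, base: float = 10000.0):
    # One inverse frequency per rotary pair (half the head dimension)
    inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
    sin_t, cos_t = [], []
    for pos in range(max_seq_len):
        angles = [pos * f for f in inv_freq]
        # NeoX layout: the two halves of the head dimension are concatenated,
        # so each angle appears once in each half
        row = angles + angles
        sin_t.append([math.sin(a) for a in row])
        cos_t.append([math.cos(a) for a in row])
    return sin_t, cos_t

sin_t, cos_t = build_sincos(max_seq_len=8, head_dim=4)
```

Precomputing these tables once per device lets every attention layer reuse them instead of recomputing trigonometric values on each forward pass.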
Usage Examples
Accessing Device Context Resources
from exllamav2.device import ExLlamaV2DeviceContext, get_device_stream
# Typically created internally by the model during load
# Accessing scratch memory for a kernel
ctx = model.device_context[0] # Context for GPU 0
ctx.begin_scratch_alloc()
scratch = ctx.get_scratch_slice(1024 * 1024) # 1 MB scratch slice
# Accessing precomputed RoPE tables
sin_table = ctx.sin[0] # Primary RoPE sine table
cos_table = ctx.cos[0] # Primary RoPE cosine table
Using Device Streams
from exllamav2.device import get_device_stream
import torch

stream = get_device_stream(0)
if stream is not None:
    with torch.cuda.stream(stream):
        # Perform operations on the device-specific stream
        result = tensor_a @ tensor_b
Related Pages
Implements Principle
Requires Environment
Used By
- Implementation:Turboderp_org_Exllamav2_Load_Autosplit
- Implementation:Turboderp_org_Exllamav2_Model_Init