
Implementation: turboderp-org/exllamav2 ExLlamaV2DeviceContext

From Leeroopedia
Knowledge Sources
Domains Device_Management, CUDA
Last Updated 2026-02-15 00:00 GMT

Overview

Per-device CUDA context manager that owns the scratch memory buffer, precomputed RoPE sine/cosine tables, and CUDA stream for a single GPU within an ExLlamaV2 multi-device deployment.

Description

ExLlamaV2DeviceContext encapsulates all device-specific resources needed by a single GPU during ExLlamaV2 inference. Each GPU in a model split receives its own context instance.

Key components:

  • __init__(model, device_idx, scratch_bytes, archparams) -- Initialises the context for a given CUDA device index. Creates a CUDA stream (with priority -100) if one does not already exist for that device, storing it in the module-level global_streams dictionary.
  • prepare(scratch) -- Calls prepare_sincos() to precompute RoPE tables and, if scratch is True, allocates the scratch buffer as a half-precision tensor of scratch_bytes // 2 elements.
  • drop() -- Frees the scratch buffer and RoPE tables, and marks the context as not ready.
  • free() -- Calls drop() and resets scratch_bytes to 1, effectively disabling future allocations.
  • begin_scratch_alloc() -- Resets the scratch allocation pointer (scratch_idx) to 0, preparing for a new round of scratch slice requests.
  • get_scratch_slice(size_bytes) -- Returns a narrow view into the scratch buffer of the requested size (64-byte aligned). Advances the allocation pointer. Lazily calls prepare(True) if scratch has not yet been allocated.
  • prepare_sincos() -- Precomputes sine and cosine embedding tables for Rotary Position Embeddings (RoPE). Supports both NeoX-style (concatenated) and GPT-J-style (interleaved) RoPE layouts. Handles dual-theta configurations (rotary_embedding_base_alt) and applies position scaling factors.
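The scratch-allocation methods above implement a simple bump allocator: a single preallocated buffer, a pointer that is reset per round and advanced per request, and 64-byte alignment of each slice. The following is a minimal sketch of that pattern using a plain bytearray in place of the half-precision CUDA tensor; the class and helper names are illustrative, not the actual exllamav2 code.

```python
def align64(n: int) -> int:
    """Round n up to the next multiple of 64 bytes."""
    return ((n + 63) // 64) * 64

class ScratchSketch:
    """Bump-allocator sketch mirroring begin_scratch_alloc / get_scratch_slice."""

    def __init__(self, scratch_bytes: int):
        self.scratch_bytes = scratch_bytes
        self.scratch = None        # allocated lazily, as in prepare(True)
        self.scratch_idx = 0       # current allocation pointer

    def begin_scratch_alloc(self):
        # Reset the pointer; slices from the previous round are implicitly invalidated.
        self.scratch_idx = 0

    def get_scratch_slice(self, size_bytes: int) -> memoryview:
        if self.scratch is None:   # lazy allocation on first request
            self.scratch = bytearray(self.scratch_bytes)
        size = align64(size_bytes)
        view = memoryview(self.scratch)[self.scratch_idx : self.scratch_idx + size]
        self.scratch_idx += size   # bump the pointer past this slice
        return view

ctx = ScratchSketch(1 << 20)
ctx.begin_scratch_alloc()
a = ctx.get_scratch_slice(100)    # 100 bytes rounds up to 128
b = ctx.get_scratch_slice(64)
print(len(a), ctx.scratch_idx)    # 128 192
```

Note that slices handed out before a begin_scratch_alloc() call may be overwritten afterwards, which is why the real context is only safe to reuse between kernel invocations.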

Global functions:

  • set_device_streams() -- Iterates over global_streams and sets each as the active CUDA stream for its device.
  • get_device_stream(index) -- Returns the CUDA stream for a given device index, or None if not registered.
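The registry behind these two functions is just a module-level dictionary keyed by device index. The sketch below illustrates that pattern with a string standing in for torch.cuda.Stream; register_stream is a hypothetical helper standing in for the registration done inside __init__.

```python
# Module-level stream registry, keyed by CUDA device index.
global_streams: dict = {}

def register_stream(index: int, stream) -> None:
    # __init__ only creates a stream if none is registered for the device yet.
    if index not in global_streams:
        global_streams[index] = stream

def get_device_stream(index: int):
    # Returns None for devices with no registered stream.
    return global_streams.get(index)

register_stream(0, "stream-gpu0")
register_stream(0, "stream-gpu0-dup")  # ignored: a stream already exists for device 0
print(get_device_stream(0))            # stream-gpu0
print(get_device_stream(1))            # None
```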

Usage

Use ExLlamaV2DeviceContext during model loading and inference to manage per-GPU resources. It is created internally by ExLlamaV2 (the model class) for each GPU in the device map and should not normally be instantiated directly by users.

Code Reference

Source Location

Signature

global_streams: dict = {}

def set_device_streams() -> None:
    ...

def get_device_stream(index: int) -> torch.cuda.Stream | None:
    ...

class ExLlamaV2DeviceContext:

    model: ExLlamaV2
    device_idx: int
    ready: bool
    scratch_bytes: int
    scratch_idx: int
    sin: list[torch.Tensor] | None
    cos: list[torch.Tensor] | None
    scratch: torch.Tensor | None
    stream: torch.cuda.Stream

    def __init__(
        self,
        model: ExLlamaV2,
        device_idx: int,
        scratch_bytes: int,
        archparams=None
    ):
        ...

    def prepare(self, scratch: bool) -> None:
        ...

    def drop(self) -> None:
        ...

    def free(self) -> None:
        ...

    def begin_scratch_alloc(self) -> None:
        ...

    def get_scratch_slice(self, size_bytes: int) -> torch.Tensor:
        ...

    def prepare_sincos(self) -> None:
        ...

Import

from exllamav2.device import ExLlamaV2DeviceContext, get_device_stream

I/O Contract

Inputs

  • model (ExLlamaV2, required) -- The parent model instance providing config and architecture parameters
  • device_idx (int, required) -- CUDA device index (e.g. 0, 1), or -1 for CPU
  • scratch_bytes (int, required) -- Number of bytes to allocate for the scratch buffer
  • archparams (object, optional) -- Architecture parameters; defaults to model.config.arch.lm
  • scratch (bool, required by prepare) -- Whether to allocate the scratch buffer during preparation
  • size_bytes (int, required by get_scratch_slice) -- Number of bytes requested for a scratch slice (64-byte aligned internally)
  • index (int, required by get_device_stream) -- CUDA device index to look up the associated stream

Outputs

  • context instance (ExLlamaV2DeviceContext) -- Fully initialised device context with stream, scratch buffer, and RoPE tables
  • scratch_slice (torch.Tensor) -- From get_scratch_slice: a narrow view into the scratch buffer of the requested size
  • stream (torch.cuda.Stream or None) -- From get_device_stream: the CUDA stream for the given device, or None if not registered
  • sin, cos (list[torch.Tensor]) -- Precomputed RoPE sine and cosine tensors of shape (1, 1, max_seq_len, head_dim)
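The sin/cos outputs come from the standard RoPE precomputation: one inverse frequency per dimension pair, multiplied by each position to give angles, whose sines and cosines are tabulated once up front. The sketch below shows that computation with plain Python lists rather than the half-precision CUDA tensors exllamav2 actually builds; the base, head_dim, and max_seq_len values are illustrative assumptions.

```python
import math

base = 10000.0      # rotary_embedding_base; illustrative value
head_dim = 8
max_seq_len = 4

# One inverse frequency per pair of dimensions.
inv_freq = [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

sin_table, cos_table = [], []
for pos in range(max_seq_len):
    angles = [pos * f for f in inv_freq]
    # NeoX-style layout concatenates the two halves: [a0..ak, a0..ak];
    # GPT-J-style would interleave them instead: [a0, a0, a1, a1, ...].
    row = angles + angles
    sin_table.append([math.sin(a) for a in row])
    cos_table.append([math.cos(a) for a in row])

print(len(sin_table), len(sin_table[0]))  # 4 8
```

Because the tables depend only on position, they can be computed once per device and reused across all attention layers, which is why they live on the device context rather than on each layer.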

Usage Examples

Accessing Device Context Resources

from exllamav2.device import ExLlamaV2DeviceContext, get_device_stream

# Typically created internally by the model during load
# Accessing scratch memory for a kernel
ctx = model.device_context[0]  # Context for GPU 0
ctx.begin_scratch_alloc()
scratch = ctx.get_scratch_slice(1024 * 1024)  # 1 MB scratch slice

# Accessing precomputed RoPE tables
sin_table = ctx.sin[0]  # Primary RoPE sine table
cos_table = ctx.cos[0]  # Primary RoPE cosine table

Using Device Streams

from exllamav2.device import get_device_stream
import torch

stream = get_device_stream(0)
if stream is not None:
    with torch.cuda.stream(stream):
        # Perform operations on the device-specific stream
        result = tensor_a @ tensor_b

