Implementation:Turboderp org Exllamav2 Util

From Leeroopedia
Revision as of 14:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Turboderp_org_Exllamav2_Util.md)
Knowledge Sources
Domains: Utilities, GPU Memory, Tensor Manipulation
Last Updated: 2026-02-15 00:00 GMT

Overview

The exllamav2.util module provides general-purpose utility classes and functions for timing, dynamic tensor management, CUDA synchronization, GPU memory introspection, integer partitioning, and 4-bit tensor packing/unpacking.

Description

This module collects a variety of helper primitives used throughout the ExLlamaV2 codebase:

Timer is a simple context manager that records wall-clock elapsed time between entry and exit.

timed is a decorator that tracks per-function execution time with a rolling average over the last 10 calls, printing timing information to stdout.

SeqTensor is a growable tensor container optimized for sequential appending along a configurable dimension. It pre-allocates capacity in pages of 256 elements and grows by concatenating new pages as needed, avoiding frequent re-allocation. It supports append(), truncate(), slice(), clone(), and conversion back to a standard torch.Tensor via torch(). This is used extensively in the streaming generator for accumulating token sequences.

cuda_sync_active() synchronizes only CUDA devices that have active memory allocations, avoiding the creation of unnecessary CUDA contexts on unused devices (which happens with the standard torch.cuda.synchronize()).

get_all_gpu_memory() queries VRAM usage for both NVIDIA (via nvidia-smi) and AMD (via rocm-smi) GPUs, returning a dictionary keyed by device index with total, used, and free memory in MB. It respects CUDA_VISIBLE_DEVICES.

integer_split() precisely partitions an integer into portions according to a given ratio, ensuring the portions sum exactly to the input. It supports a minimum threshold, redistributing portions that fall below it. This is the core algorithm behind tensor-parallel split computation in TPContext.

unpack_4bit() and pack_4bit() convert between packed int32 tensors (8 four-bit values per element) and unpacked uint8 tensors, used for working with 4-bit quantized weight representations.

Additional debugging helpers include list_live_tensors() (enumerates all live torch tensors via the garbage collector), set_snapshot() / diff_snapshot() (for comparing tensor allocations between two points in time), print_vram_usage() and print_vram_usage_peak() (print peak VRAM on cuda:0), and get_basic_progress() (returns a Rich progress bar instance).
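
The snapshot-and-diff idea behind set_snapshot() / diff_snapshot() can be sketched with the standard gc module alone. This is only an illustration of the technique, not the library's code: Blob is a hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch, and snapshot() and the name field are invented for the example.

```python
import gc

class Blob:
    """Hypothetical stand-in for torch.Tensor so the sketch runs without PyTorch."""
    def __init__(self, name):
        self.name = name

def snapshot(cls):
    """Map id -> name for every live, GC-tracked instance of exactly cls.

    Only ids and names are stored, so the snapshot keeps no object alive.
    Ids can be reused once an object is freed, so a diff by id is a heuristic.
    """
    return {id(obj): obj.name for obj in gc.get_objects() if type(obj) is cls}

a, b = Blob("a"), Blob("b")
before = snapshot(Blob)

c = Blob("c")   # allocated after the first snapshot
del a           # released after the first snapshot

after = snapshot(Blob)
added = sorted(after[i] for i in after.keys() - before.keys())
removed = sorted(before[i] for i in before.keys() - after.keys())
print("new:", added)        # new: ['c']
print("removed:", removed)  # removed: ['a']
```

Storing descriptions rather than the objects themselves matters here: a snapshot that held strong references would keep every object alive and hide the deallocations it is meant to reveal.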

Usage

Import individual utilities as needed. cuda_sync_active() should be preferred over torch.cuda.synchronize() in multi-GPU setups. SeqTensor is ideal whenever tokens or embeddings must be accumulated incrementally. integer_split() and get_all_gpu_memory() are used internally by tensor parallelism but can also be called directly for custom device placement logic.
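
For custom device placement, the per-GPU dictionary can be turned into split weights. The sketch below assumes nvidia-smi's CSV query output and parses it into the same dictionary shape described above; parse_nvidia_smi_csv and query_nvidia_memory are hypothetical helpers, the sample values are illustrative, and AMD (rocm-smi) and CUDA_VISIBLE_DEVICES handling are left out. The library's own parsing may differ.

```python
import subprocess

# nvidia-smi query that prints one "index, total, used, free" row per GPU (MiB)
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.total,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]

def parse_nvidia_smi_csv(text):
    """Parse rows of 'index, total, used, free' into {index: {...}}."""
    result = {}
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        index, total, used, free = (int(field.strip()) for field in line.split(","))
        result[index] = {"total": total, "used": used, "free": free}
    return result

def query_nvidia_memory():
    """Run nvidia-smi and parse its output; returns {} if the tool is unavailable."""
    try:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return {}
    return parse_nvidia_smi_csv(out)

# Illustrative output for two GPUs (not measured data)
sample = "0, 24576, 4096, 20480\n1, 16384, 4096, 12288\n"
memory = parse_nvidia_smi_csv(sample)

# Free memory per device, in index order, usable as ratio weights
weights = [memory[i]["free"] for i in sorted(memory)]
print(weights)  # [20480, 12288]
```

A list like weights is the kind of ratio input that integer_split() accepts.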

Code Reference

Source Location

Signature

class Timer:
    """Context manager that records elapsed wall-clock time."""
    start_time: float
    end_time: float
    interval: float
    def __enter__(self) -> Timer: ...
    def __exit__(self, exc_type, exc_val, exc_tb): ...

class SeqTensor:
    """Growable tensor with paged allocation along a sequence dimension."""
    PAGE_SIZE: int = 256
    tensor: torch.Tensor
    seq_dim: int
    seq_len: int
    seq_cap: int

    def __init__(
        self,
        shape: tuple,
        dtype: torch.dtype,
        seq_dim: int,
        device: torch.device = "cpu",
        init_cap: int = -1
    ): ...

    def append(self, new_data: SeqTensor | torch.Tensor | None): ...
    def truncate(self, new_len: int): ...
    def torch(self) -> torch.Tensor: ...
    def slice(self, a: int | None, b: int | None) -> SeqTensor: ...
    def clone(self, drop: int | None = None) -> SeqTensor: ...

def cuda_sync_active(): ...
def get_all_gpu_memory() -> dict[int, dict[str, int]]: ...
def integer_split(x: int, split: list[int], minimum: int = 0) -> list[int]: ...
def unpack_4bit(packed: torch.Tensor) -> torch.Tensor: ...
def pack_4bit(unpacked: torch.Tensor) -> torch.Tensor: ...

Import

from exllamav2.util import Timer, SeqTensor, cuda_sync_active, get_all_gpu_memory, integer_split

I/O Contract

Inputs (SeqTensor.__init__)

| Name | Type | Required | Description |
|---|---|---|---|
| shape | tuple | Yes | Shape of the tensor; the seq_dim dimension will be overridden by the initial capacity |
| dtype | torch.dtype | Yes | Data type of the underlying tensor |
| seq_dim | int | Yes | Which dimension is the sequence (growable) dimension; supports negative indexing |
| device | torch.device | No (default "cpu") | Device to allocate the tensor on |
| init_cap | int | No (default -1) | Initial capacity; -1 defaults to PAGE_SIZE (256) |

Inputs (integer_split)

| Name | Type | Required | Description |
|---|---|---|---|
| x | int | Yes | The integer to split into portions |
| split | list[int] | Yes | Ratio weights for each portion (e.g. available VRAM per GPU in MB) |
| minimum | int | No (default 0) | Minimum portion size; portions below this are zeroed and their count redistributed |

Inputs (unpack_4bit)

| Name | Type | Required | Description |
|---|---|---|---|
| packed | torch.Tensor | Yes | Shape (m, n//8), dtype torch.int32; each int32 holds 8 packed 4-bit values |

Inputs (pack_4bit)

| Name | Type | Required | Description |
|---|---|---|---|
| unpacked | torch.Tensor | Yes | Shape (m, n), dtype torch.uint8; n must be divisible by 8 |

Outputs

| Name | Type | Description |
|---|---|---|
| Timer.interval | float | Elapsed time in seconds between __enter__ and __exit__ |
| SeqTensor.torch() | torch.Tensor | A view of the underlying tensor trimmed to the current sequence length |
| cuda_sync_active() | None | Synchronizes all CUDA devices with active memory allocations |
| get_all_gpu_memory() | dict[int, dict[str, int]] | Per-GPU dict with keys "total", "used", "free" in MB |
| integer_split() | list[int] | List of integer portions that sum exactly to x |
| unpack_4bit() | torch.Tensor | Shape (m, n), dtype torch.uint8 with values 0-15 |
| pack_4bit() | torch.Tensor | Shape (m, n//8), dtype torch.int32 with packed 4-bit values |

Usage Examples

Timer

from exllamav2.util import Timer

with Timer() as t:
    # Perform some computation
    result = model.forward(input_ids, cache=cache)

print(f"Forward pass took {t.interval:.4f} seconds")
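
The timed decorator mentioned in the Description has no example on this page. As a rough sketch of the idea (keep the last 10 durations and print the latest call time with the rolling average), the following uses only the standard library. The name timed_sketch, its window parameter, and the recent attribute are hypothetical; the real timed in exllamav2.util may differ in signature and output format.

```python
import time
from collections import deque
from functools import wraps

def timed_sketch(window=10):
    """Hypothetical stand-in for exllamav2.util.timed; not the library's code."""
    def decorator(func):
        recent = deque(maxlen=window)  # holds only the last `window` durations

        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the duration even if the wrapped call raised
                elapsed = time.perf_counter() - start
                recent.append(elapsed)
                average = sum(recent) / len(recent)
                print(f"{func.__name__}: {elapsed:.6f} s "
                      f"(avg of last {len(recent)}: {average:.6f} s)")

        wrapper.recent = recent  # exposed for inspection in this sketch
        return wrapper
    return decorator

@timed_sketch(window=10)
def busy(n):
    return sum(range(n))

for _ in range(12):
    busy(1000)
```

A deque with maxlen discards the oldest entry automatically, so the average always covers at most the last 10 calls without any manual bookkeeping.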

SeqTensor

import torch
from exllamav2.util import SeqTensor

# Create a growable token sequence of shape (1, seq_len), growing along dim 1
seq = SeqTensor(
    shape=(1, 0),
    dtype=torch.long,
    seq_dim=1,
    device="cpu"
)

# Append tokens incrementally
seq.append(torch.tensor([[101, 2003, 1037]]))
seq.append(torch.tensor([[3231]]))

print(len(seq))       # 4
print(seq.torch())    # tensor([[101, 2003, 1037, 3231]])

# Slice and clone
first_two = seq.slice(0, 2)
print(first_two.torch())  # tensor([[101, 2003]])

# Clone while dropping the last N elements
shorter = seq.clone(drop=2)
print(shorter.torch())    # tensor([[101, 2003]])
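
To make the paged growth strategy concrete, here is a minimal list-backed sketch of the bookkeeping: capacity is pre-allocated in pages of 256 elements and extended by whole pages only when an append would overflow. PagedSequence is a hypothetical illustration, not SeqTensor itself; the real class operates on torch tensors along a configurable dimension.

```python
class PagedSequence:
    """List-backed illustration of paged growth; hypothetical, not SeqTensor."""
    PAGE_SIZE = 256

    def __init__(self):
        self.storage = [0] * self.PAGE_SIZE  # pre-allocated capacity
        self.seq_len = 0                     # number of valid elements
        self.seq_cap = self.PAGE_SIZE        # allocated capacity

    def append(self, items):
        needed = self.seq_len + len(items)
        if needed > self.seq_cap:
            # Grow by whole pages until the new data fits
            shortfall = needed - self.seq_cap
            pages = -(-shortfall // self.PAGE_SIZE)  # ceiling division
            self.storage.extend([0] * (pages * self.PAGE_SIZE))
            self.seq_cap += pages * self.PAGE_SIZE
        # Write into already-allocated slots; the list size does not change
        self.storage[self.seq_len:needed] = items
        self.seq_len = needed

    def truncate(self, new_len):
        # Shrinking only moves the length marker; capacity is kept for reuse
        self.seq_len = min(self.seq_len, new_len)

    def to_list(self):
        # Analogous to trimming the buffer to the current sequence length
        return self.storage[:self.seq_len]

seq = PagedSequence()
seq.append([101, 2003, 1037])
seq.append(list(range(300)))
print(seq.seq_len, seq.seq_cap)  # 303 512
```

Most appends land in spare capacity and cost only a write; a reallocation happens at most once per 256 appended elements.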

cuda_sync_active

from exllamav2.util import cuda_sync_active

# Synchronize only GPUs that are actually in use
# (avoids creating a CUDA context on cuda:0 if unused)
cuda_sync_active()

integer_split

from exllamav2.util import integer_split

# Split 32 KV heads across 2 GPUs with 20GB and 12GB available
portions = integer_split(32, [20480, 12288])
print(portions)  # [20, 12] (proportional, sums to 32)

# With a minimum threshold of 4
portions = integer_split(32, [20480, 12288, 512], minimum=4)
print(portions)  # Small GPU gets 0, redistributed to others
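
One way to obtain an exact-sum proportional split is the largest-remainder method, sketched below in pure Python with integer arithmetic. This is an assumption about how such a split can be done, not a copy of the library's integer_split(); proportional_split is a hypothetical name, and its handling of the minimum (zero the weight of any undersized portion, then recompute) is one plausible reading of the described behavior.

```python
def proportional_split(x, weights, minimum=0):
    """Largest-remainder split of x by weights; a sketch, not the library code."""
    weights = list(weights)
    while True:
        total = sum(weights)
        if total == 0:
            raise ValueError("no portion can satisfy the minimum")
        # Floor shares and their remainders, kept exact with integer math
        portions = [x * w // total for w in weights]
        remainders = [x * w % total for w in weights]
        # Hand the leftover units to the largest remainders
        leftover = x - sum(portions)
        order = sorted(range(len(weights)), key=lambda i: remainders[i], reverse=True)
        for i in order[:leftover]:
            portions[i] += 1
        # Drop any nonzero portion below the minimum and redistribute
        too_small = [i for i, p in enumerate(portions) if 0 < p < minimum]
        if not too_small:
            return portions
        for i in too_small:
            weights[i] = 0

print(proportional_split(32, [20480, 12288]))                  # [20, 12]
print(proportional_split(32, [20480, 12288, 512], minimum=4))  # [20, 12, 0]
print(proportional_split(10, [8, 1, 1], minimum=2))            # [10, 0, 0]
```

Rounding each share independently can leave the total off by one or more; assigning the leftover units by remainder is what guarantees the portions sum to x.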

4-bit Packing

import torch
from exllamav2.util import pack_4bit, unpack_4bit

# Pack 4-bit values into int32
values = torch.randint(0, 16, (4, 64), dtype=torch.uint8)
packed = pack_4bit(values)
print(packed.shape)   # torch.Size([4, 8])
print(packed.dtype)   # torch.int32

# Unpack back
unpacked = unpack_4bit(packed)
assert torch.equal(values, unpacked)
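
The bit layout of a single packed element can be shown with plain Python integers. The sketch assumes the lowest 4 bits hold the first value (an assumption about nibble order, not confirmed by this page); pack_word and unpack_word are hypothetical helpers that handle one 32-bit word rather than a whole tensor.

```python
def pack_word(nibbles):
    """Pack 8 values in 0..15 into one signed 32-bit integer, lowest nibble first."""
    assert len(nibbles) == 8 and all(0 <= v <= 15 for v in nibbles)
    word = 0
    for i, v in enumerate(nibbles):
        word |= v << (4 * i)
    # Reinterpret as signed, as an int32 tensor element would be
    return word - (1 << 32) if word >= (1 << 31) else word

def unpack_word(word):
    """Recover the 8 four-bit values from a signed 32-bit integer."""
    word &= 0xFFFFFFFF  # back to the unsigned bit pattern
    return [(word >> (4 * i)) & 0xF for i in range(8)]

values = [1, 2, 3, 4, 5, 6, 7, 8]
packed = pack_word(values)
print(hex(packed & 0xFFFFFFFF))  # 0x87654321
print(unpack_word(packed))       # [1, 2, 3, 4, 5, 6, 7, 8]
```

Because the eighth value occupies the top 4 bits, any value of 8 or more there sets the sign bit, which is why the packed int32 can be negative while the unpacked values remain in 0-15.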

All Exported Symbols

| Symbol | Type | Lines | Description |
|---|---|---|---|
| Timer | class | 8-15 | Context manager for wall-clock timing |
| timed | decorator | 21-37 | Decorator that logs per-call and rolling-average execution time |
| SeqTensor | class | 40-132 | Growable tensor with paged allocation along a sequence dimension |
| cuda_sync_active() | function | 135-143 | Synchronize only CUDA devices with active allocations |
| get_basic_progress() | function | 146-154 | Returns a Rich Progress bar instance |
| list_live_tensors() | function | 157-176 | Prints all live torch tensors grouped by shape/dtype/device |
| set_snapshot() | function | 181-194 | Captures the current set of live tensors for later comparison |
| diff_snapshot() | function | 197-224 | Prints new and removed tensors since the last set_snapshot() call |
| print_vram_usage() | function | 227-231 | Prints peak VRAM usage on cuda:0 (resets peak counter) |
| print_vram_usage_peak() | function | 234-237 | Prints peak VRAM usage on cuda:0 (does not reset) |
| get_all_gpu_memory() | function | 305-331 | Queries VRAM for NVIDIA and AMD GPUs, returns dict |
| integer_split() | function | 334-353 | Splits integer proportionally with exact sum guarantee |
| unpack_4bit() | function | 356-373 | Unpacks int32 tensor into uint8 4-bit values |
| pack_4bit() | function | 376-393 | Packs uint8 4-bit values into int32 tensor |
