Implementation:Turboderp org Exllamav2 Util

Knowledge Sources	Turboderp_org_Exllamav2
Domains	Utilities, GPU Memory, Tensor Manipulation
Last Updated	2026-02-15 00:00 GMT

Overview

The exllamav2.util module provides general-purpose utility classes and functions for timing, dynamic tensor management, CUDA synchronization, GPU memory introspection, integer partitioning, and 4-bit tensor packing/unpacking.

Description

This module collects a variety of helper primitives used throughout the ExLlamaV2 codebase:

Timer is a simple context manager that records wall-clock elapsed time between entry and exit.

timed is a decorator that tracks per-function execution time with a rolling average over the last 10 calls, printing timing information to stdout.
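
To make the rolling-average behaviour concrete, here is a minimal sketch of how such a decorator could be written. It is not the library's implementation: the name timed_sketch, the window parameter, and the history attribute are hypothetical and exist only for illustration.

```python
import time
from collections import deque
from functools import wraps


def timed_sketch(window: int = 10):
    """Hypothetical decorator factory: prints last-call and rolling-average times."""
    def decorator(func):
        # A bounded deque silently discards the oldest duration once full
        history = deque(maxlen=window)

        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                history.append(elapsed)
                average = sum(history) / len(history)
                print(f"{func.__name__}: {elapsed:.6f}s "
                      f"(avg of last {len(history)}: {average:.6f}s)")

        wrapper.history = history  # exposed for inspection in this sketch only
        return wrapper
    return decorator


@timed_sketch(window=10)
def busy(n: int) -> int:
    return sum(range(n))


for _ in range(12):
    busy(1000)
```

After twelve calls only the ten most recent durations remain in the window, so the printed average reflects recent behaviour rather than the whole run.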

SeqTensor is a growable tensor container optimized for sequential appending along a configurable dimension. It pre-allocates capacity in pages of 256 elements and grows by concatenating new pages as needed, avoiding frequent re-allocation. It supports append(), truncate(), slice(), clone(), and conversion back to a standard torch.Tensor via torch(). This is used extensively in the streaming generator for accumulating token sequences.

cuda_sync_active() synchronizes only the CUDA devices that have active memory allocations. This avoids creating unnecessary CUDA contexts on unused devices, which is a side effect of calling the standard torch.cuda.synchronize() on them.

get_all_gpu_memory() queries VRAM usage for both NVIDIA (via nvidia-smi) and AMD (via rocm-smi) GPUs, returning a dictionary keyed by device index with total, used, and free memory in MB. It respects CUDA_VISIBLE_DEVICES.
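
The parsing step behind a query like this can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it assumes the CSV text comes from `nvidia-smi --query-gpu=index,memory.total,memory.used,memory.free --format=csv,noheader,nounits`, that CUDA_VISIBLE_DEVICES contains plain integer indices, and that results stay keyed by the physical index. The function name parse_gpu_memory_csv is hypothetical.

```python
from typing import Optional


def parse_gpu_memory_csv(csv_text: str, visible_devices: Optional[str] = None) -> dict:
    """Parse 'index, total, used, free' rows into {index: {"total", "used", "free"}}."""
    allowed = None
    if visible_devices:
        # Assumption: only plain integer indices are handled; UUID-style entries are skipped
        allowed = {int(tok) for tok in visible_devices.split(",") if tok.strip().isdigit()}

    memory = {}
    for line in csv_text.strip().splitlines():
        index, total, used, free = (int(field) for field in line.split(","))
        if allowed is None or index in allowed:
            memory[index] = {"total": total, "used": used, "free": free}
    return memory


# Sample text in the shape the assumed nvidia-smi query would produce
sample = """0, 24576, 1024, 23552
1, 24576, 20480, 4096
2, 12288, 0, 12288"""

print(parse_gpu_memory_csv(sample, visible_devices="0,2"))
```

In a real script the sample string would be replaced by the captured stdout of the command, and the value of os.environ.get("CUDA_VISIBLE_DEVICES") would be passed as visible_devices.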

integer_split() precisely partitions an integer into portions according to a given ratio, ensuring the portions sum exactly to the input. It supports a minimum threshold, redistributing portions that fall below it. This is the core algorithm behind tensor-parallel split computation in TPContext.

unpack_4bit() and pack_4bit() convert between packed int32 tensors (8 four-bit values per element) and unpacked uint8 tensors, used for working with 4-bit quantized weight representations.

Additional debugging helpers include list_live_tensors() (enumerates all live torch tensors via the garbage collector), set_snapshot() / diff_snapshot() (for comparing tensor allocations between two points in time), print_vram_usage() and print_vram_usage_peak() (print peak VRAM on cuda:0), and get_basic_progress() (returns a Rich progress bar instance).

Usage

Import individual utilities as needed. cuda_sync_active() should be preferred over torch.cuda.synchronize() in multi-GPU setups. SeqTensor is ideal whenever tokens or embeddings must be accumulated incrementally. integer_split() and get_all_gpu_memory() are used internally by tensor parallelism but can also be called directly for custom device placement logic.

Code Reference

Source Location

Repository: Turboderp_org_Exllamav2
File: exllamav2/util.py
Lines: 1-393

Signature

class Timer:
    """Context manager that records elapsed wall-clock time."""
    start_time: float
    end_time: float
    interval: float
    def __enter__(self) -> Timer: ...
    def __exit__(self, exc_type, exc_val, exc_tb): ...

class SeqTensor:
    """Growable tensor with paged allocation along a sequence dimension."""
    PAGE_SIZE: int = 256
    tensor: torch.Tensor
    seq_dim: int
    seq_len: int
    seq_cap: int

    def __init__(
        self,
        shape: tuple,
        dtype: torch.dtype,
        seq_dim: int,
        device: torch.device = "cpu",
        init_cap: int = -1
    ): ...

    def append(self, new_data: SeqTensor | torch.Tensor | None): ...
    def truncate(self, new_len: int): ...
    def torch(self) -> torch.Tensor: ...
    def slice(self, a: int | None, b: int | None) -> SeqTensor: ...
    def clone(self, drop: int | None = None) -> SeqTensor: ...

def cuda_sync_active(): ...
def get_all_gpu_memory() -> dict[int, dict[str, int]]: ...
def integer_split(x: int, split: list[int], minimum: int = 0) -> list[int]: ...
def unpack_4bit(packed: torch.Tensor) -> torch.Tensor: ...
def pack_4bit(unpacked: torch.Tensor) -> torch.Tensor: ...

Import

from exllamav2.util import Timer, SeqTensor, cuda_sync_active, get_all_gpu_memory, integer_split

I/O Contract

Inputs (SeqTensor.__init__)

Name	Type	Required	Description
shape	tuple	Yes	Shape of the tensor; the seq_dim dimension will be overridden by the initial capacity
dtype	torch.dtype	Yes	Data type of the underlying tensor
seq_dim	int	Yes	Which dimension is the sequence (growable) dimension; supports negative indexing
device	torch.device	No (default "cpu")	Device to allocate the tensor on
init_cap	int	No (default -1)	Initial capacity; -1 defaults to PAGE_SIZE (256)

Inputs (integer_split)

Name	Type	Required	Description
x	int	Yes	The integer to split into portions
split	list[int]	Yes	Ratio weights for each portion (e.g. available VRAM per GPU in MB)
minimum	int	No (default 0)	Minimum portion size; portions that fall below this are set to zero and their share is redistributed among the remaining portions

Inputs (unpack_4bit)

Name	Type	Required	Description
packed	torch.Tensor	Yes	Shape (m, n//8), dtype torch.int32; each int32 holds 8 packed 4-bit values

Inputs (pack_4bit)

Name	Type	Required	Description
unpacked	torch.Tensor	Yes	Shape (m, n), dtype torch.uint8; n must be divisible by 8

Outputs

Name	Type	Description
Timer.interval	float	Elapsed time in seconds between __enter__ and __exit__
SeqTensor.torch()	torch.Tensor	A view of the underlying tensor trimmed to the current sequence length
cuda_sync_active()	None	Synchronizes all CUDA devices with active memory allocations
get_all_gpu_memory()	dict[int, dict[str, int]]	Per-GPU dict with keys "total", "used", "free" in MB
integer_split()	list[int]	List of integer portions that sum exactly to x
unpack_4bit()	torch.Tensor	Shape (m, n), dtype torch.uint8 with values 0-15
pack_4bit()	torch.Tensor	Shape (m, n//8), dtype torch.int32 with packed 4-bit values

Usage Examples

Timer

from exllamav2.util import Timer

with Timer() as t:
    # Perform some computation
    result = model.forward(input_ids, cache=cache)

print(f"Forward pass took {t.interval:.4f} seconds")

SeqTensor

import torch
from exllamav2.util import SeqTensor

# Create a growable token sequence of shape (1, seq_len), growing along dim 1
seq = SeqTensor(
    shape=(1, 0),
    dtype=torch.long,
    seq_dim=1,
    device="cpu"
)

# Append tokens incrementally
seq.append(torch.tensor([[101, 2003, 1037]]))
seq.append(torch.tensor([[3231]]))

print(len(seq))       # 4
print(seq.torch())    # tensor([[101, 2003, 1037, 3231]])

# Slice and clone
first_two = seq.slice(0, 2)
print(first_two.torch())  # tensor([[101, 2003]])

# Clone with drop (omit the last N elements)
trimmed = seq.clone(drop=2)
print(trimmed.torch())    # tensor([[101, 2003]])
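
The paged growth strategy described for SeqTensor can be illustrated without torch. The sketch below is a simplified one-dimensional analogue built on a Python list; the class name PagedSeq and its to_list() method are hypothetical, and the real class operates on a torch.Tensor along a configurable dimension.

```python
class PagedSeq:
    """Hypothetical 1-D analogue of a paged, growable sequence buffer."""

    PAGE_SIZE = 256

    def __init__(self):
        self.data = [0] * self.PAGE_SIZE  # storage preallocated as one page
        self.seq_len = 0                  # number of valid elements
        self.seq_cap = self.PAGE_SIZE     # number of allocated slots

    def append(self, items):
        needed = self.seq_len + len(items)
        # Grow in whole pages until the new data fits
        while self.seq_cap < needed:
            self.data.extend([0] * self.PAGE_SIZE)
            self.seq_cap += self.PAGE_SIZE
        self.data[self.seq_len:needed] = items
        self.seq_len = needed

    def truncate(self, new_len):
        assert new_len <= self.seq_len
        self.seq_len = new_len  # capacity is kept, only the logical length shrinks

    def to_list(self):
        # Only the valid prefix is returned, never the spare capacity
        return self.data[:self.seq_len]


seq = PagedSeq()
seq.append([101, 2003, 1037])
seq.append(list(range(300)))
print(seq.seq_len, seq.seq_cap)  # 303 512

seq.truncate(2)
print(seq.to_list())             # [101, 2003]
```

Because capacity only changes when a page boundary is crossed, most appends are plain writes into storage that already exists.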

cuda_sync_active

from exllamav2.util import cuda_sync_active

# Synchronize only GPUs that are actually in use
# (avoids creating a CUDA context on cuda:0 if unused)
cuda_sync_active()

integer_split

from exllamav2.util import integer_split

# Split 32 KV heads across 2 GPUs with 20GB and 12GB available
portions = integer_split(32, [20480, 12288])
print(portions)  # [20, 12] (proportional, sums to 32)

# With a minimum threshold of 4
portions = integer_split(32, [20480, 12288, 512], minimum=4)
print(portions)  # Small GPU gets 0, redistributed to others
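
For readers who want to see how an exact-sum proportional split can work, here is a minimal sketch using the largest-remainder method. It is one plausible reading of the behaviour described above, not the library's code; the function names split_largest_remainder and split_with_minimum are hypothetical, and the tie-breaking and minimum handling may differ from integer_split().

```python
def split_largest_remainder(x: int, weights: list) -> list:
    """Split x proportionally to weights so that the portions sum exactly to x."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    # Integer arithmetic avoids floating-point rounding surprises
    portions = [x * w // total for w in weights]
    remainders = [x * w % total for w in weights]
    leftover = x - sum(portions)
    # Hand the leftover units to the entries with the largest remainders
    order = sorted(range(len(weights)), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        portions[i] += 1
    return portions


def split_with_minimum(x: int, weights: list, minimum: int = 0) -> list:
    """Assumed interpretation: drop entries below `minimum`, then split again."""
    portions = split_largest_remainder(x, weights)
    if minimum and any(0 < p < minimum for p in portions):
        kept = [w if p >= minimum else 0 for w, p in zip(weights, portions)]
        return split_with_minimum(x, kept, minimum)
    return portions


print(split_largest_remainder(32, [20480, 12288]))           # [20, 12]
print(split_largest_remainder(10, [1, 1, 1]))                # [4, 3, 3]
print(split_with_minimum(32, [20480, 12288, 2048], minimum=4))  # [20, 12, 0]
```

In the last call the third entry would initially receive 2, which is below the minimum of 4, so its weight is zeroed and the 32 units are split again between the first two entries.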

4-bit Packing

import torch
from exllamav2.util import pack_4bit, unpack_4bit

# Pack 4-bit values into int32
values = torch.randint(0, 16, (4, 64), dtype=torch.uint8)
packed = pack_4bit(values)
print(packed.shape)   # torch.Size([4, 8])
print(packed.dtype)   # torch.int32

# Unpack back
unpacked = unpack_4bit(packed)
assert torch.equal(values, unpacked)
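
The bit layout behind packing eight 4-bit values into one 32-bit word can be shown with plain Python integers. This sketch assumes the first value occupies the least-significant nibble; the text above does not state the ordering used by pack_4bit(), so treat that as an assumption. The function names pack_nibbles and unpack_nibbles are hypothetical.

```python
def pack_nibbles(values: list) -> int:
    """Pack eight 4-bit values into one unsigned 32-bit word (lowest nibble first)."""
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    word = 0
    for position, value in enumerate(values):
        word |= value << (4 * position)
    return word


def unpack_nibbles(word: int) -> list:
    """Recover the eight 4-bit values by shifting and masking."""
    return [(word >> (4 * position)) & 0xF for position in range(8)]


word = pack_nibbles([1, 2, 3, 4, 5, 6, 7, 8])
print(hex(word))             # 0x87654321
print(unpack_nibbles(word))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Python integers are unbounded, so the word here is always non-negative. In a signed int32 tensor the same bit pattern reads as a negative number whenever the highest nibble is 8 or more, which is why masking with 0xF after the shift matters.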

All Exported Symbols

Symbol	Type	Lines	Description
Timer	class	8-15	Context manager for wall-clock timing
timed	decorator	21-37	Decorator that logs per-call and rolling-average execution time
SeqTensor	class	40-132	Growable tensor with paged allocation along a sequence dimension
cuda_sync_active()	function	135-143	Synchronize only CUDA devices with active allocations
get_basic_progress()	function	146-154	Returns a Rich Progress bar instance
list_live_tensors()	function	157-176	Prints all live torch tensors grouped by shape/dtype/device
set_snapshot()	function	181-194	Captures the current set of live tensors for later comparison
diff_snapshot()	function	197-224	Prints new and removed tensors since the last set_snapshot() call
print_vram_usage()	function	227-231	Prints peak VRAM usage on cuda:0 (resets peak counter)
print_vram_usage_peak()	function	234-237	Prints peak VRAM usage on cuda:0 (does not reset)
get_all_gpu_memory()	function	305-331	Queries VRAM for NVIDIA and AMD GPUs, returns dict
integer_split()	function	334-353	Splits integer proportionally with exact sum guarantee
unpack_4bit()	function	356-373	Unpacks int32 tensor into uint8 4-bit values
pack_4bit()	function	376-393	Packs uint8 4-bit values into int32 tensor

Related Pages

Environment:Turboderp_org_Exllamav2_CUDA_GPU_Runtime
