Implementation:Turboderp org Exllamav2 Util
| Knowledge Sources | |
|---|---|
| Domains | Utilities, GPU Memory, Tensor Manipulation |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The exllamav2.util module provides general-purpose utility classes and functions for timing, dynamic tensor management, CUDA synchronization, GPU memory introspection, integer partitioning, and 4-bit tensor packing/unpacking.
Description
This module collects a variety of helper primitives used throughout the ExLlamaV2 codebase:
Timer is a simple context manager that records wall-clock elapsed time between entry and exit.
timed is a decorator that tracks per-function execution time with a rolling average over the last 10 calls, printing timing information to stdout.
SeqTensor is a growable tensor container optimized for sequential appending along a configurable dimension. It pre-allocates capacity in pages of 256 elements and grows by concatenating new pages as needed, avoiding frequent re-allocation. It supports append(), truncate(), slice(), clone(), and conversion back to a standard torch.Tensor via torch(). This is used extensively in the streaming generator for accumulating token sequences.
cuda_sync_active() synchronizes only the CUDA devices that have active memory allocations. A bare torch.cuda.synchronize() call targets the current device (cuda:0 by default) and creates a CUDA context there even when that device is otherwise unused; cuda_sync_active() avoids this.
get_all_gpu_memory() queries VRAM usage for both NVIDIA (via nvidia-smi) and AMD (via rocm-smi) GPUs, returning a dictionary keyed by device index with total, used, and free memory in MB. It respects CUDA_VISIBLE_DEVICES.
integer_split() precisely partitions an integer into portions according to a given ratio, ensuring the portions sum exactly to the input. It supports a minimum threshold, redistributing portions that fall below it. This is the core algorithm behind tensor-parallel split computation in TPContext.
unpack_4bit() and pack_4bit() convert between packed int32 tensors (8 four-bit values per element) and unpacked uint8 tensors, used for working with 4-bit quantized weight representations.
Additional debugging helpers include list_live_tensors() (enumerates all live torch tensors via the garbage collector), set_snapshot() / diff_snapshot() (compare tensor allocations between two points in time), print_vram_usage() and print_vram_usage_peak() (both print peak VRAM usage on cuda:0; the former also resets the peak counter), and get_basic_progress() (returns a Rich progress bar instance).
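The rolling-average behaviour described for timed can be illustrated with a small standalone sketch. It is not the library's implementation: the name timed_sketch, the history attribute, and the use of time.perf_counter are assumptions made for illustration only.

```python
import functools
import time
from collections import deque


def timed_sketch(func):
    """Hypothetical stand-in for a `timed`-style decorator (not the library's code)."""
    history = deque(maxlen=10)  # keeps only the 10 most recent durations

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            history.append(elapsed)
            average = sum(history) / len(history)
            print(f"{func.__name__}: {elapsed:.6f} s "
                  f"(avg of last {len(history)}: {average:.6f} s)")

    wrapper.history = history  # exposed for inspection in this sketch only
    return wrapper


@timed_sketch
def busy(n):
    return sum(range(n))


for _ in range(12):
    busy(1000)
```

Because the deque is bounded, the average always reflects at most the ten latest calls, however often the function runs.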
Usage
Import individual utilities as needed. cuda_sync_active() should be preferred over torch.cuda.synchronize() in multi-GPU setups. SeqTensor is ideal whenever tokens or embeddings must be accumulated incrementally. integer_split() and get_all_gpu_memory() are used internally by tensor parallelism but can also be called directly for custom device placement logic.
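For custom device placement, per-GPU memory figures can be turned into ratio weights. The sketch below is a rough illustration of the NVIDIA side only and is not the module's code: the function names and the sample text are hypothetical, it ignores CUDA_VISIBLE_DEVICES and AMD devices, and it relies on nvidia-smi reporting values in MiB.

```python
import subprocess

# Standard nvidia-smi query flags; the parsing below is this sketch's own.
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=index,memory.total,memory.used,memory.free",
    "--format=csv,noheader,nounits",
]


def parse_nvidia_smi_csv(text):
    """Turn 'index, total, used, free' CSV rows into a nested dict."""
    result = {}
    for line in text.strip().splitlines():
        index, total, used, free = (int(field.strip()) for field in line.split(","))
        result[index] = {"total": total, "used": used, "free": free}
    return result


def query_nvidia_memory():
    """Run nvidia-smi and parse its output; returns {} if the tool is unavailable."""
    try:
        out = subprocess.run(
            NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return {}
    return parse_nvidia_smi_csv(out)


# Hypothetical sample output for two GPUs, used instead of a live query.
sample = "0, 24576, 1024, 23552\n1, 12288, 512, 11776\n"
memory = parse_nvidia_smi_csv(sample)

# Free memory per device, in index order, usable as ratio weights.
weights = [memory[i]["free"] for i in sorted(memory)]
print(weights)
```

A list like weights has the same shape as the split argument of integer_split().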
Code Reference
Source Location
- Repository: Turboderp_org_Exllamav2
- File: exllamav2/util.py
- Lines: 1-393
Signature
class Timer:
    """Context manager that records elapsed wall-clock time."""
    start_time: float
    end_time: float
    interval: float

    def __enter__(self) -> Timer: ...
    def __exit__(self, exc_type, exc_val, exc_tb): ...

class SeqTensor:
    """Growable tensor with paged allocation along a sequence dimension."""
    PAGE_SIZE: int = 256
    tensor: torch.Tensor
    seq_dim: int
    seq_len: int
    seq_cap: int

    def __init__(
        self,
        shape: tuple,
        dtype: torch.dtype,
        seq_dim: int,
        device: torch.device = "cpu",
        init_cap: int = -1
    ): ...
    def append(self, new_data: SeqTensor | torch.Tensor | None): ...
    def truncate(self, new_len: int): ...
    def torch(self) -> torch.Tensor: ...
    def slice(self, a: int | None, b: int | None) -> SeqTensor: ...
    def clone(self, drop: int | None = None) -> SeqTensor: ...

def cuda_sync_active(): ...
def get_all_gpu_memory() -> dict[int, dict[str, int]]: ...
def integer_split(x: int, split: list[int], minimum: int = 0) -> list[int]: ...
def unpack_4bit(packed: torch.Tensor) -> torch.Tensor: ...
def pack_4bit(unpacked: torch.Tensor) -> torch.Tensor: ...
Import
from exllamav2.util import Timer, SeqTensor, cuda_sync_active, get_all_gpu_memory, integer_split
I/O Contract
Inputs (SeqTensor.__init__)
| Name | Type | Required | Description |
|---|---|---|---|
| shape | tuple | Yes | Shape of the tensor; the seq_dim dimension will be overridden by the initial capacity |
| dtype | torch.dtype | Yes | Data type of the underlying tensor |
| seq_dim | int | Yes | Which dimension is the sequence (growable) dimension; supports negative indexing |
| device | torch.device | No (default "cpu") | Device to allocate the tensor on |
| init_cap | int | No (default -1) | Initial capacity; -1 defaults to PAGE_SIZE (256) |
Inputs (integer_split)
| Name | Type | Required | Description |
|---|---|---|---|
| x | int | Yes | The integer to split into portions |
| split | list[int] | Yes | Ratio weights for each portion (e.g. available VRAM per GPU in MB) |
| minimum | int | No (default 0) | Minimum portion size; portions below this are zeroed and their count redistributed |
Inputs (unpack_4bit)
| Name | Type | Required | Description |
|---|---|---|---|
| packed | torch.Tensor | Yes | Shape (m, n//8), dtype torch.int32; each int32 holds 8 packed 4-bit values |
Inputs (pack_4bit)
| Name | Type | Required | Description |
|---|---|---|---|
| unpacked | torch.Tensor | Yes | Shape (m, n), dtype torch.uint8; n must be divisible by 8 |
Outputs
| Name | Type | Description |
|---|---|---|
| Timer.interval | float | Elapsed time in seconds between __enter__ and __exit__ |
| SeqTensor.torch() | torch.Tensor | A view of the underlying tensor trimmed to the current sequence length |
| cuda_sync_active() | None | Synchronizes all CUDA devices with active memory allocations |
| get_all_gpu_memory() | dict[int, dict[str, int]] | Per-GPU dict with keys "total", "used", "free" in MB |
| integer_split() | list[int] | List of integer portions that sum exactly to x |
| unpack_4bit() | torch.Tensor | Shape (m, n), dtype torch.uint8 with values 0-15 |
| pack_4bit() | torch.Tensor | Shape (m, n//8), dtype torch.int32 with packed 4-bit values |
Usage Examples
Timer
from exllamav2.util import Timer
with Timer() as t:
    # Perform some computation
    result = model.forward(input_ids, cache=cache)

# interval is only available after the with block has exited
print(f"Forward pass took {t.interval:.4f} seconds")
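To show why interval is read after the block exits, here is a minimal sketch of how a context manager of this kind could be built. TimerSketch is a hypothetical name, and the choice of time.perf_counter is an assumption of the sketch rather than a statement about the library's clock.

```python
import time


class TimerSketch:
    """Hypothetical re-creation of a Timer-style context manager."""

    def __enter__(self):
        self.start_time = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        # The elapsed time is computed here, so it only exists after exit.
        self.end_time = time.perf_counter()
        self.interval = self.end_time - self.start_time
        return False  # do not swallow exceptions raised in the block


with TimerSketch() as t:
    total = sum(range(100_000))

print(f"Summation took {t.interval:.6f} seconds")
```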
SeqTensor
import torch
from exllamav2.util import SeqTensor
# Create a growable 1-D token sequence
seq = SeqTensor(
    shape=(1, 0),
    dtype=torch.long,
    seq_dim=1,
    device="cpu"
)
# Append tokens incrementally
seq.append(torch.tensor([[101, 2003, 1037]]))
seq.append(torch.tensor([[3231]]))
print(len(seq)) # 4
print(seq.torch()) # tensor([[101, 2003, 1037, 3231]])
# Slice and clone
first_two = seq.slice(0, 2)
print(first_two.torch()) # tensor([[101, 2003]])
# Clone with drop (copy everything except the last N elements)
trimmed = seq.clone(drop=2)
print(trimmed.torch()) # tensor([[101, 2003]])
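The paged growth strategy can be illustrated without PyTorch. The following is a list-backed analogue, not SeqTensor itself: PagedBuffer, capacity_for, and the ceiling-based rounding rule are assumptions of this sketch, and only the page size of 256 comes from the description above.

```python
PAGE_SIZE = 256  # page size stated in the module description


def capacity_for(length, page_size=PAGE_SIZE):
    """Smallest whole number of pages that holds `length` elements (at least one)."""
    pages = max(1, -(-length // page_size))  # ceiling division
    return pages * page_size


class PagedBuffer:
    """Hypothetical list-backed analogue of a paged, growable sequence."""

    def __init__(self):
        self.capacity = PAGE_SIZE
        self.storage = [None] * self.capacity
        self.length = 0
        self.reallocations = 0

    def append(self, items):
        end = self.length + len(items)
        if end > self.capacity:
            # Grow by whole pages so that most appends need no reallocation.
            new_capacity = capacity_for(end)
            self.storage.extend([None] * (new_capacity - self.capacity))
            self.capacity = new_capacity
            self.reallocations += 1
        self.storage[self.length:end] = items
        self.length = end

    def contents(self):
        """Return only the filled part, ignoring unused capacity."""
        return self.storage[:self.length]


buf = PagedBuffer()
for token in range(600):
    buf.append([token])

print(buf.length, buf.capacity, buf.reallocations)  # 600 768 2
```

Six hundred single-element appends trigger only two growth steps here, which is the benefit of allocating in pages instead of per element.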
cuda_sync_active
from exllamav2.util import cuda_sync_active
# Synchronize only GPUs that are actually in use
# (avoids creating a CUDA context on cuda:0 if unused)
cuda_sync_active()
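The idea of synchronizing only devices with live allocations might be approximated as below. This is a sketch under stated assumptions, not the module's code: sync_devices_with_allocations is a hypothetical name, the returned list is added for illustration, and the guarded import exists only so the snippet also loads where PyTorch is not installed.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def sync_devices_with_allocations():
    """Synchronize CUDA devices holding live PyTorch allocations; return their indices."""
    synced = []
    if torch is None or not torch.cuda.is_available():
        return synced
    for index in range(torch.cuda.device_count()):
        # memory_allocated() reports bytes currently held by tensors on that device.
        if torch.cuda.memory_allocated(index) > 0:
            torch.cuda.synchronize(index)
            synced.append(index)
    return synced


print(sync_devices_with_allocations())
```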
integer_split
from exllamav2.util import integer_split
# Split 32 KV heads across 2 GPUs with 20GB and 12GB available
portions = integer_split(32, [20480, 12288])
print(portions) # [20, 12] (proportional, sums to 32)
# With a minimum threshold of 4
portions = integer_split(32, [20480, 12288, 512], minimum=4)
print(portions) # Small GPU gets 0, redistributed to others
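One common way to obtain proportional integer portions with an exact sum is the largest-remainder method. The sketch below illustrates that technique only; it is not claimed to be the code inside integer_split(), the name largest_remainder_split is hypothetical, and the minimum threshold is deliberately left out. It assumes non-negative weights with a positive total.

```python
def largest_remainder_split(x, weights):
    """Split integer x proportionally to weights so the portions sum exactly to x."""
    total = sum(weights)
    # Integer floor of each exact share, plus the part lost to rounding down.
    portions = [x * w // total for w in weights]
    remainders = [x * w % total for w in weights]
    leftover = x - sum(portions)
    # Hand the leftover units to the entries that lost the most to rounding.
    by_remainder = sorted(range(len(weights)), key=lambda i: -remainders[i])
    for i in by_remainder[:leftover]:
        portions[i] += 1
    return portions


print(largest_remainder_split(32, [20480, 12288]))       # [20, 12]
print(largest_remainder_split(32, [20480, 12288, 512]))  # [20, 12, 0]
print(largest_remainder_split(10, [1, 1, 1]))            # [4, 3, 3]
```

Working in integer arithmetic avoids floating-point rounding when comparing remainders.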
4-bit Packing
import torch
from exllamav2.util import pack_4bit, unpack_4bit
# Pack 4-bit values into int32
values = torch.randint(0, 16, (4, 64), dtype=torch.uint8)
packed = pack_4bit(values)
print(packed.shape) # torch.Size([4, 8])
print(packed.dtype) # torch.int32
# Unpack back
unpacked = unpack_4bit(packed)
assert torch.equal(values, unpacked)
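The arithmetic behind storing eight 4-bit values in one 32-bit word can be shown in plain Python. The nibble order used here (first value in the lowest four bits) is an assumption of this sketch and may differ from the layout the library uses; pack_nibbles and unpack_nibbles are hypothetical names. The word is treated as unsigned, so it would appear negative if reinterpreted as a signed int32 whenever the top bit is set.

```python
def pack_nibbles(values):
    """Pack eight integers in 0..15 into one unsigned 32-bit word, low nibble first."""
    assert len(values) == 8 and all(0 <= v <= 15 for v in values)
    word = 0
    for position, value in enumerate(values):
        word |= value << (4 * position)
    return word


def unpack_nibbles(word):
    """Recover the eight 4-bit values from a 32-bit word, low nibble first."""
    return [(word >> (4 * position)) & 0xF for position in range(8)]


nibbles = [1, 2, 3, 4, 5, 6, 7, 8]
word = pack_nibbles(nibbles)
print(hex(word))             # 0x87654321
print(unpack_nibbles(word))  # [1, 2, 3, 4, 5, 6, 7, 8]
```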
All Exported Symbols
| Symbol | Type | Lines | Description |
|---|---|---|---|
| Timer | class | 8-15 | Context manager for wall-clock timing |
| timed | decorator | 21-37 | Decorator that logs per-call and rolling-average execution time |
| SeqTensor | class | 40-132 | Growable tensor with paged allocation along a sequence dimension |
| cuda_sync_active() | function | 135-143 | Synchronize only CUDA devices with active allocations |
| get_basic_progress() | function | 146-154 | Returns a Rich Progress bar instance |
| list_live_tensors() | function | 157-176 | Prints all live torch tensors grouped by shape/dtype/device |
| set_snapshot() | function | 181-194 | Captures the current set of live tensors for later comparison |
| diff_snapshot() | function | 197-224 | Prints new and removed tensors since the last set_snapshot() call |
| print_vram_usage() | function | 227-231 | Prints peak VRAM usage on cuda:0 (resets peak counter) |
| print_vram_usage_peak() | function | 234-237 | Prints peak VRAM usage on cuda:0 (does not reset) |
| get_all_gpu_memory() | function | 305-331 | Queries VRAM for NVIDIA and AMD GPUs, returns dict |
| integer_split() | function | 334-353 | Splits integer proportionally with exact sum guarantee |
| unpack_4bit() | function | 356-373 | Unpacks int32 tensor into uint8 4-bit values |
| pack_4bit() | function | 376-393 | Packs uint8 4-bit values into int32 tensor |