Implementation:Mit han lab Llm awq Device warmup

Overview

device_warmup and tune_all_wqlinears are the utilities the llm-awq library provides for GPU warmup and for auto-tuning its quantized kernels.

tinychat/utils/tune.py, Lines 10-81

def device_warmup(device: str):

def tune_all_wqlinears(model, measure_iters: int = 1000):

from tinychat.utils.tune import device_warmup, tune_all_wqlinears

device_warmup

Inputs:

device (str) - the device to warm up

Output:

None - warms up the GPU by running 8192x8192 matrix multiplications, which initializes the CUDA context, memory allocator, and GPU caches
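
One common reason to warm up a device before benchmarking is that the first few calls pay one-off initialization costs that would otherwise skew the timings. The following is a minimal, library-independent sketch of that pattern in plain Python. It is not the body of device_warmup, and the name median_latency and its parameters are hypothetical, chosen only for illustration.

```python
import statistics
import time
from typing import Callable


def median_latency(fn: Callable[[], object],
                   warmup_iters: int = 10,
                   measure_iters: int = 100) -> float:
    """Return the median wall-clock latency of fn in seconds.

    Hypothetical helper: the first warmup_iters calls are not timed,
    so one-off setup costs do not appear in the samples.
    """
    # Untimed warmup phase.
    for _ in range(warmup_iters):
        fn()

    # Timed phase: one sample per call.
    samples = []
    for _ in range(measure_iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)

    return statistics.median(samples)
```

On a GPU, kernel launches are asynchronous, so a real measurement would also need to synchronize the device (for example with torch.cuda.synchronize()) before reading the clock. That step is left out here to keep the sketch free of dependencies.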

tune_all_wqlinears

Inputs:

model (nn.Module) - the loaded model containing WQLinear layers
measure_iters (int, default 1000) - number of benchmark iterations per configuration

Output:

None - sets the split_k_iters attribute of each WQLinear layer in place to the best value found for its shape

Internal behavior:

Scans the model for all WQLinear modules
Groups them by unique (in_features, out_features) shapes
For each unique shape, calls tune_wqlinear which benchmarks split_k_iters values of 1, 2, 4, 8, 16, and 32
Selects the value with the lowest median latency
Applies the optimal split_k_iters to all WQLinear layers sharing that shape
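
The steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code in tinychat/utils/tune.py: FakeWQLinear is a hypothetical stand-in for the real WQLinear class, tune_by_shape is a hypothetical name, and the measure callback replaces the actual kernel benchmark so the example runs without a GPU.

```python
import statistics
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple

# Candidate values listed in the description above.
CANDIDATES = (1, 2, 4, 8, 16, 32)

Shape = Tuple[int, int]


@dataclass
class FakeWQLinear:
    """Hypothetical stand-in for a WQLinear layer (not the real class)."""
    in_features: int
    out_features: int
    split_k_iters: int = 8


def tune_by_shape(layers: Iterable[FakeWQLinear],
                  measure: Callable[[Shape, int], float],
                  measure_iters: int = 1000) -> Dict[Shape, int]:
    """Pick the split_k_iters with the lowest median latency per shape.

    measure(shape, k) is an assumed callback returning one latency sample.
    """
    # Group layers by their (in_features, out_features) shape.
    groups = defaultdict(list)
    for layer in layers:
        groups[(layer.in_features, layer.out_features)].append(layer)

    best_by_shape = {}
    for shape, members in groups.items():
        # Benchmark every candidate once per unique shape.
        medians = {
            k: statistics.median(measure(shape, k) for _ in range(measure_iters))
            for k in CANDIDATES
        }
        best = min(medians, key=medians.get)

        # Apply the winner to all layers that share this shape.
        for layer in members:
            layer.split_k_iters = best
        best_by_shape[shape] = best

    return best_by_shape
```

Unlike tune_all_wqlinears, which returns None, this sketch returns the chosen value per shape so the result is easy to inspect. Benchmarking once per unique shape rather than once per layer keeps the tuning cost proportional to the number of distinct shapes in the model.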

