Implementation:Mit han lab Llm awq Device warmup

From Leeroopedia
Revision as of 13:16, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Mit_han_lab_Llm_awq_Device_warmup.md)

Overview

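Two utilities from the llm-awq library: device_warmup, which warms up the GPU, and tune_all_wqlinears, which auto-tunes the split_k_iters setting of the quantized WQLinear kernels in a loaded model.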

Source

tinychat/utils/tune.py, Lines 10-81

Signatures

def device_warmup(device: str):

def tune_all_wqlinears(model, measure_iters: int = 1000):

Import

from tinychat.utils.tune import device_warmup, tune_all_wqlinears

I/O

device_warmup

Inputs:

  • device (str) - the target device, e.g. "cuda:0"

Output:

  • None - as a side effect, runs 8192x8192 matrix multiplications on the target device so that the CUDA context, memory allocator, and GPU caches are initialized
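
The warmup behavior of device_warmup can be approximated with a few lines of PyTorch. This is a minimal sketch, not the library's code: the function name warmup_sketch, the size and iters parameters, the default of 100 iterations, and the explicit synchronization are all assumptions made for illustration. Only the 8192x8192 matrix size comes from the description above.

```python
import torch


def warmup_sketch(device: str, size: int = 8192, iters: int = 100) -> None:
    """Run repeated square matmuls on `device` to absorb one-time setup costs.

    Hypothetical helper: `size` matches the 8192x8192 shape described above,
    while `iters` is an assumed value, not taken from llm-awq.
    """
    # Allocating on the device triggers context and allocator initialization.
    x = torch.randn((size, size), device=device)
    for _ in range(iters):
        torch.mm(x, x)
    if x.is_cuda:
        # CUDA kernels launch asynchronously, so wait for them to finish.
        torch.cuda.synchronize(x.device)
```

Calling warmup_sketch("cuda:0") before any timing run would keep one-time initialization costs out of the measurements. A small size such as warmup_sketch("cpu", size=64, iters=2) is enough to try the function on a machine without a GPU.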

tune_all_wqlinears

Inputs:

  • model (nn.Module) - the loaded model containing WQLinear layers
  • measure_iters (int, default 1000) - number of benchmark iterations per configuration

Output:

  • None - sets split_k_iters in place on each WQLinear layer to the best-performing value found for its shape
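
To illustrate what measure_iters controls, the sketch below shows one way a fixed number of timed runs might be reduced to a single latency figure. It is a generic CPU-side timing helper under stated assumptions: the name median_latency_ms and the warmup_iters parameter are hypothetical, and it does not reproduce the library's own timing code. Timing GPU work this way would additionally require synchronizing the device before each clock read.

```python
import time
from statistics import median
from typing import Callable


def median_latency_ms(
    fn: Callable[[], object],
    measure_iters: int = 1000,
    warmup_iters: int = 10,
) -> float:
    """Return the median wall-clock latency of `fn` in milliseconds.

    Hypothetical helper: `warmup_iters` untimed calls run first, then
    `measure_iters` timed calls are collected and their median is returned.
    """
    for _ in range(warmup_iters):
        fn()
    samples = []
    for _ in range(measure_iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return median(samples)
```

A larger measure_iters gives a more stable median at the cost of a longer tuning run, since every candidate configuration is timed that many times.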

Internal behavior:

  • Scans the model for all WQLinear modules
  • Groups them by unique (in_features, out_features) shapes
  • For each unique shape, calls tune_wqlinear, which benchmarks split_k_iters values of 1, 2, 4, 8, 16, and 32
  • Selects the value with the lowest median latency
  • Applies the optimal split_k_iters to all WQLinear layers sharing that shape
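
The steps above can be sketched in plain Python without a GPU. Everything here is illustrative: FakeWQLinear is a hypothetical stand-in for the real WQLinear class, the measure callback replaces the actual kernel benchmark, and tune_all_sketch returns the chosen values for inspection, whereas tune_all_wqlinears returns None.

```python
from collections import defaultdict
from statistics import median
from typing import Callable, Dict, Iterable, List, Tuple

Shape = Tuple[int, int]
CANDIDATES = (1, 2, 4, 8, 16, 32)  # split_k_iters values listed above


class FakeWQLinear:
    """Hypothetical stand-in for a WQLinear layer (not the real class)."""

    def __init__(self, in_features: int, out_features: int) -> None:
        self.in_features = in_features
        self.out_features = out_features
        self.split_k_iters = 8  # arbitrary starting value for the sketch


def pick_split_k(
    shape: Shape, measure: Callable[[Shape, int], float], measure_iters: int
) -> int:
    """Return the candidate with the lowest median measured latency."""
    best_k, best_latency = CANDIDATES[0], float("inf")
    for k in CANDIDATES:
        latency = median(measure(shape, k) for _ in range(measure_iters))
        if latency < best_latency:
            best_k, best_latency = k, latency
    return best_k


def tune_all_sketch(
    layers: Iterable[FakeWQLinear],
    measure: Callable[[Shape, int], float],
    measure_iters: int = 5,
) -> Dict[Shape, int]:
    """Group layers by shape, tune once per shape, apply the result to all."""
    by_shape: Dict[Shape, List[FakeWQLinear]] = defaultdict(list)
    for layer in layers:
        by_shape[(layer.in_features, layer.out_features)].append(layer)

    chosen: Dict[Shape, int] = {}
    for shape, group in by_shape.items():
        k = pick_split_k(shape, measure, measure_iters)
        for layer in group:
            layer.split_k_iters = k  # in-place update, as described above
        chosen[shape] = k
    return chosen
```

Grouping by shape means each unique (in_features, out_features) pair is benchmarked once, however many layers share it, which keeps tuning time proportional to the number of distinct shapes instead of the number of layers.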

Related Pages

Knowledge Sources

Domains

  • Inference
  • Optimization
