Implementation:Mit han lab Llm awq Device warmup
Overview
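Concrete tools for GPU warmup and quantized-kernel auto-tuning provided by the llm-awq library: device_warmup prepares the GPU before inference, and tune_all_wqlinears selects a split_k_iters value for each WQLinear layer.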
Source
tinychat/utils/tune.py, Lines 10-81
Signatures
def device_warmup(device: str):
def tune_all_wqlinears(model, measure_iters: int = 1000):
Import
from tinychat.utils.tune import device_warmup, tune_all_wqlinears
I/O
device_warmup
Inputs:
- device (str) - the target device, e.g. "cuda:0"
Output:
- None - warms up the GPU by running 8192x8192 matrix multiplications to initialize CUDA contexts, memory allocators, and GPU caches
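The warmup described above can be approximated with a few lines of PyTorch. This is a minimal sketch, not the code in tinychat/utils/tune.py: the name warmup_sketch, the size parameter, and the iteration count are assumptions made for illustration, and only the 8192x8192 default follows the description on this page.

```python
def warmup_sketch(device: str, size: int = 8192, iters: int = 10) -> None:
    """Hypothetical warmup helper; name, `size`, and `iters` are assumptions."""
    # Imported lazily so that defining this function does not require PyTorch.
    import torch

    # Allocate one square matrix directly on the target device.
    a = torch.randn(size, size, device=device)

    # Repeated matmuls force context creation and allocator/cache warmup.
    # The results are discarded; only the side effects matter.
    for _ in range(iters):
        torch.mm(a, a)

    # CUDA kernels launch asynchronously, so wait for them to finish.
    dev = torch.device(device)
    if dev.type == "cuda":
        torch.cuda.synchronize(dev)
```

Calling warmup_sketch("cuda:0") once before any timing run keeps one-time initialization costs out of the measured latencies.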
tune_all_wqlinears
Inputs:
- model (nn.Module) - the loaded model containing WQLinear layers
- measure_iters (int, default 1000) - number of benchmark iterations per configuration
Output:
- None - sets the optimal split_k_iters attribute on each WQLinear layer in-place
Internal behavior:
- Scans the model for all WQLinear modules
- Groups them by unique (in_features, out_features) shapes
- For each unique shape, calls tune_wqlinear which benchmarks split_k_iters values of 1, 2, 4, 8, 16, and 32
- Selects the value with the lowest median latency
- Applies the optimal split_k_iters to all WQLinear layers sharing that shape
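The steps above can be sketched in plain Python. This is an illustration of the grouping and selection logic only, under stated assumptions: ToyLayer, tune_layers, and the injected sample_latencies callable are hypothetical names, and the real tune_wqlinear benchmarks the actual quantized kernel on the GPU rather than calling a user-supplied function.

```python
import statistics
from collections import defaultdict

# Candidate values listed on this page.
CANDIDATES = (1, 2, 4, 8, 16, 32)


class ToyLayer:
    """Hypothetical stand-in for a WQLinear layer (not the real class)."""

    def __init__(self, in_features: int, out_features: int):
        self.in_features = in_features
        self.out_features = out_features
        self.split_k_iters = 8  # arbitrary starting value for the sketch


def tune_layers(layers, sample_latencies, measure_iters: int = 1000):
    """Set split_k_iters on every layer in place; return the choice per shape.

    sample_latencies(shape, k, iters) is an assumed callable that returns a
    list of `iters` latency samples for one shape and one candidate value.
    """
    # Group layers by shape so each unique shape is benchmarked only once.
    by_shape = defaultdict(list)
    for layer in layers:
        by_shape[(layer.in_features, layer.out_features)].append(layer)

    best_by_shape = {}
    for shape, group in by_shape.items():
        # Median latency per candidate; the median resists outlier samples.
        medians = {
            k: statistics.median(sample_latencies(shape, k, measure_iters))
            for k in CANDIDATES
        }
        best = min(medians, key=medians.get)
        best_by_shape[shape] = best
        # Apply the winning value to every layer that shares this shape.
        for layer in group:
            layer.split_k_iters = best
    return best_by_shape
```

Benchmarking once per unique shape rather than once per layer matters because transformer blocks repeat the same few linear shapes many times, so the tuning cost scales with the number of distinct shapes instead of the layer count.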
Related Pages
- Principle:Mit_han_lab_Llm_awq_Device_Warmup_and_Kernel_Tuning
- Environment:Mit_han_lab_Llm_awq_CUDA_Build_Environment
- Heuristic:Mit_han_lab_Llm_awq_Kernel_Selection_Thresholds
Knowledge Sources
- Repo: llm-awq (https://github.com/mit-han-lab/llm-awq)
Domains
- Inference
- Optimization