Implementation:Mit han lab Llm awq WQLinear from linear
Overview
A concrete tool, provided by the llm-awq library, for creating INT4-quantized linear layers from standard nn.Linear layers.
Source
File: awq/quantize/qmodule.py, Lines 139-199 (from_linear classmethod), Lines 78-235 (full class)
Signature
class WQLinear(nn.Module):
    def __init__(self, w_bit, group_size, in_features, out_features, bias, dev, dtype=torch.float16):
        ...

    @classmethod
    def from_linear(cls, linear, w_bit, group_size, init_only=False, scales=None, zeros=None):
        ...
Import
from awq.quantize.qmodule import WQLinear
I/O
Inputs (from_linear)
- linear (nn.Linear) - the standard linear layer to quantize
- w_bit (int) - weight bit width, must be 4
- group_size (int) - quantization group size, typically 128
- init_only (bool) - if True, create an empty shell without packing weights
- scales (torch.Tensor, optional) - pre-computed quantization scales
- zeros (torch.Tensor, optional) - pre-computed quantization zeros
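To make the roles of group_size, scales, and zeros more concrete, here is a minimal pure-Python sketch of group-wise asymmetric min/max quantization. It is an illustration under stated assumptions, not the llm-awq implementation: the helper names quantize_groups and dequantize_groups are hypothetical, and the exact formula the library uses to derive scales and zeros may differ.

```python
from typing import List, Tuple


def quantize_groups(
    row: List[float], group_size: int, w_bit: int = 4
) -> Tuple[List[int], List[float], List[int]]:
    """Quantize one weight row group by group (hypothetical helper)."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    q_max = 2 ** w_bit - 1  # 15 for 4-bit weights
    codes: List[int] = []
    scales: List[float] = []
    zeros: List[int] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        lo, hi = min(group), max(group)
        # One scale and one zero point per group; the floor avoids division by zero.
        scale = max(hi - lo, 1e-5) / q_max
        zero = min(max(round(-lo / scale), 0), q_max)
        scales.append(scale)
        zeros.append(zero)
        for w in group:
            q = round(w / scale) + zero
            codes.append(min(max(q, 0), q_max))  # clamp to the 4-bit range
    return codes, scales, zeros


def dequantize_groups(
    codes: List[int], scales: List[float], zeros: List[int], group_size: int
) -> List[float]:
    """Map integer codes back to approximate float weights."""
    return [
        (q - zeros[i // group_size]) * scales[i // group_size]
        for i, q in enumerate(codes)
    ]


row = [-1.0, -0.5, 0.0, 0.5, -0.3, 0.1, 0.3, 0.6]
codes, scales, zeros = quantize_groups(row, group_size=4)
approx = dequantize_groups(codes, scales, zeros, group_size=4)
```

A smaller group_size gives each scale fewer weights to cover, which tends to reduce rounding error at the cost of storing more scales and zeros.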
Output
- WQLinear instance with packed qweight, scales, and scaled_zeros buffers
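As a rough illustration of what "packed" means for 4-bit weights, the sketch below stores eight 4-bit codes in one 32-bit word, lowest nibble first. This is a generic layout chosen for clarity; the word width and element ordering that llm-awq actually uses for qweight are not stated on this page and may differ. The names pack_int4 and unpack_int4 are hypothetical.

```python
from typing import List


def pack_int4(codes: List[int], word_bits: int = 32) -> List[int]:
    """Pack 4-bit codes into integer words, lowest nibble first (generic layout)."""
    per_word = word_bits // 4
    if len(codes) % per_word != 0:
        raise ValueError("number of codes must be a multiple of codes per word")
    words: List[int] = []
    for start in range(0, len(codes), per_word):
        word = 0
        for i, code in enumerate(codes[start:start + per_word]):
            if not 0 <= code <= 15:
                raise ValueError("each code must fit in 4 bits")
            word |= code << (4 * i)  # place the code in its own nibble
        words.append(word)
    return words


def unpack_int4(words: List[int], word_bits: int = 32) -> List[int]:
    """Inverse of pack_int4: recover the 4-bit codes from each word."""
    per_word = word_bits // 4
    return [(w >> (4 * i)) & 0xF for w in words for i in range(per_word)]


packed = pack_int4([1, 2, 3, 4, 5, 6, 7, 8])
restored = unpack_int4(packed)
```

Packing this way cuts weight storage to a quarter of float16, which is the main reason a quantized layer keeps integer codes plus per-group scales rather than full-precision weights.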
Forward Method
The forward method dispatches to a GEMV kernel when the batch size is below 8 and to a GEMM kernel otherwise; both are invoked through awq_inference_engine.
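The dispatch rule can be sketched as a small selection function. This is only a model of the threshold described above: select_kernel and GEMV_MAX_BATCH are hypothetical names, and the real forward method calls compiled kernels rather than returning a string.

```python
GEMV_MAX_BATCH = 8  # threshold taken from the description above


def select_kernel(batch_size: int) -> str:
    """Return which kernel family a given batch size would be routed to."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    # Small batches go to the matrix-vector path, larger ones to matrix-matrix.
    return "gemv" if batch_size < GEMV_MAX_BATCH else "gemm"


choices = {n: select_kernel(n) for n in (1, 7, 8, 64)}
```

Under this rule, single-token decoding would take the GEMV path, while longer prompts or larger batches would take the GEMM path.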
Related Pages
- Principle:Mit_han_lab_Llm_awq_Quantized_Linear_Module
- Environment:Mit_han_lab_Llm_awq_CUDA_Build_Environment
- Heuristic:Mit_han_lab_Llm_awq_Kernel_Selection_Thresholds
Knowledge Sources
- Repo|llm-awq|https://github.com/mit-han-lab/llm-awq
Domains
- Quantization
- Inference