Implementation:Mit_han_lab_Llm_awq_Real_quantize_model_weight
Overview
Concrete tool provided by the llm-awq library for converting model weights from FP16 to a packed INT4 format.
Source
File: awq/quantize/quantizer.py, Lines: 125-165
Signature
@torch.no_grad()
def real_quantize_model_weight(model, w_bit, q_config, init_only=False):
Import
from awq.quantize.quantizer import real_quantize_model_weight
I/O
Inputs:
- model (nn.Module) - the model to quantize
- w_bit (int) - bit width, typically 4
- q_config (dict) - quantization configuration, e.g. {"zero_point": True, "q_group_size": 128}
- init_only (bool) - if True, creates empty WQLinear shells without quantizing
Output:
- None (the model is modified in-place; every nn.Linear is replaced with a WQLinear module)
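The in-place replacement behavior can be illustrated with a minimal pure-Python sketch. LinearStub, WQLinearStub, and ModuleStub are hypothetical stand-ins invented here, not the real torch.nn.Linear, WQLinear, or nn.Module types; the sketch mirrors the init_only=True path, where empty shells are created and no weights are quantized:

```python
class LinearStub:
    """Stand-in for torch.nn.Linear (hypothetical, for this sketch only)."""
    def __init__(self, in_features, out_features):
        self.in_features = in_features
        self.out_features = out_features

class WQLinearStub:
    """Stand-in for the packed WQLinear module produced by llm-awq."""
    def __init__(self, w_bit, group_size, in_features, out_features):
        self.w_bit = w_bit
        self.group_size = group_size
        self.in_features = in_features
        self.out_features = out_features

class ModuleStub:
    """Stand-in for an nn.Module container with named children."""
    def __init__(self, **children):
        self.children = children

def replace_linears(module, w_bit, q_config):
    # Walk the module tree and swap every LinearStub for an empty
    # WQLinearStub shell -- mirroring the init_only=True path, where
    # no weights are actually quantized yet.
    for name, child in module.children.items():
        if isinstance(child, LinearStub):
            module.children[name] = WQLinearStub(
                w_bit, q_config["q_group_size"],
                child.in_features, child.out_features)
        elif isinstance(child, ModuleStub):
            replace_linears(child, w_bit, q_config)

model = ModuleStub(
    attn=ModuleStub(q_proj=LinearStub(4096, 4096)),
    mlp=ModuleStub(up_proj=LinearStub(4096, 11008)),
)
replace_linears(model, w_bit=4,
                q_config={"zero_point": True, "q_group_size": 128})
```

After the call, every stand-in Linear in the tree has been replaced, which is why the real function returns None: the mutation happens on the model object itself.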
Notes
Internally calls pseudo_quantize_tensor to compute per-group scales and zero points, then WQLinear.from_linear to build the packed INT4 modules.
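The grouped zero-point scheme implied by q_config can be sketched in pure Python. This is a simplified scalar version for one group of q_group_size weights; the real pseudo_quantize_tensor is vectorized in PyTorch and its exact rounding/clamping details may differ:

```python
def quantize_group(weights, w_bit=4):
    # Asymmetric (zero-point) quantization of a single weight group.
    qmax = 2 ** w_bit - 1                       # 15 for INT4
    w_max, w_min = max(weights), min(weights)
    scale = (w_max - w_min) / qmax if w_max != w_min else 1.0
    # Integer offset that maps w_min near code 0, clamped to the INT4 range.
    zero = min(qmax, max(0, round(-w_min / scale)))
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero

def dequantize_group(codes, scale, zero):
    # Reconstruct approximate FP values from the integer codes.
    return [(c - zero) * scale for c in codes]

group = [-0.5, -0.1, 0.0, 0.2, 0.45, 0.5]
codes, scale, zero = quantize_group(group)
recon = dequantize_group(codes, scale, zero)
```

Every code fits in 4 bits, and the reconstruction error per weight is bounded by about half a quantization step (scale / 2), which is the error that activation-aware scaling in AWQ works to keep small for salient channels.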
Related Pages
- Principle:Mit_han_lab_Llm_awq_INT4_Weight_Packing
- Environment:Mit_han_lab_Llm_awq_Python_Runtime_Environment
- Heuristic:Mit_han_lab_Llm_awq_GPU_Memory_Management_Patterns
Knowledge Sources
- Repo|llm-awq|https://github.com/mit-han-lab/llm-awq
Domains
- Quantization
- Model_Compression