Principle: mit-han-lab/llm-awq AWQ HuggingFace Export
Overview
The process of packaging AWQ-quantized model weights with a HuggingFace-compatible configuration and uploading them to the Hub for public distribution.
Description
The HuggingFace Transformers library (>=4.34) natively supports loading AWQ-quantized models via the AwqConfig quantization configuration class. Sharing a quantized model requires the following steps:
- The original model's config is updated with quantization metadata including bits, group_size, zero_point, backend, and version
- The tokenizer and updated config are pushed to a Hub repository
- The quantized checkpoint is uploaded as pytorch_model.bin
This enables anyone to load the quantized model with a single from_pretrained() call, without needing to know the details of the quantization process. The AwqConfig object stores all necessary information for the Transformers library to correctly initialize the quantized model layers.
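The loading path described above can be sketched as follows. The repository id in the usage comment is a hypothetical placeholder, and transformers>=4.34 is assumed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer


def load_awq_model(repo_id: str):
    """Load a published AWQ-quantized model with a single from_pretrained() call.

    Transformers reads the quantization_config stored in config.json and
    initializes the quantized layers automatically, so the caller needs no
    knowledge of the original quantization run.
    """
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return model, tokenizer


# Example (hypothetical repo id; downloads weights from the Hub):
# model, tokenizer = load_awq_model("my-org/llama-2-7b-awq")
```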
The export process bridges the gap between the llm-awq quantization pipeline (which produces raw checkpoint files) and the HuggingFace ecosystem (which expects standardized model repositories with config.json and tokenizer files).
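Concretely, after export the repository's config.json carries a quantization_config block alongside the usual model fields. The values below are assumed examples for a 4-bit, group-size-128 checkpoint:

```python
# Assumed example of the "quantization_config" entry written into config.json.
quantization_config = {
    "quant_method": "awq",  # tells Transformers which quantizer produced the weights
    "bits": 4,              # weight bit-width
    "group_size": 128,      # quantization group size
    "zero_point": True,     # asymmetric quantization with zero points
    "version": "gemm",      # kernel variant
    "backend": "llm-awq",   # weights come from the llm-awq pipeline
}
```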
Usage
After quantization, share a model on the HuggingFace Hub as follows:
- Quantize the model using the AWQ pipeline to produce a checkpoint file
- Create an AwqConfig with the quantization parameters (bits, group_size, zero_point, backend, version)
- Load the original model's config and attach the quantization config
- Push the config and tokenizer to a Hub repository
- Upload the quantized weights as pytorch_model.bin
Related Pages
Knowledge Sources
- Repo|llm-awq|https://github.com/mit-han-lab/llm-awq
- Doc|HuggingFace AWQ|https://huggingface.co/docs/transformers/quantization/awq
Domains
- Deployment
- Model_Distribution