Workflow: mit-han-lab llm-awq HuggingFace Model Export
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Distribution, Quantization |
| Last Updated | 2025-04-01 00:00 GMT |
Overview
End-to-end process for converting AWQ-quantized model checkpoints into HuggingFace-compatible format and publishing them to the HuggingFace Hub.
Description
This workflow converts AWQ-quantized weights from the native .pt format into HuggingFace-compatible model repositories. As of transformers 4.34, HuggingFace natively supports loading AWQ-quantized models via the AwqConfig quantization configuration. This workflow creates the proper AwqConfig metadata, attaches it to the model configuration, packages the tokenizer, and uploads the quantized weights to a HuggingFace Hub repository, making the quantized model available for easy download and inference via the standard from_pretrained API.
Usage
Execute this workflow after you have a real-quantized AWQ checkpoint (.pt file) and want to share the quantized model through the HuggingFace Hub or use it with HuggingFace-native inference tools (such as vLLM, TGI, or transformers pipelines) that support the AWQ format.
Execution Steps
Step 1: Authenticate with HuggingFace Hub
Ensure you are logged in to the HuggingFace Hub using the CLI tool. This is required to create repositories and push model files to the Hub.
Key considerations:
- Run huggingface-cli login before executing the conversion script
- Requires a write-access token for the target Hub organization or user account
Step 2: Create AWQ Quantization Config
Construct the AwqConfig object specifying the quantization parameters used during the AWQ process. This includes the bit width (typically 4), group size (typically 128), zero-point setting, backend identifier, and GEMM version. This configuration is attached to the model's config.json so that HuggingFace transformers can automatically detect and load the quantized model correctly.
Key considerations:
- The backend should be set to "llm-awq" to indicate the source quantization library
- The version field specifies the GEMM kernel variant (e.g., "gemv")
- This metadata enables automatic quantized loading via from_pretrained
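As a sketch, the metadata this step produces can be pictured as the dict below. With transformers this object is built via AwqConfig (shown in the comments); the serialized field names here are assumptions based on the "awq" quant_method entry that ends up in config.json:

```python
# With transformers >= 4.34 the equivalent object is built as:
#   from transformers import AwqConfig
#   quant_config = AwqConfig(bits=4, group_size=128, zero_point=True,
#                            version="gemv", backend="llm-awq")
# The serialized config.json entry looks roughly like this (field
# names are assumptions based on the "awq" quant_method):
quantization_config = {
    "quant_method": "awq",   # tells transformers which quantized loader to use
    "bits": 4,               # weight bit width
    "group_size": 128,       # channels per quantization group
    "zero_point": True,      # asymmetric quantization with zero points
    "version": "gemv",       # GEMM kernel variant
    "backend": "llm-awq",    # source quantization library
}
print(quantization_config["quant_method"])
```

This is the metadata that from_pretrained inspects to decide that the weights are AWQ-quantized and which kernel to dispatch to.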
Step 3: Load and Update Model Configuration
Load the AutoConfig from the original (unquantized) model path and attach the AwqConfig as the quantization_config attribute. Also load the tokenizer from the original model. These are pushed to the target Hub repository first, establishing the model card and tokenizer files.
Key considerations:
- The original model path provides the base config.json and tokenizer files
- Only the quantization_config field is added; all other model configuration remains unchanged
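A minimal sketch of this step, assuming the transformers API shown in the comments; since the real calls require the original model files and Hub access, the executable part below simulates the same edit on a raw config.json dict with illustrative values:

```python
import json

# In the real workflow the config and tokenizer come from the original model:
#   config = AutoConfig.from_pretrained(original_model_path)
#   config.quantization_config = quant_config
#   config.push_to_hub(repo_id)
#   tokenizer = AutoTokenizer.from_pretrained(original_model_path)
#   tokenizer.push_to_hub(repo_id)
# Here we apply the equivalent edit to a plain config dict.
base_config = {"model_type": "llama", "hidden_size": 4096}  # illustrative values

# Only quantization_config is added; every other field is left untouched.
base_config["quantization_config"] = {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
}
print(json.dumps(base_config, indent=2))
```

Pushing the config and tokenizer first means the repository already carries valid loading metadata before the large weights file arrives.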
Step 4: Upload Quantized Weights
Upload the quantized weights file (.pt) to the Hub repository as pytorch_model.bin. The HuggingFace API handles the upload, including large file storage (LFS) for the binary weights file.
Key considerations:
- The weights file is uploaded as a single pytorch_model.bin file
- The export runs in single-GPU mode, so the checkpoint stays a single unsharded weights file
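The upload can be sketched as follows. The checkpoint filename and repo id are hypothetical, and the huggingface_hub call is shown only in comments since it needs network access and a write token; the executable part demonstrates just the local rename to the canonical filename:

```python
import os
import shutil
import tempfile

# The quantized checkpoint is published under the canonical name
# pytorch_model.bin. The actual upload goes through huggingface_hub:
#   from huggingface_hub import HfApi
#   HfApi().upload_file(
#       path_or_fileobj="llama-7b-w4-g128-awq.pt",  # hypothetical .pt name
#       path_in_repo="pytorch_model.bin",
#       repo_id="your-org/llama-7b-awq",            # hypothetical repo id
#   )
# Below, only the local copy to the canonical filename (stdlib only).
workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "llama-7b-w4-g128-awq.pt")
with open(ckpt, "wb") as f:
    f.write(b"\x00" * 16)  # stand-in for the quantized weights blob
target = os.path.join(workdir, "pytorch_model.bin")
shutil.copyfile(ckpt, target)
print(os.path.basename(target))
```

Uploading under pytorch_model.bin matters because from_pretrained looks for that filename by default; the Hub transparently routes the large binary through LFS.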
Step 5: Verify Hub Repository
After upload, the Hub repository contains config.json (with quantization_config), tokenizer files, and pytorch_model.bin. Users can now load the quantized model with a single from_pretrained call using the HuggingFace transformers library.