Workflow: mit-han-lab llm-awq HuggingFace Model Export
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Model_Distribution, Quantization |
| Last Updated | 2025-04-01 00:00 GMT |
Overview
End-to-end process for converting AWQ-quantized model checkpoints into HuggingFace-compatible format and publishing them to the HuggingFace Hub.
Description
This workflow converts AWQ-quantized weights from the native .pt format into HuggingFace-compatible model repositories. As of transformers 4.34, HuggingFace natively supports loading AWQ-quantized models via the AwqConfig quantization configuration. This workflow creates the proper AwqConfig metadata, attaches it to the model configuration, packages the tokenizer, and uploads the quantized weights to a HuggingFace Hub repository, making the quantized model available for easy download and inference via the standard from_pretrained API.
Usage
Execute this workflow after you have a real-quantized AWQ checkpoint (.pt file) and want to share the quantized model through the HuggingFace Hub or use it with HuggingFace-native inference tools (such as vLLM, TGI, or transformers pipelines) that support the AWQ format.
Execution Steps
Step 1: Authenticate with HuggingFace Hub
Ensure you are logged in to the HuggingFace Hub using the CLI tool. This is required to create repositories and push model files to the Hub.
Key considerations:
- Run huggingface-cli login before executing the conversion script
- Requires a write-access token for the target Hub organization or user account
Step 2: Create AWQ Quantization Config
Construct the AwqConfig object specifying the quantization parameters used during the AWQ process. This includes the bit width (typically 4), group size (typically 128), zero-point setting, backend identifier, and GEMM version. This configuration is attached to the model's config.json so that HuggingFace transformers can automatically detect and load the quantized model correctly.
Key considerations:
- The backend should be set to "llm-awq" to indicate the source quantization library
- The version field specifies the GEMM kernel variant (e.g., "gemv")
- This metadata enables automatic quantized loading via from_pretrained
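As a sketch, the metadata this step produces can be pictured as the dict below. With transformers this object is built via AwqConfig (shown in the comments); the serialized field names here are assumptions based on the "awq" quant_method entry that ends up in config.json:

```python
# With transformers >= 4.34 the equivalent object is built as:
#   from transformers import AwqConfig
#   quant_config = AwqConfig(bits=4, group_size=128, zero_point=True,
#                            version="gemv", backend="llm-awq")
# The serialized config.json entry looks roughly like this (field
# names are assumptions based on the "awq" quant_method):
quantization_config = {
    "quant_method": "awq",   # tells transformers which quantized loader to use
    "bits": 4,               # weight bit width
    "group_size": 128,       # channels per quantization group
    "zero_point": True,      # asymmetric quantization with zero points
    "version": "gemv",       # GEMM kernel variant
    "backend": "llm-awq",    # source quantization library
}
print(quantization_config["quant_method"])
```

This is the metadata that from_pretrained inspects to decide that the weights are AWQ-quantized and which kernel to dispatch to.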
Step 3: Load and Update Model Configuration
Load the AutoConfig from the original (unquantized) model path and attach the AwqConfig as the quantization_config attribute. Also load the tokenizer from the original model. These are pushed to the target Hub repository first, establishing the model card and tokenizer files.
Key considerations:
- The original model path provides the base config.json and tokenizer files
- Only the quantization_config field is added; all other model configuration remains unchanged
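A minimal sketch of this step, assuming the transformers API shown in the comments; since the real calls require the original model files and Hub access, the executable part below simulates the same edit on a raw config.json dict with illustrative values:

```python
import json

# In the real workflow the config and tokenizer come from the original model:
#   config = AutoConfig.from_pretrained(original_model_path)
#   config.quantization_config = quant_config
#   config.push_to_hub(repo_id)
#   tokenizer = AutoTokenizer.from_pretrained(original_model_path)
#   tokenizer.push_to_hub(repo_id)
# Here we apply the equivalent edit to a plain config dict.
base_config = {"model_type": "llama", "hidden_size": 4096}  # illustrative values

# Only quantization_config is added; every other field is left untouched.
base_config["quantization_config"] = {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
}
print(json.dumps(base_config, indent=2))
```

Pushing the config and tokenizer first means the repository already carries valid loading metadata before the large weights file arrives.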
Step 4: Upload Quantized Weights
Upload the quantized weights file (.pt) to the Hub repository as pytorch_model.bin. The HuggingFace API handles the upload, including large file storage (LFS) for the binary weights file.
Key considerations:
- The weights file is uploaded as a single pytorch_model.bin file
- The export runs in single-GPU mode, so the checkpoint stays a single unsharded weights file
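The upload can be sketched as follows. The checkpoint filename and repo id are hypothetical, and the huggingface_hub call is shown only in comments since it needs network access and a write token; the executable part demonstrates just the local rename to the canonical filename:

```python
import os
import shutil
import tempfile

# The quantized checkpoint is published under the canonical name
# pytorch_model.bin. The actual upload goes through huggingface_hub:
#   from huggingface_hub import HfApi
#   HfApi().upload_file(
#       path_or_fileobj="llama-7b-w4-g128-awq.pt",  # hypothetical .pt name
#       path_in_repo="pytorch_model.bin",
#       repo_id="your-org/llama-7b-awq",            # hypothetical repo id
#   )
# Below, only the local copy to the canonical filename (stdlib only).
workdir = tempfile.mkdtemp()
ckpt = os.path.join(workdir, "llama-7b-w4-g128-awq.pt")
with open(ckpt, "wb") as f:
    f.write(b"\x00" * 16)  # stand-in for the quantized weights blob
target = os.path.join(workdir, "pytorch_model.bin")
shutil.copyfile(ckpt, target)
print(os.path.basename(target))
```

Uploading under pytorch_model.bin matters because from_pretrained looks for that filename by default; the Hub transparently routes the large binary through LFS.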
Step 5: Verify Hub Repository
After upload, the Hub repository contains config.json (with quantization_config), tokenizer files, and pytorch_model.bin. Users can now load the quantized model with a single from_pretrained call using the HuggingFace transformers library.