Implementation:Predibase Lorax AWQ Conversion Utils
| Knowledge Sources | |
|---|---|
| Domains | Quantization, Inference |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Provides utility functions for packing, unpacking, and converting 4-bit quantized weight matrices between AWQ and GPTQ formats, enabling AWQ models to use ExLlama GPTQ kernels for inference.
Description
This module implements bitwise operations for manipulating 4-bit integer matrices packed into 32-bit integers, along with format conversion between AWQ and GPTQ packing conventions:
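AWQ_PACK_ORDER and REVERSE_AWQ_PACK_ORDER: Constants defining the order of the 8 elements within a 32-bit packed word in AWQ format. AWQ packs elements in the interleaved order [0, 2, 4, 6, 1, 3, 5, 7]; REVERSE_AWQ_PACK_ORDER is the inverse permutation, [0, 4, 1, 5, 2, 6, 3, 7], which restores sequential order.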
pack: Packs a matrix of 4-bit integers into 32-bit words, 8 values per word, using bitwise left shifts and summation. With "column" packing, each group of 8 adjacent columns in a row is packed into one word, so an (R, C) input becomes (R, C/8). With "row" packing, each group of 8 adjacent rows in a column is packed into one word, so the result is (R/8, C).
unpack: Reverses the packing operation, extracting 4-bit values from 32-bit integers using bitwise right shifts and masking with 0x0F. Supports both column and row directions.
apply_order: Reorders elements within groups of 8 according to a specified order. This is used to convert between AWQ's interleaved packing order and sequential order.
fast_awq_to_gptq: The primary conversion function that transforms AWQ-formatted quantized weights into GPTQ/ExLlama format:
- Unpacks both qweight and qzeros from AWQ column-packed format.
- Applies the reverse AWQ pack order to restore sequential element ordering.
- Subtracts 1 from the unpacked zero points, because the ExLlama/GPTQ kernels add 1 to the stored zeros at inference time.
- Re-packs zeros in column format and weights in row format (the ExLlama convention).
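The four steps above can be sketched without torch on nested Python lists. This is an illustrative sketch under stated assumptions, not the module's code: every helper name here (`unpack_columns`, `pack_columns`, `pack_rows`, `reorder`, `awq_to_gptq_sketch`) is hypothetical, the reverse order is computed as the inverse permutation of AWQ_PACK_ORDER, words are kept as unsigned Python integers rather than int32, and the shifted zeros are assumed to wrap to 4 bits when re-packed.

```python
# Pure-Python sketch on nested lists; the real module operates on torch tensors.
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]
# Inverse permutation of AWQ_PACK_ORDER, computed here rather than copied.
REVERSE_AWQ_PACK_ORDER = sorted(range(8), key=AWQ_PACK_ORDER.__getitem__)


def unpack_word(word):
    """Split one 32-bit word into 8 nibbles, lowest nibble first."""
    return [(word >> (4 * i)) & 0x0F for i in range(8)]


def pack_word(nibbles):
    """Combine 8 nibbles into one unsigned 32-bit word, lowest nibble first."""
    return sum((n & 0x0F) << (4 * i) for i, n in enumerate(nibbles))


def unpack_columns(qmatrix):
    """(R, C) packed words -> (R, 8*C) nibbles."""
    return [[n for word in row for n in unpack_word(word)] for row in qmatrix]


def pack_columns(imatrix):
    """(R, 8*C) nibbles -> (R, C) words, packing 8 adjacent columns."""
    return [[pack_word(row[c:c + 8]) for c in range(0, len(row), 8)]
            for row in imatrix]


def pack_rows(imatrix):
    """(8*R, C) nibbles -> (R, C) words, packing 8 adjacent rows."""
    cols = len(imatrix[0])
    return [[pack_word([imatrix[r + i][c] for i in range(8)]) for c in range(cols)]
            for r in range(0, len(imatrix), 8)]


def reorder(imatrix, order):
    """Within every group of 8 adjacent columns, set new[k] = old[order[k]]."""
    return [[row[g + j] for g in range(0, len(row), 8) for j in order]
            for row in imatrix]


def awq_to_gptq_sketch(qweight, qzeros):
    # Steps 1 and 2: unpack the column-packed inputs and undo the AWQ interleaving.
    iweights = reorder(unpack_columns(qweight), REVERSE_AWQ_PACK_ORDER)
    izeros = reorder(unpack_columns(qzeros), REVERSE_AWQ_PACK_ORDER)
    # Step 3: shift the zero points by -1 (assumed to wrap to 4 bits).
    izeros = [[(z - 1) & 0x0F for z in row] for row in izeros]
    # Step 4: weights are re-packed by row, zeros by column.
    return pack_rows(iweights), pack_columns(izeros)
```

For an 8 x 1 packed weight input, the sketch returns a 1 x 8 packed weight matrix, while the zero-point matrix keeps its packed shape.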
Usage
This module is used during AWQ model loading to convert AWQ-quantized weight tensors into a format compatible with the ExLlama/GPTQ inference kernels, allowing AWQ models to benefit from the same optimized CUDA kernels used for GPTQ inference.
Code Reference
Source Location
- Repository: Predibase_Lorax
- File: server/lorax_server/layers/awq/conversion_utils.py
- Lines: 1-93
Signature
def pack(imatrix: torch.Tensor, direction: str = "column") -> torch.Tensor:
def unpack(qmatrix: torch.Tensor, direction: str = "column") -> torch.Tensor:
def apply_order(imatrix: torch.Tensor, direction: str = "column",
                order: List[int] = AWQ_PACK_ORDER) -> torch.Tensor:
def fast_awq_to_gptq(qweight, qzeros) -> tuple:
Import
from lorax_server.layers.awq.conversion_utils import fast_awq_to_gptq, pack, unpack
I/O Contract
Inputs (fast_awq_to_gptq)
| Name | Type | Required | Description |
|---|---|---|---|
| qweight | torch.Tensor (int32) | Yes | AWQ column-packed quantized weight matrix |
| qzeros | torch.Tensor (int32) | Yes | AWQ column-packed quantized zero points |
Outputs (fast_awq_to_gptq)
| Name | Type | Description |
|---|---|---|
| qweight | torch.Tensor (int32) | GPTQ/ExLlama row-packed quantized weight matrix |
| qzeros | torch.Tensor (int32) | GPTQ/ExLlama column-packed quantized zero points (shifted by -1) |
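The shape relationship implied by this contract can be written down as a small check. The sketch below is an assumption-laden illustration: `expected_gptq_shapes` is a hypothetical helper, not part of the module, and the shapes follow only from the unpack-by-column and re-pack-by-row steps described earlier. The explicit error for a row count that is not a multiple of 8 is a choice made for this sketch.

```python
PACK_FACTOR = 32 // 4  # eight 4-bit values per 32-bit word


def expected_gptq_shapes(qweight_shape, qzeros_shape):
    """Return the (qweight, qzeros) shapes expected after conversion.

    Hypothetical helper: shapes are given as (rows, packed_columns) tuples
    of the AWQ column-packed inputs.
    """
    rows, packed_cols = qweight_shape
    if rows % PACK_FACTOR != 0:
        raise ValueError("row count must be a multiple of 8 for row packing")
    # Column unpacking widens the matrix by 8; row packing shrinks its height by 8.
    gptq_qweight_shape = (rows // PACK_FACTOR, packed_cols * PACK_FACTOR)
    # Zeros are unpacked and re-packed along columns, so their shape is unchanged.
    gptq_qzeros_shape = tuple(qzeros_shape)
    return gptq_qweight_shape, gptq_qzeros_shape
```

Such a check could be run before conversion to catch inputs whose row count cannot be row-packed.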
Inputs (pack)
| Name | Type | Required | Description |
|---|---|---|---|
| imatrix | torch.Tensor | Yes | 4-bit integer matrix to pack |
| direction | str | No | Packing direction: "column" (default) or "row" |
Outputs (pack)
| Name | Type | Description |
|---|---|---|
| qmatrix | torch.Tensor (int32) | Packed 32-bit integer matrix |
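To make the two packing directions concrete, here is a minimal pure-Python sketch of the same idea on nested lists. The names `pack_nibbles`, `unpack_nibbles`, and `pack_sketch` are hypothetical, the words are unsigned Python integers rather than int32, and raising ValueError for an unknown direction is a choice made for this sketch rather than documented behavior of pack.

```python
def pack_nibbles(nibbles):
    """Pack 8 four-bit values into one word; value i occupies bits 4*i to 4*i+3."""
    word = 0
    for i, n in enumerate(nibbles):
        word |= (n & 0x0F) << (4 * i)
    return word


def unpack_nibbles(word):
    """Recover the 8 four-bit values with right shifts and a 0x0F mask."""
    return [(word >> shift) & 0x0F for shift in range(0, 32, 4)]


def pack_sketch(imatrix, direction="column"):
    rows, cols = len(imatrix), len(imatrix[0])
    if direction == "column":
        # 8 adjacent columns of each row share one word: (R, C) -> (R, C/8).
        return [[pack_nibbles(imatrix[r][c:c + 8]) for c in range(0, cols, 8)]
                for r in range(rows)]
    if direction == "row":
        # 8 adjacent rows of each column share one word: (R, C) -> (R/8, C).
        return [[pack_nibbles([imatrix[r + i][c] for i in range(8)])
                 for c in range(cols)]
                for r in range(0, rows, 8)]
    raise ValueError(f"unknown direction: {direction!r}")


# An 8 x 8 matrix whose every row is 0..7.
matrix = [list(range(8)) for _ in range(8)]
column_packed = pack_sketch(matrix, "column")  # 8 rows, 1 word each
row_packed = pack_sketch(matrix, "row")        # 1 row, 8 words
```

The same input therefore yields differently shaped outputs depending on the direction, which is why the conversion must know which convention each kernel expects.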
Usage Examples
# Used during AWQ model loading for format conversion
from lorax_server.layers.awq.conversion_utils import fast_awq_to_gptq

# awq_qweight and awq_qzeros are the int32, column-packed tensors loaded from an AWQ checkpoint
gptq_qweight, gptq_qzeros = fast_awq_to_gptq(awq_qweight, awq_qzeros)