Implementation:Predibase Lorax AWQ Conversion Utils

Knowledge Sources	Predibase_Lorax
Domains	Quantization, Inference
Last Updated	2026-02-08 00:00 GMT

Overview

Provides utility functions for packing, unpacking, and converting 4-bit quantized weight matrices between AWQ and GPTQ formats, enabling AWQ models to use ExLlama GPTQ kernels for inference.

Description

This module implements bitwise operations for manipulating 4-bit integer matrices packed into 32-bit integers, along with format conversion between AWQ and GPTQ packing conventions:

AWQ_PACK_ORDER and REVERSE_AWQ_PACK_ORDER: Constants defining the element order within 32-bit packed words for AWQ format. AWQ uses the interleaved order [0, 2, 4, 6, 1, 3, 5, 7]; REVERSE_AWQ_PACK_ORDER is its inverse permutation, [0, 4, 1, 5, 2, 6, 3, 7], which restores sequential order.

pack: Packs a 4-bit integer matrix into 32-bit integers. Each 32-bit word holds 8 four-bit values. Supports both "column" packing (8 adjacent values within a row share one word, so the column count shrinks by a factor of 8) and "row" packing (8 adjacent values within a column share one word, so the row count shrinks by a factor of 8). Values are combined using bitwise left shifts and summation.
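
As a rough illustration of the packing arithmetic, the sketch below packs eight 4-bit values into a single 32-bit word in plain Python. The helper name pack_word is hypothetical, and the sketch assumes element i occupies bits 4i to 4i+3 (lowest nibble first); the module itself operates on whole torch tensors rather than on single words.

```python
def pack_word(nibbles):
    """Pack eight 4-bit values into one 32-bit word (hypothetical helper)."""
    assert len(nibbles) == 8
    word = 0
    for i, value in enumerate(nibbles):
        # Mask to 4 bits, then shift element i into bits [4*i, 4*i + 4).
        word |= (value & 0x0F) << (4 * i)
    return word


packed = pack_word([1, 2, 3, 4, 5, 6, 7, 8])
print(hex(packed))  # 0x87654321
```

Because each value lands in its own non-overlapping nibble, OR-ing the shifted values gives the same result as summing them.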

unpack: Reverses the packing operation, extracting 4-bit values from 32-bit integers using bitwise right shifts and masking with 0x0F. Supports both column and row directions.
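
The inverse operation can be sketched the same way for a single word. The helper name unpack_word is hypothetical, and the nibble layout (element i in bits 4i to 4i+3) is an assumption carried over from the packing description; this is not the module's tensor implementation.

```python
def unpack_word(word):
    """Extract eight 4-bit values from one 32-bit word (hypothetical helper)."""
    # Shift nibble i down to the low bits, then mask off everything above 4 bits.
    return [(word >> (4 * i)) & 0x0F for i in range(8)]


print(unpack_word(0x87654321))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(unpack_word(-1))          # [15, 15, 15, 15, 15, 15, 15, 15]
```

The second call shows why the 0x0F mask matters: a signed word with its top bit set is sign-extended by the right shift, and the mask discards those extra bits.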

apply_order: Reorders elements within groups of 8 according to a specified order. This is used to convert between AWQ's interleaved packing order and sequential order.
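
To make the reordering concrete, here is a minimal pure-Python sketch that works on flat lists rather than tensors. The helpers invert and reorder_groups are hypothetical names introduced for illustration; the sketch assumes that position i of each output group takes the element at index order[i] of the input group.

```python
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def invert(order):
    """Return the inverse permutation of `order`."""
    inverse = [0] * len(order)
    for position, source in enumerate(order):
        inverse[source] = position
    return inverse


def reorder_groups(values, order):
    """Reorder each consecutive group of 8 so that out[i] = group[order[i]]."""
    out = []
    for start in range(0, len(values), 8):
        group = values[start:start + 8]
        out.extend(group[i] for i in order)
    return out


REVERSE_AWQ_PACK_ORDER = invert(AWQ_PACK_ORDER)
print(REVERSE_AWQ_PACK_ORDER)  # [0, 4, 1, 5, 2, 6, 3, 7]

interleaved = reorder_groups(list(range(8)), AWQ_PACK_ORDER)
print(interleaved)  # [0, 2, 4, 6, 1, 3, 5, 7]
print(reorder_groups(interleaved, REVERSE_AWQ_PACK_ORDER))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Applying the interleaved order and then its inverse returns the original sequence, which is the property the conversion relies on.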

fast_awq_to_gptq: The primary conversion function that transforms AWQ-formatted quantized weights into GPTQ/ExLlama format:

Unpacks both qweight and qzeros from AWQ column-packed format.
Applies the reverse AWQ pack order to restore sequential element ordering.
Subtracts 1 from the zero points to compensate for the GPTQ convention of adding 1 to the stored zeros.
Re-packs zeros in column format and weights in row format (the ExLlama convention).
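
The four steps above can be sketched end to end in plain Python on nested lists. This is an illustrative sketch, not the module's torch implementation: every helper name here is hypothetical, the nibble layout (element i in bits 4i to 4i+3) is assumed, and the way the AWQ-style input is built at the bottom is an assumption made only to produce test data.

```python
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]
REVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]


def pack_word(nibbles):
    # Mask each value to 4 bits and place element i in bits [4*i, 4*i + 4).
    return sum((v & 0x0F) << (4 * i) for i, v in enumerate(nibbles))


def unpack_word(word):
    return [(word >> (4 * i)) & 0x0F for i in range(8)]


def unpack_columns(qmatrix):
    # Each packed word expands into 8 adjacent values in the same row.
    return [[v for word in row for v in unpack_word(word)] for row in qmatrix]


def pack_columns(imatrix):
    return [[pack_word(row[c:c + 8]) for c in range(0, len(row), 8)] for row in imatrix]


def pack_rows(imatrix):
    # Each packed word combines 8 vertically adjacent values from one column.
    n_cols = len(imatrix[0])
    return [
        [pack_word([imatrix[r + k][c] for k in range(8)]) for c in range(n_cols)]
        for r in range(0, len(imatrix), 8)
    ]


def reorder(imatrix, order):
    # Within each group of 8 along a row, output position i takes input index order[i].
    return [[row[g + i] for g in range(0, len(row), 8) for i in order] for row in imatrix]


def awq_to_gptq_sketch(qweight, qzeros):
    iweights = reorder(unpack_columns(qweight), REVERSE_AWQ_PACK_ORDER)  # steps 1-2
    izeros = reorder(unpack_columns(qzeros), REVERSE_AWQ_PACK_ORDER)
    izeros = [[z - 1 for z in row] for row in izeros]                    # step 3
    return pack_rows(iweights), pack_columns(izeros)                     # step 4


# Build a small AWQ-style input: 8x8 logical weights and one row of zero points.
weights = [[(r * 8 + c) % 16 for c in range(8)] for r in range(8)]
zeros = [[1, 2, 3, 4, 5, 6, 7, 8]]
awq_qweight = pack_columns(reorder(weights, AWQ_PACK_ORDER))
awq_qzeros = pack_columns(reorder(zeros, AWQ_PACK_ORDER))

gptq_qweight, gptq_qzeros = awq_to_gptq_sketch(awq_qweight, awq_qzeros)
print(len(gptq_qweight), len(gptq_qweight[0]))  # 1 8
print(hex(gptq_qzeros[0][0]))                   # 0x76543210
```

The packed weight matrix changes shape from 8x1 (column-packed) to 1x8 (row-packed), while the zero points keep their column-packed shape and differ only by the subtraction of 1.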

Usage

This module is used during AWQ model loading to convert AWQ-quantized weight tensors into a format compatible with the ExLlama/GPTQ inference kernels, allowing AWQ models to benefit from the same optimized CUDA kernels used for GPTQ inference.

Code Reference

Source Location

Repository: Predibase_Lorax
File: server/lorax_server/layers/awq/conversion_utils.py
Lines: 1-93

Signature

def pack(imatrix: torch.Tensor, direction: str = "column") -> torch.Tensor:

def unpack(qmatrix: torch.Tensor, direction: str = "column") -> torch.Tensor:

def apply_order(imatrix: torch.Tensor, direction: str = "column",
                order: List[int] = AWQ_PACK_ORDER) -> torch.Tensor:

def fast_awq_to_gptq(qweight, qzeros) -> tuple:

Import

from lorax_server.layers.awq.conversion_utils import fast_awq_to_gptq, pack, unpack

I/O Contract

Inputs (fast_awq_to_gptq)

Name	Type	Required	Description
qweight	torch.Tensor (int32)	Yes	AWQ column-packed quantized weight matrix
qzeros	torch.Tensor (int32)	Yes	AWQ column-packed quantized zero points

Outputs (fast_awq_to_gptq)

Name	Type	Description
qweight	torch.Tensor (int32)	GPTQ/ExLlama row-packed quantized weight matrix
qzeros	torch.Tensor (int32)	GPTQ/ExLlama column-packed quantized zero points (shifted by -1)
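
The difference between the column-packed input and the row-packed output shows up in the tensor shapes. The sketch below only computes shapes; packed_shapes is a hypothetical helper, and the 4096 x 11008 layer size is an arbitrary example rather than a value taken from the source.

```python
PACK_FACTOR = 32 // 4  # eight 4-bit values per 32-bit word


def packed_shapes(rows, cols):
    """Packed shapes for an unpacked (rows, cols) matrix of 4-bit values."""
    return {
        "column": (rows, cols // PACK_FACTOR),  # 8 values along each row share a word
        "row": (rows // PACK_FACTOR, cols),     # 8 values down each column share a word
    }


# Hypothetical layer: 4096 x 11008 unpacked weights.
shapes = packed_shapes(4096, 11008)
print(shapes["column"])  # (4096, 1376)  column-packed, as in the AWQ input qweight
print(shapes["row"])     # (512, 11008)  row-packed, as in the GPTQ/ExLlama output qweight
```

Both layouts hold the same number of 32-bit words; only the axis along which values are grouped changes.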

Inputs (pack)

Name	Type	Required	Description
imatrix	torch.Tensor	Yes	4-bit integer matrix to pack
direction	str	No	Packing direction: "column" (default) or "row"

Outputs (pack)

Name	Type	Description
qmatrix	torch.Tensor (int32)	Packed 32-bit integer matrix

Usage Examples

# Used during AWQ model loading for format conversion.
# awq_qweight and awq_qzeros are the int32 tensors read from the AWQ checkpoint.
from lorax_server.layers.awq.conversion_utils import fast_awq_to_gptq

gptq_qweight, gptq_qzeros = fast_awq_to_gptq(awq_qweight, awq_qzeros)

Related Pages

Environment:Predibase_Lorax_CUDA_GPU_Runtime
