Implementation:Predibase Lorax AWQ Conversion Utils

From Leeroopedia


Knowledge Sources
Domains: Quantization, Inference
Last Updated: 2026-02-08 00:00 GMT

Overview

Provides utility functions for packing, unpacking, and converting 4-bit quantized weight matrices between AWQ and GPTQ formats, enabling AWQ models to use ExLlama GPTQ kernels for inference.

Description

This module implements bitwise operations for manipulating 4-bit integer matrices packed into 32-bit integers, along with format conversion between AWQ and GPTQ packing conventions:

AWQ_PACK_ORDER and REVERSE_AWQ_PACK_ORDER: Constants defining the element order within 32-bit packed words for AWQ format. AWQ uses the non-sequential order [0, 2, 4, 6, 1, 3, 5, 7]; REVERSE_AWQ_PACK_ORDER is its inverse permutation, [0, 4, 1, 5, 2, 6, 3, 7], which restores sequential order.
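As a quick check of how the two constants relate, the following pure-Python sketch derives the reverse order as the inverse permutation of AWQ_PACK_ORDER and applies both to eight labeled slots. The `reorder` helper is a hypothetical name used only for illustration, and the assumption that the module's REVERSE_AWQ_PACK_ORDER holds this inverse is marked in the code.

```python
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]

# Inverse permutation: entry j is the position at which j appears in AWQ_PACK_ORDER.
# Assumption: REVERSE_AWQ_PACK_ORDER in the module holds exactly this inverse.
REVERSE_AWQ_PACK_ORDER = [AWQ_PACK_ORDER.index(j) for j in range(8)]


def reorder(values, order):
    """Hypothetical helper: output slot i takes values[order[i]]."""
    return [values[i] for i in order]


logical = list("abcdefgh")                       # eight labeled 4-bit slots
awq_layout = reorder(logical, AWQ_PACK_ORDER)    # interleaved AWQ layout
restored = reorder(awq_layout, REVERSE_AWQ_PACK_ORDER)

print(REVERSE_AWQ_PACK_ORDER)  # [0, 4, 1, 5, 2, 6, 3, 7]
print("".join(awq_layout))     # acegbdfh
print("".join(restored))       # abcdefgh
```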

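pack: Packs a 4-bit integer matrix into 32-bit integers, with each 32-bit word holding 8 four-bit values. "column" packing combines groups of 8 consecutive values along the column dimension, so the packed matrix has one eighth as many columns; "row" packing combines groups of 8 consecutive values along the row dimension, so it has one eighth as many rows. In both cases the values are combined using bitwise left shifts and summation.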
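The packing scheme can be illustrated without torch. The sketch below works on nested Python lists and assumes that the value in slot i of a group is shifted left by 4*i bits (slot 0 in the lowest bits). The names `pack_word`, `pack_column`, and `pack_row` are hypothetical and are not the module's API; the module itself operates on torch tensors.

```python
def pack_word(nibbles):
    """Pack eight 4-bit values into one 32-bit word, slot 0 in the lowest bits."""
    assert len(nibbles) == 8
    return sum((v & 0x0F) << (4 * slot) for slot, v in enumerate(nibbles))


def pack_column(matrix):
    """Column packing: 8 consecutive entries of each row share one word."""
    return [[pack_word(row[c:c + 8]) for c in range(0, len(row), 8)]
            for row in matrix]


def pack_row(matrix):
    """Row packing: 8 consecutive entries of each column share one word."""
    return [[pack_word([matrix[r + k][c] for k in range(8)])
             for c in range(len(matrix[0]))]
            for r in range(0, len(matrix), 8)]


matrix = [[(r + c) % 16 for c in range(16)] for r in range(8)]  # 8 x 16 nibbles
by_column = pack_column(matrix)   # 8 x 2 words
by_row = pack_row(matrix)         # 1 x 16 words

print(hex(pack_word([1, 2, 3, 4, 5, 6, 7, 8])))  # 0x87654321
print(len(by_column), len(by_column[0]))         # 8 2
print(len(by_row), len(by_row[0]))               # 1 16
```

Because each value is masked to 4 bits before shifting, the shifted values never overlap, which is why summation gives the same result as a bitwise OR.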

unpack: Reverses the packing operation, extracting 4-bit values from 32-bit integers using bitwise right shifts and masking with 0x0F. Supports both column and row directions.
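A single-word sketch of the shift-and-mask step, in plain Python rather than torch. `unpack_word` is a hypothetical helper, and the lowest-bits-first slot order is an assumption. The example also shows why the 0x0F mask matters: a packed int32 word whose top nibble is 8 or more is negative, and right-shifting a negative integer fills with sign bits.

```python
def unpack_word(word):
    """Extract eight 4-bit values from a 32-bit word, lowest bits first."""
    return [(word >> (4 * slot)) & 0x0F for slot in range(8)]


unsigned_word = 0x87654321
# The same bit pattern read as a signed int32, as an int32 tensor would hold it.
signed_word = unsigned_word - (1 << 32)

print(unpack_word(unsigned_word))  # [1, 2, 3, 4, 5, 6, 7, 8]
print(unpack_word(signed_word))    # [1, 2, 3, 4, 5, 6, 7, 8]
```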

apply_order: Reorders elements within groups of 8 according to a specified order. This is used to convert between AWQ's interleaved packing order and sequential order.
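To make the group-wise reordering concrete, here is a minimal list-based sketch for the column direction only. `apply_order_to_row` is a hypothetical stand-in for what apply_order does to one row of an unpacked matrix; it is not the module's code.

```python
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def apply_order_to_row(row, order):
    """Hypothetical helper: permute each consecutive group of 8 entries by `order`."""
    assert len(row) % 8 == 0
    return [row[start + i] for start in range(0, len(row), 8) for i in order]


row = list(range(16))  # two groups of 8
print(apply_order_to_row(row, AWQ_PACK_ORDER))
# [0, 2, 4, 6, 1, 3, 5, 7, 8, 10, 12, 14, 9, 11, 13, 15]
```

Each group of 8 is permuted independently, so the reordering never moves a value across a 32-bit word boundary.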

fast_awq_to_gptq: The primary conversion function that transforms AWQ-formatted quantized weights into GPTQ/ExLlama format:

  1. Unpacks both qweight and qzeros from AWQ column-packed format.
  2. Applies the reverse AWQ pack order to restore sequential element ordering.
  3. Subtracts 1 from the unpacked zeros to match the GPTQ convention, which adds 1 to the zeros.
  4. Re-packs zeros in column format and weights in row format (the ExLlama convention).
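The four steps above can be sketched end to end on nested Python lists. This is a pure-Python illustration under stated assumptions, not the module's torch implementation: all helper names are hypothetical, slot 0 is assumed to occupy the lowest 4 bits of each word, and words are kept as unsigned Python integers rather than int32.

```python
AWQ_PACK_ORDER = [0, 2, 4, 6, 1, 3, 5, 7]
REVERSE_AWQ_PACK_ORDER = [0, 4, 1, 5, 2, 6, 3, 7]


def unpack_columns(qmatrix):
    """Expand every word into 8 nibbles along the row."""
    return [[(w >> (4 * s)) & 0x0F for w in row for s in range(8)]
            for row in qmatrix]


def pack_columns(imatrix):
    """Pack 8 consecutive entries of each row into one word."""
    return [[sum((row[c + s] & 0x0F) << (4 * s) for s in range(8))
             for c in range(0, len(row), 8)] for row in imatrix]


def pack_rows(imatrix):
    """Pack 8 consecutive entries of each column into one word."""
    return [[sum((imatrix[r + s][c] & 0x0F) << (4 * s) for s in range(8))
             for c in range(len(imatrix[0]))]
            for r in range(0, len(imatrix), 8)]


def reorder_groups(imatrix, order):
    """Permute each group of 8 entries within every row."""
    return [[row[g + i] for g in range(0, len(row), 8) for i in order]
            for row in imatrix]


def awq_to_gptq_sketch(qweight, qzeros):
    # Steps 1 and 2: unpack column-packed words, then undo the AWQ interleaving.
    iweights = reorder_groups(unpack_columns(qweight), REVERSE_AWQ_PACK_ORDER)
    izeros = reorder_groups(unpack_columns(qzeros), REVERSE_AWQ_PACK_ORDER)
    # Step 3: shift zeros by -1 (pack_columns masks the result back to 4 bits).
    izeros = [[z - 1 for z in row] for row in izeros]
    # Step 4: weights are row-packed, zeros stay column-packed.
    return pack_rows(iweights), pack_columns(izeros)


# Build a tiny AWQ-style input from known 4-bit values (8 x 8 weights, 1 x 8 zeros).
weights = [[(3 * r + c) % 16 for c in range(8)] for r in range(8)]
zeros = [[8, 0, 1, 2, 3, 4, 5, 15]]
awq_qweight = pack_columns(reorder_groups(weights, AWQ_PACK_ORDER))
awq_qzeros = pack_columns(reorder_groups(zeros, AWQ_PACK_ORDER))

gptq_qweight, gptq_qzeros = awq_to_gptq_sketch(awq_qweight, awq_qzeros)
print(len(awq_qweight), len(awq_qweight[0]))    # 8 1
print(len(gptq_qweight), len(gptq_qweight[0]))  # 1 8
```

Note that a zero point of 0 wraps to 15 after the subtraction and 4-bit masking in this sketch.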

Usage

This module is used during AWQ model loading to convert AWQ-quantized weight tensors into a format compatible with the ExLlama/GPTQ inference kernels, allowing AWQ models to benefit from the same optimized CUDA kernels used for GPTQ inference.

Code Reference

Source Location

  • Repository: Predibase_Lorax
  • File: server/lorax_server/layers/awq/conversion_utils.py
  • Lines: 1-93

Signature

def pack(imatrix: torch.Tensor, direction: str = "column") -> torch.Tensor:

def unpack(qmatrix: torch.Tensor, direction: str = "column") -> torch.Tensor:

def apply_order(imatrix: torch.Tensor, direction: str = "column",
                order: List[int] = AWQ_PACK_ORDER) -> torch.Tensor:

def fast_awq_to_gptq(qweight, qzeros) -> tuple:

Import

from lorax_server.layers.awq.conversion_utils import fast_awq_to_gptq, pack, unpack

I/O Contract

Inputs (fast_awq_to_gptq)

| Name | Type | Required | Description |
|---|---|---|---|
| qweight | torch.Tensor (int32) | Yes | AWQ column-packed quantized weight matrix |
| qzeros | torch.Tensor (int32) | Yes | AWQ column-packed quantized zero points |

Outputs (fast_awq_to_gptq)

| Name | Type | Description |
|---|---|---|
| qweight | torch.Tensor (int32) | GPTQ/ExLlama row-packed quantized weight matrix |
| qzeros | torch.Tensor (int32) | GPTQ/ExLlama column-packed quantized zero points (shifted by -1) |
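The change in packing direction implies a change in tensor shape for qweight but not for qzeros. The helper below is a hypothetical sketch that derives the expected output shapes from the packed input shapes, assuming 8 four-bit values per 32-bit word as described above; the example shapes are illustrative and not taken from any particular model.

```python
def converted_shapes(qweight_shape, qzeros_shape):
    """Hypothetical helper: shapes after column-unpacking and re-packing.

    qweight: (rows, cols / 8) column-packed -> (rows / 8, cols) row-packed.
    qzeros:  column-packed on both sides, so the shape is unchanged.
    """
    rows, packed_cols = qweight_shape
    if rows % 8 != 0:
        raise ValueError("row count must be a multiple of 8 for row packing")
    return (rows // 8, packed_cols * 8), tuple(qzeros_shape)


print(converted_shapes((4096, 512), (32, 512)))
# ((512, 4096), (32, 512))
```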

Inputs (pack)

| Name | Type | Required | Description |
|---|---|---|---|
| imatrix | torch.Tensor | Yes | 4-bit integer matrix to pack |
| direction | str | No | Packing direction: "column" (default) or "row" |

Outputs (pack)

| Name | Type | Description |
|---|---|---|
| qmatrix | torch.Tensor (int32) | Packed 32-bit integer matrix |

Usage Examples

# Used during AWQ model loading for format conversion
from lorax_server.layers.awq.conversion_utils import fast_awq_to_gptq

# awq_qweight and awq_qzeros: int32, column-packed tensors loaded from an AWQ checkpoint
gptq_qweight, gptq_qzeros = fast_awq_to_gptq(awq_qweight, awq_qzeros)
