Workflow:InternLM Lmdeploy W4A16 AWQ Quantization

Knowledge Sources	LMDeploy LMDeploy Docs W4A16 Guide AWQ
Domains	LLM_Ops, Quantization, Inference
Last Updated	2026-02-07 15:00 GMT

Overview

End-to-end process for applying 4-bit weight quantization (W4A16) using the AWQ algorithm to compress LLMs for memory-efficient inference with LMDeploy's TurboMind engine.

Description

This workflow covers the complete pipeline from selecting a pre-trained model to quantizing its weights to 4-bit integers using the Activation-aware Weight Quantization (AWQ) algorithm, then deploying the quantized model for inference or serving. AWQ preserves important weight channels by analyzing activation distributions on a calibration dataset, achieving significant memory reduction (approximately 4x) with minimal accuracy loss. The quantized model runs on the TurboMind backend which provides optimized INT4 GEMM kernels for NVIDIA GPUs from Volta (V100) to Ada Lovelace (RTX 40-series).

Usage

Execute this workflow when you need to deploy a large model on hardware with limited GPU memory, or when you want to increase inference throughput by reducing memory bandwidth requirements. Typical scenarios include deploying 7B+ models on consumer GPUs (e.g., RTX 3090 with 24GB), serving larger models (13B-70B) on datacenter GPUs, or reducing serving costs by fitting more concurrent requests in the same GPU memory.

Execution Steps

Step 1: Environment Setup

Install LMDeploy with all dependencies. The quantization module requires access to the full model weights (FP16/BF16) and sufficient CPU/GPU memory for the calibration process. Ensure CUDA compatibility with the target GPU architecture (sm70 through sm89).

Key considerations:

Quantization requires loading the full model, so ensure sufficient memory
The calibration process is GPU-intensive; reduce --calib-seqlen or increase --calib-samples if OOM occurs
Set --batch-size to 1 if GPU memory is limited during calibration

Step 2: Calibration Dataset Preparation

The AWQ algorithm requires a calibration dataset to determine which weight channels are most important based on activation magnitudes. LMDeploy uses WikiText-2 by default with 128 samples and sequence length of 2048. Custom calibration datasets can be specified for domain-specific quantization.

Key considerations:

Default calibration dataset (wikitext2) works well for general-purpose models
Calibration samples and sequence length affect quantization quality vs. speed
More calibration samples generally improve quantization quality

Step 3: Weight Quantization

Run the AWQ quantization using the lmdeploy lite auto_awq CLI command. This analyzes activation patterns across calibration samples, identifies salient weight channels, applies per-group (default group_size=128) 4-bit quantization with scale factors, and saves the quantized model weights and configuration to the output directory.

What happens:

Model weights are loaded in full precision
Calibration data is passed through the model to collect activation statistics
AWQ determines per-channel importance scores from activation magnitudes
Weights are quantized to 4-bit with per-group scale factors (128 elements per group)
Quantized weights and metadata are saved to the work directory
Optionally, --search-scale enables scale search for improved accuracy (slower)

Step 4: Quantization Validation

Verify the quantized model by running a quick chat session using lmdeploy chat with the quantized model directory. Check that responses are coherent and comparable to the original model. For rigorous evaluation, use OpenCompass benchmarks to measure accuracy degradation.

Key considerations:

Quick validation via interactive chat catches obvious quality issues
Formal evaluation with OpenCompass provides quantitative accuracy metrics
If quality is poor, re-quantize with --search-scale enabled and larger --batch-size

Step 5: Quantized Model Inference

Deploy the quantized model using the LMDeploy pipeline API or API server. Configure the engine with model_format='awq' to activate the INT4 GEMM kernels in TurboMind. The quantized model consumes approximately 4x less GPU memory for weights, enabling larger batch sizes or deployment on smaller GPUs.

Key considerations:

Set model_format='awq' in TurbomindEngineConfig for local quantized models
Pre-quantized models from HuggingFace Hub (lmdeploy or TheBloke spaces) are auto-detected
INT4 inference achieves up to 2.4x throughput improvement over FP16 on supported GPUs

Step 6: Quantized Model Serving

Optionally deploy the quantized model as an OpenAI-compatible API server using the lmdeploy serve command with --model-format awq. This combines the memory savings of quantization with the serving infrastructure for production use.

Key considerations:

Pass --backend turbomind --model-format awq to the serve command
All standard API server features (streaming, multi-turn, function calling) work with quantized models
Tensor parallelism is supported for multi-GPU serving of quantized models

Execution Diagram

GitHub URL

Workflow Repository