Workflow:InternLM Lmdeploy W4A16 AWQ Quantization
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Quantization, Inference |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
End-to-end process for applying 4-bit weight quantization (W4A16) using the AWQ algorithm to compress LLMs for memory-efficient inference with LMDeploy's TurboMind engine.
Description
This workflow covers the complete pipeline from selecting a pre-trained model to quantizing its weights to 4-bit integers using the Activation-aware Weight Quantization (AWQ) algorithm, then deploying the quantized model for inference or serving. AWQ preserves important weight channels by analyzing activation distributions on a calibration dataset, achieving significant memory reduction (approximately 4x) with minimal accuracy loss. The quantized model runs on the TurboMind backend which provides optimized INT4 GEMM kernels for NVIDIA GPUs from Volta (V100) to Ada Lovelace (RTX 40-series).
Usage
Execute this workflow when you need to deploy a large model on hardware with limited GPU memory, or when you want to increase inference throughput by reducing memory bandwidth requirements. Typical scenarios include deploying 7B+ models on consumer GPUs (e.g., RTX 3090 with 24GB), serving larger models (13B-70B) on datacenter GPUs, or reducing serving costs by fitting more concurrent requests in the same GPU memory.
Execution Steps
Step 1: Environment Setup
Install LMDeploy with all dependencies. The quantization module requires access to the full model weights (FP16/BF16) and sufficient CPU/GPU memory for the calibration process. Ensure CUDA compatibility with the target GPU architecture (sm70 through sm89).
Key considerations:
- Quantization requires loading the full model, so ensure sufficient memory
- The calibration process is GPU-intensive; reduce --calib-seqlen or increase --calib-samples if OOM occurs
- Set --batch-size to 1 if GPU memory is limited during calibration
Step 2: Calibration Dataset Preparation
The AWQ algorithm requires a calibration dataset to determine which weight channels are most important based on activation magnitudes. LMDeploy uses WikiText-2 by default with 128 samples and sequence length of 2048. Custom calibration datasets can be specified for domain-specific quantization.
Key considerations:
- Default calibration dataset (wikitext2) works well for general-purpose models
- Calibration samples and sequence length affect quantization quality vs. speed
- More calibration samples generally improve quantization quality
Step 3: Weight Quantization
Run the AWQ quantization using the lmdeploy lite auto_awq CLI command. This analyzes activation patterns across calibration samples, identifies salient weight channels, applies per-group (default group_size=128) 4-bit quantization with scale factors, and saves the quantized model weights and configuration to the output directory.
What happens:
- Model weights are loaded in full precision
- Calibration data is passed through the model to collect activation statistics
- AWQ determines per-channel importance scores from activation magnitudes
- Weights are quantized to 4-bit with per-group scale factors (128 elements per group)
- Quantized weights and metadata are saved to the work directory
- Optionally, --search-scale enables scale search for improved accuracy (slower)
Step 4: Quantization Validation
Verify the quantized model by running a quick chat session using lmdeploy chat with the quantized model directory. Check that responses are coherent and comparable to the original model. For rigorous evaluation, use OpenCompass benchmarks to measure accuracy degradation.
Key considerations:
- Quick validation via interactive chat catches obvious quality issues
- Formal evaluation with OpenCompass provides quantitative accuracy metrics
- If quality is poor, re-quantize with --search-scale enabled and larger --batch-size
Step 5: Quantized Model Inference
Deploy the quantized model using the LMDeploy pipeline API or API server. Configure the engine with model_format='awq' to activate the INT4 GEMM kernels in TurboMind. The quantized model consumes approximately 4x less GPU memory for weights, enabling larger batch sizes or deployment on smaller GPUs.
Key considerations:
- Set model_format='awq' in TurbomindEngineConfig for local quantized models
- Pre-quantized models from HuggingFace Hub (lmdeploy or TheBloke spaces) are auto-detected
- INT4 inference achieves up to 2.4x throughput improvement over FP16 on supported GPUs
Step 6: Quantized Model Serving
Optionally deploy the quantized model as an OpenAI-compatible API server using the lmdeploy serve command with --model-format awq. This combines the memory savings of quantization with the serving infrastructure for production use.
Key considerations:
- Pass --backend turbomind --model-format awq to the serve command
- All standard API server features (streaming, multi-turn, function calling) work with quantized models
- Tensor parallelism is supported for multi-GPU serving of quantized models