
Principle:InternLM Lmdeploy Quantized Model Inference

From Leeroopedia


Knowledge Sources
Domains LLM_Inference, Quantization
Last Updated 2026-02-07 15:00 GMT

Overview

An inference pattern for running quantized (AWQ/GPTQ) models through the TurboMind backend with explicit model format specification.

Description

Quantized Model Inference is the standard pattern for deploying AWQ or GPTQ quantized models. The critical difference from full-precision inference is that the engine must be told the weight format via the model_format parameter in TurbomindEngineConfig.

The TurboMind backend dispatches optimized INT4 GEMM kernels for AWQ/GPTQ weights, cutting weight memory roughly 4x versus FP16 and improving throughput on memory-bandwidth-bound decoding. The inference API remains identical to full-precision models, so no code changes are needed beyond the engine configuration.
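The memory saving comes from storing each weight in 4 bits with a per-group scale and zero point. The following is a minimal sketch of group-wise INT4 quantization in the AWQ/GPTQ style; it is illustrative only and not LMDeploy's actual kernel code (function names and the group size are assumptions for the example).

```python
def quantize_group(weights, bits=4):
    """Quantize one group of float weights to unsigned INT4,
    sharing a single scale and zero point across the group."""
    qmax = (1 << bits) - 1                      # 15 for INT4
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / qmax or 1.0       # avoid zero scale
    zero = round(-w_min / scale)
    q = [max(0, min(qmax, round(w / scale) + zero)) for w in weights]
    return q, scale, zero

def dequantize_group(q, scale, zero):
    """Recover approximate float weights: w ~= scale * (q - zero)."""
    return [scale * (qi - zero) for qi in q]

weights = [0.12, -0.05, 0.31, -0.22]
q, scale, zero = quantize_group(weights)
recovered = dequantize_group(q, scale, zero)
```

Each recovered weight differs from the original by at most one quantization step (`scale`), which is why group sizes are kept small (typically 128) in practice.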

Usage

Use this when deploying AWQ or GPTQ quantized models for inference. Set model_format='awq' or model_format='gptq' in TurbomindEngineConfig. The rest of the pipeline API is unchanged.

Theoretical Basis

Quantized inference uses integer arithmetic for the compute-bound GEMM operations:

# Quantized inference via the TurboMind backend
# (the model path and prompt below are placeholders)
from lmdeploy import pipeline, TurbomindEngineConfig

config = TurbomindEngineConfig(model_format='awq')  # or 'gptq'
pipe = pipeline('internlm/internlm2-chat-7b-4bit', backend_config=config)
# TurboMind selects INT4 GEMM kernels internally;
# the call is identical to full-precision inference
response = pipe(['Explain weight-only quantization.'])
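The same `model_format` setting applies when serving a quantized model over HTTP. A sketch using LMDeploy's standard CLI (the model path and port are placeholders):

```shell
# Serve an AWQ-quantized model behind an OpenAI-compatible API
lmdeploy serve api_server ./internlm2-chat-7b-4bit \
    --backend turbomind \
    --model-format awq \
    --server-port 23333
```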

Related Pages

Implemented By
