Principle: InternLM LMDeploy PyTorch Engine Configuration
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Configuration |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A configuration pattern that parameterizes the PyTorch inference backend with support for data parallelism, expert parallelism, LoRA adapters, and multi-platform deployment.
Description
PyTorch Engine Configuration extends the engine configuration principle for the PyTorch-based inference backend. It supports features not available in TurboMind:
- Data Parallelism (dp) for throughput scaling across GPU groups
- Expert Parallelism (ep) for Mixture-of-Experts models
- LoRA adapter serving (adapters) for hosting multiple fine-tuned variants of one base model
- Multi-platform support (device_type: cuda, ascend, maca, camb) for non-NVIDIA hardware
- Disaggregated serving (role: Hybrid, Prefill, Decode) for prefill-decode separation
This is the required backend for SmoothQuant (W8A8) quantized models and models with architectures not yet supported by TurboMind.
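The features above map directly onto fields of `PytorchEngineConfig`. A minimal construction sketch (the class and field names follow the lmdeploy API, but the adapter names and paths are illustrative placeholders; check them against your installed version):

```python
from lmdeploy import PytorchEngineConfig

# Serve two LoRA fine-tunes of one base model
# (adapter names and paths are hypothetical examples)
lora_config = PytorchEngineConfig(
    tp=1,
    adapters={
        "chat-v1": "/models/lora/chat-v1",
        "code-v1": "/models/lora/code-v1",
    },
)

# Target Huawei Ascend NPUs instead of CUDA GPUs
ascend_config = PytorchEngineConfig(device_type="ascend")
```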
Usage
Use this configuration when deploying models on non-NVIDIA hardware, when using SmoothQuant quantization, when serving LoRA adapters, or when the model architecture is only supported by the PyTorch backend. Also required for data-parallel deployments.
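In practice the configuration is passed to `pipeline` as `backend_config`, which selects the PyTorch backend. A hedged sketch, assuming lmdeploy is installed and using a hypothetical W8A8 model path for illustration:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# SmoothQuant (W8A8) checkpoints require the PyTorch backend;
# the model path below is a hypothetical example
pipe = pipeline(
    "/models/internlm2-chat-7b-w8a8",
    backend_config=PytorchEngineConfig(tp=1),
)
responses = pipe(["Summarize the PyTorch engine configuration."])
```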
Theoretical Basis
The PyTorch backend configuration extends the base engine configuration with additional parallelism dimensions:
- Tensor Parallelism (TP): Splits individual layers across GPUs
- Data Parallelism (DP): Replicates model across GPU groups for throughput
- Expert Parallelism (EP): Distributes MoE experts across GPUs
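The interaction between these dimensions can be sketched with a toy world-size calculation (a simplified model, not lmdeploy's actual scheduler; the rules that total GPUs = dp × tp and that ep must evenly divide the world size reflect common practice and should be verified against the backend):

```python
def world_size(tp: int, dp: int) -> int:
    """Total GPUs: each of the dp replicas is sharded across tp devices."""
    return tp * dp

def validate_parallelism(tp: int, dp: int, ep: int) -> int:
    """Toy check: expert parallelism must fit inside the GPU pool."""
    ws = world_size(tp, dp)
    if ep > ws:
        raise ValueError(f"ep={ep} exceeds world size {ws}")
    if ws % ep != 0:
        raise ValueError(f"ep={ep} must divide world size {ws}")
    return ws

# tp=2 within each replica, dp=2 replicas -> 4 GPUs;
# ep=4 spreads MoE experts across all of them
validate_parallelism(tp=2, dp=2, ep=4)
```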
Pseudo-code:

```python
# Abstract parallelism strategy
if model.is_moe:
    # tp=2 per replica x dp=2 replicas = 4 GPUs; ep=4 spreads experts across them
    config = PytorchEngineConfig(tp=2, dp=2, ep=4)
elif need_lora:
    config = PytorchEngineConfig(tp=N, adapters={"adapter1": "/path"})
elif target_device != "cuda":
    config = PytorchEngineConfig(device_type=target_device)
```