Principle: InternLM LMDeploy PyTorch Engine Configuration
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Configuration |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A configuration pattern that parameterizes the PyTorch inference backend with support for data parallelism, expert parallelism, LoRA adapters, and multi-platform deployment.
Description
PyTorch Engine Configuration extends the engine configuration principle for the PyTorch-based inference backend. It supports features not available in TurboMind:
- Data Parallelism (dp) for throughput scaling across GPU groups
- Expert Parallelism (ep) for Mixture-of-Experts models
- LoRA adapter serving (adapters) for hosting multiple fine-tuned variants of one base model
- Multi-platform support (device_type: cuda, ascend, maca, camb) for non-NVIDIA hardware
- Disaggregated serving (role: Hybrid, Prefill, Decode) for prefill-decode separation
This is the required backend for SmoothQuant (W8A8) quantized models and models with architectures not yet supported by TurboMind.
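The features above map directly onto fields of `PytorchEngineConfig`. A minimal construction sketch (the class and field names follow the lmdeploy API, but the adapter names and paths are illustrative placeholders; check them against your installed version):

```python
from lmdeploy import PytorchEngineConfig

# Serve two LoRA fine-tunes of one base model
# (adapter names and paths are hypothetical examples)
lora_config = PytorchEngineConfig(
    tp=1,
    adapters={
        "chat-v1": "/models/lora/chat-v1",
        "code-v1": "/models/lora/code-v1",
    },
)

# Target Huawei Ascend NPUs instead of CUDA GPUs
ascend_config = PytorchEngineConfig(device_type="ascend")
```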
Usage
Use this configuration when deploying models on non-NVIDIA hardware, when using SmoothQuant quantization, when serving LoRA adapters, or when the model architecture is only supported by the PyTorch backend. Also required for data-parallel deployments.
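In practice the configuration is passed to `pipeline` as `backend_config`, which selects the PyTorch backend. A hedged sketch, assuming lmdeploy is installed and using a hypothetical W8A8 model path for illustration:

```python
from lmdeploy import pipeline, PytorchEngineConfig

# SmoothQuant (W8A8) checkpoints require the PyTorch backend;
# the model path below is a hypothetical example
pipe = pipeline(
    "/models/internlm2-chat-7b-w8a8",
    backend_config=PytorchEngineConfig(tp=1),
)
responses = pipe(["Summarize the PyTorch engine configuration."])
```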
Theoretical Basis
The PyTorch backend configuration extends the base engine configuration with additional parallelism dimensions:
- Tensor Parallelism (TP): Splits individual layers across GPUs
- Data Parallelism (DP): Replicates model across GPU groups for throughput
- Expert Parallelism (EP): Distributes MoE experts across GPUs
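The interaction between these dimensions can be sketched with a toy world-size calculation (a simplified model, not lmdeploy's actual scheduler; the rules that total GPUs = dp × tp and that ep must evenly divide the world size reflect common practice and should be verified against the backend):

```python
def world_size(tp: int, dp: int) -> int:
    """Total GPUs: each of the dp replicas is sharded across tp devices."""
    return tp * dp

def validate_parallelism(tp: int, dp: int, ep: int) -> int:
    """Toy check: expert parallelism must fit inside the GPU pool."""
    ws = world_size(tp, dp)
    if ep > ws:
        raise ValueError(f"ep={ep} exceeds world size {ws}")
    if ws % ep != 0:
        raise ValueError(f"ep={ep} must divide world size {ws}")
    return ws

# tp=2 within each replica, dp=2 replicas -> 4 GPUs;
# ep=4 spreads MoE experts across all of them
validate_parallelism(tp=2, dp=2, ep=4)
```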
Pseudo-code:

```python
# Abstract parallelism strategy
if model.is_moe:
    # tp=2 per replica x dp=2 replicas = 4 GPUs; ep=4 spreads experts across them
    config = PytorchEngineConfig(tp=2, dp=2, ep=4)
elif need_lora:
    config = PytorchEngineConfig(tp=N, adapters={"adapter1": "/path"})
elif target_device != "cuda":
    config = PytorchEngineConfig(device_type=target_device)
```