Workflow:InternLM Lmdeploy LLM Offline Batch Inference

Knowledge Sources	LMDeploy LMDeploy Docs Pipeline Guide
Domains	LLM_Ops, Inference
Last Updated	2026-02-07 15:00 GMT

Overview

End-to-end process for performing offline batch inference on Large Language Models using the LMDeploy Python pipeline API.

Description

This workflow covers the standard procedure for loading a pre-trained LLM (from HuggingFace Hub, ModelScope, or a local path) and generating text completions for one or more prompts using LMDeploy's high-performance inference pipeline. The pipeline automatically selects the optimal backend (TurboMind or PyTorch), manages KV cache allocation, applies chat templates, and supports features such as tensor parallelism, streaming output, and sampling parameter customization. The output is a list of generated responses with associated metadata (token counts, finish reason).

Usage

Execute this workflow when you have one or more text prompts and need to generate LLM completions locally without deploying an API server. Typical scenarios include batch processing datasets, running evaluations, prototyping prompt templates, or integrating LLM inference into offline data pipelines. Requires a GPU with sufficient VRAM for the target model (e.g., 16GB+ for 7B models at FP16, less with quantization).

Execution Steps

Step 1: Environment Setup

Install LMDeploy in a conda environment with Python 3.10-3.13. The package is available via pip and ships with pre-built CUDA 12 binaries. For RTX 50-series GPUs, install the CUDA 12.8 variant. Verify the installation by importing lmdeploy.

Key considerations:

Ensure CUDA toolkit compatibility with your GPU
Create an isolated conda environment to avoid dependency conflicts
For PyTorch backend usage, also install triton>=2.1.0

Step 2: Model Selection and Configuration

Choose the target model by specifying a HuggingFace model ID (e.g., internlm/internlm3-8b-instruct), a ModelScope model ID, or a local directory path. Optionally configure the inference backend by creating a TurbomindEngineConfig or PytorchEngineConfig with parameters such as tensor parallelism degree, session length, KV cache memory ratio, and maximum batch size.

Key considerations:

LMDeploy auto-selects TurboMind backend by default when the model architecture is supported
The cache_max_entry_count parameter (default 0.8) controls KV cache GPU memory allocation from free memory
Reduce cache_max_entry_count if encountering OOM errors
For tensor parallelism across multiple GPUs, set tp parameter accordingly

Step 3: Pipeline Initialization

Create the inference pipeline by calling the pipeline factory function with the model path and optional backend configuration. The pipeline handles model downloading (if not local), weight loading, chat template detection, tokenizer initialization, and engine startup. The model weights are loaded onto GPU(s) and the KV cache is pre-allocated.

What happens:

Model architecture is detected and mapped to the appropriate backend
Chat template is auto-detected from model config or can be explicitly set via ChatTemplateConfig
Tokenizer is loaded from the model directory
Inference engine is initialized with continuous batching support

Step 4: Prompt Preparation

Format input prompts as either plain strings, lists of strings for batch inference, or OpenAI-format message dictionaries with role/content pairs. For chat models, the pipeline automatically applies the correct chat template to wrap prompts with system messages and special tokens.

Key considerations:

Plain strings are treated as raw prompts
OpenAI-format messages enable multi-turn conversation context
Batch inference accepts a list of prompts for parallel processing

Step 5: Generation Execution

Invoke the pipeline callable with the prepared prompts and an optional GenerationConfig specifying sampling parameters (temperature, top_p, top_k, max_new_tokens). The engine processes prompts through prefill and decode phases using continuous batching for throughput optimization. For streaming use cases, call pipe.stream_infer() to receive tokens incrementally.

What happens:

Prompts are tokenized and scheduled for prefill
KV cache is populated during the prefill phase
Autoregressive token generation runs until stop criteria are met
Results are collected as Response objects with text, token counts, and finish reasons

Step 6: Result Processing and Cleanup

Extract generated text from the Response objects. Optionally access logits or hidden states by configuring output_logits or output_last_hidden_state in GenerationConfig. Release GPU resources by calling pipe.close() or using the pipeline as a context manager with the with statement.

Key considerations:

Always release the pipeline when done to free GPU memory
Use the context manager pattern for automatic cleanup
Response objects contain text, input_token_len, generate_token_len, and finish_reason fields

Execution Diagram

GitHub URL

Workflow Repository