Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:InternLM Lmdeploy LLM Offline Batch Inference

From Leeroopedia
Revision as of 11:01, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/InternLM_Lmdeploy_LLM_Offline_Batch_Inference.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains LLM_Ops, Inference
Last Updated 2026-02-07 15:00 GMT

Overview

End-to-end process for performing offline batch inference on Large Language Models using the LMDeploy Python pipeline API.

Description

This workflow covers the standard procedure for loading a pre-trained LLM (from HuggingFace Hub, ModelScope, or a local path) and generating text completions for one or more prompts using LMDeploy's high-performance inference pipeline. The pipeline automatically selects the optimal backend (TurboMind or PyTorch), manages KV cache allocation, applies chat templates, and supports features such as tensor parallelism, streaming output, and sampling parameter customization. The output is a list of generated responses with associated metadata (token counts, finish reason).

Usage

Execute this workflow when you have one or more text prompts and need to generate LLM completions locally without deploying an API server. Typical scenarios include batch processing datasets, running evaluations, prototyping prompt templates, or integrating LLM inference into offline data pipelines. Requires a GPU with sufficient VRAM for the target model (e.g., 16GB+ for 7B models at FP16, less with quantization).

Execution Steps

Step 1: Environment Setup

Install LMDeploy in a conda environment with Python 3.10-3.13. The package is available via pip and ships with pre-built CUDA 12 binaries. For RTX 50-series GPUs, install the CUDA 12.8 variant. Verify the installation by importing lmdeploy.

Key considerations:

  • Ensure CUDA toolkit compatibility with your GPU
  • Create an isolated conda environment to avoid dependency conflicts
  • For PyTorch backend usage, also install triton>=2.1.0

Step 2: Model Selection and Configuration

Choose the target model by specifying a HuggingFace model ID (e.g., internlm/internlm3-8b-instruct), a ModelScope model ID, or a local directory path. Optionally configure the inference backend by creating a TurbomindEngineConfig or PytorchEngineConfig with parameters such as tensor parallelism degree, session length, KV cache memory ratio, and maximum batch size.

Key considerations:

  • LMDeploy auto-selects TurboMind backend by default when the model architecture is supported
  • The cache_max_entry_count parameter (default 0.8) controls KV cache GPU memory allocation from free memory
  • Reduce cache_max_entry_count if encountering OOM errors
  • For tensor parallelism across multiple GPUs, set tp parameter accordingly

Step 3: Pipeline Initialization

Create the inference pipeline by calling the pipeline factory function with the model path and optional backend configuration. The pipeline handles model downloading (if not local), weight loading, chat template detection, tokenizer initialization, and engine startup. The model weights are loaded onto GPU(s) and the KV cache is pre-allocated.

What happens:

  • Model architecture is detected and mapped to the appropriate backend
  • Chat template is auto-detected from model config or can be explicitly set via ChatTemplateConfig
  • Tokenizer is loaded from the model directory
  • Inference engine is initialized with continuous batching support

Step 4: Prompt Preparation

Format input prompts as either plain strings, lists of strings for batch inference, or OpenAI-format message dictionaries with role/content pairs. For chat models, the pipeline automatically applies the correct chat template to wrap prompts with system messages and special tokens.

Key considerations:

  • Plain strings are treated as raw prompts
  • OpenAI-format messages enable multi-turn conversation context
  • Batch inference accepts a list of prompts for parallel processing

Step 5: Generation Execution

Invoke the pipeline callable with the prepared prompts and an optional GenerationConfig specifying sampling parameters (temperature, top_p, top_k, max_new_tokens). The engine processes prompts through prefill and decode phases using continuous batching for throughput optimization. For streaming use cases, call pipe.stream_infer() to receive tokens incrementally.

What happens:

  • Prompts are tokenized and scheduled for prefill
  • KV cache is populated during the prefill phase
  • Autoregressive token generation runs until stop criteria are met
  • Results are collected as Response objects with text, token counts, and finish reasons

Step 6: Result Processing and Cleanup

Extract generated text from the Response objects. Optionally access logits or hidden states by configuring output_logits or output_last_hidden_state in GenerationConfig. Release GPU resources by calling pipe.close() or using the pipeline as a context manager with the with statement.

Key considerations:

  • Always release the pipeline when done to free GPU memory
  • Use the context manager pattern for automatic cleanup
  • Response objects contain text, input_token_len, generate_token_len, and finish_reason fields

Execution Diagram

GitHub URL

Workflow Repository