Workflow:InternLM Lmdeploy LLM API Server Deployment

Knowledge Sources	LMDeploy LMDeploy Docs API Server Guide
Domains	LLM_Ops, Model_Serving, Inference
Last Updated	2026-02-07 15:00 GMT

Overview

End-to-end process for deploying a Large Language Model as an OpenAI-compatible HTTP API server using LMDeploy.

Description

This workflow covers launching an LLM as a production-ready REST API service that exposes OpenAI-compatible endpoints (/v1/chat/completions, /v1/completions, /v1/models). It supports deployment via the lmdeploy CLI, Docker containers, or Kubernetes clusters. The server leverages continuous batching, tensor parallelism, and KV cache management for high-throughput serving. Clients interact with the service using the standard OpenAI Python SDK, cURL, or LMDeploy's built-in APIClient.

Usage

Execute this workflow when you need to serve an LLM model over HTTP for integration with applications, chatbots, or multi-user access. This is the standard approach for production deployments where multiple clients send concurrent requests. Suitable when you need an OpenAI-compatible API drop-in replacement using a self-hosted model.

Execution Steps

Step 1: Environment and Infrastructure Setup

Install LMDeploy and verify GPU availability. For bare-metal deployments, install via pip in a conda environment. For containerized deployments, pull the official Docker image (openmmlab/lmdeploy:latest). For Kubernetes, prepare deployment manifests with GPU resource requests and persistent volume claims for model storage.

Key considerations:

Docker deployment requires --runtime nvidia and --gpus all flags
Kubernetes requires NVIDIA device plugin for GPU scheduling
Set HUGGING_FACE_HUB_TOKEN environment variable for gated models

Step 2: Model Selection and Engine Configuration

Choose the target model and configure engine parameters. Key settings include tensor parallelism (--tp), session length (--session-len), KV cache ratio (--cache-max-entry-count), and maximum batch size. Select the backend (turbomind or pytorch) based on model architecture compatibility and hardware.

Key considerations:

TurboMind backend provides highest performance on NVIDIA GPUs
PyTorch backend offers broader model compatibility and multi-platform support
For quantized models, specify --model-format (awq, gptq, etc.)
Session length determines maximum context window for all requests

Step 3: Server Launch

Start the API server using the lmdeploy CLI command with the model path and configuration options. The server binds to a configurable host and port (default 0.0.0.0:23333). The model is loaded, engine initialized, and the HTTP server starts accepting requests. A Swagger UI is available at the root URL for API exploration.

What happens:

Model weights are loaded onto GPU(s) with optional quantization
KV cache memory is pre-allocated based on configuration
FastAPI server starts with OpenAI-compatible route handlers
Server registers with proxy server if --proxy-url is specified

Step 4: Client Integration

Connect to the running server using the OpenAI Python SDK, LMDeploy's APIClient, cURL, or any HTTP client. Configure the client with the server's base URL and any API keys. Send chat completion or text completion requests with model name, messages, and sampling parameters.

Key considerations:

Use client.models.list() to discover the served model name
Streaming responses are supported via stream=True parameter
API keys can be enforced via --api-keys server argument
Async clients (AsyncOpenAI) are supported for concurrent requests

Step 5: Production Hardening

For production deployments, configure health monitoring, load balancing, and scaling. Use the proxy server (lmdeploy serve proxy) to distribute requests across multiple model instances. Deploy behind a reverse proxy for TLS termination. Monitor server metrics and adjust engine parameters for optimal throughput.

Key considerations:

Use torchrun for launching multiple API servers with tensor parallelism
Proxy server enables load balancing across multiple model replicas
Kubernetes services provide automatic load balancing and scaling
Monitor for OOM errors and adjust cache_max_entry_count accordingly

Execution Diagram

GitHub URL

Workflow Repository