Workflow:InternLM Lmdeploy LLM API Server Deployment
| Knowledge Sources | |
|---|---|
| Domains | LLM_Ops, Model_Serving, Inference |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
End-to-end process for deploying a Large Language Model as an OpenAI-compatible HTTP API server using LMDeploy.
Description
This workflow covers launching an LLM as a production-ready REST API service that exposes OpenAI-compatible endpoints (/v1/chat/completions, /v1/completions, /v1/models). It supports deployment via the lmdeploy CLI, Docker containers, or Kubernetes clusters. The server leverages continuous batching, tensor parallelism, and KV cache management for high-throughput serving. Clients interact with the service using the standard OpenAI Python SDK, cURL, or LMDeploy's built-in APIClient.
Usage
Execute this workflow when you need to serve an LLM model over HTTP for integration with applications, chatbots, or multi-user access. This is the standard approach for production deployments where multiple clients send concurrent requests. Suitable when you need an OpenAI-compatible API drop-in replacement using a self-hosted model.
Execution Steps
Step 1: Environment and Infrastructure Setup
Install LMDeploy and verify GPU availability. For bare-metal deployments, install via pip in a conda environment. For containerized deployments, pull the official Docker image (openmmlab/lmdeploy:latest). For Kubernetes, prepare deployment manifests with GPU resource requests and persistent volume claims for model storage.
Key considerations:
- Docker deployment requires --runtime nvidia and --gpus all flags
- Kubernetes requires NVIDIA device plugin for GPU scheduling
- Set HUGGING_FACE_HUB_TOKEN environment variable for gated models
Step 2: Model Selection and Engine Configuration
Choose the target model and configure engine parameters. Key settings include tensor parallelism (--tp), session length (--session-len), KV cache ratio (--cache-max-entry-count), and maximum batch size. Select the backend (turbomind or pytorch) based on model architecture compatibility and hardware.
Key considerations:
- TurboMind backend provides highest performance on NVIDIA GPUs
- PyTorch backend offers broader model compatibility and multi-platform support
- For quantized models, specify --model-format (awq, gptq, etc.)
- Session length determines maximum context window for all requests
Step 3: Server Launch
Start the API server using the lmdeploy CLI command with the model path and configuration options. The server binds to a configurable host and port (default 0.0.0.0:23333). The model is loaded, engine initialized, and the HTTP server starts accepting requests. A Swagger UI is available at the root URL for API exploration.
What happens:
- Model weights are loaded onto GPU(s) with optional quantization
- KV cache memory is pre-allocated based on configuration
- FastAPI server starts with OpenAI-compatible route handlers
- Server registers with proxy server if --proxy-url is specified
Step 4: Client Integration
Connect to the running server using the OpenAI Python SDK, LMDeploy's APIClient, cURL, or any HTTP client. Configure the client with the server's base URL and any API keys. Send chat completion or text completion requests with model name, messages, and sampling parameters.
Key considerations:
- Use client.models.list() to discover the served model name
- Streaming responses are supported via stream=True parameter
- API keys can be enforced via --api-keys server argument
- Async clients (AsyncOpenAI) are supported for concurrent requests
Step 5: Production Hardening
For production deployments, configure health monitoring, load balancing, and scaling. Use the proxy server (lmdeploy serve proxy) to distribute requests across multiple model instances. Deploy behind a reverse proxy for TLS termination. Monitor server metrics and adjust engine parameters for optimal throughput.
Key considerations:
- Use torchrun for launching multiple API servers with tensor parallelism
- Proxy server enables load balancing across multiple model replicas
- Kubernetes services provide automatic load balancing and scaling
- Monitor for OOM errors and adjust cache_max_entry_count accordingly