
Implementation:InternLM Lmdeploy Serve Api Server

From Leeroopedia


Knowledge Sources
Domains LLM_Serving, REST_API
Last Updated 2026-02-07 15:00 GMT

Overview

A concrete tool from the LMDeploy library for launching an OpenAI-compatible HTTP API server for LLM inference.

Description

The serve() function and its CLI wrapper lmdeploy serve api_server create a FastAPI/Uvicorn HTTP server that exposes LLM inference through OpenAI-compatible REST endpoints. It supports authentication, CORS, SSL, concurrent request limits, function calling, and reasoning output parsing.

Usage

Use this when deploying an LLM as a production HTTP service. Use the CLI for simple deployments, or call the serve() function programmatically to integrate the server into a larger system.

Code Reference

Source Location

  • Repository: lmdeploy
  • File: lmdeploy/serve/openai/api_server.py
  • Lines: L1388-1408 (serve function)
  • CLI: lmdeploy/cli/serve.py L200-334

Signature

def serve(model_path: str,
          model_name: Optional[str] = None,
          backend: Literal['turbomind', 'pytorch'] = 'turbomind',
          backend_config: Optional[Union[PytorchEngineConfig, TurbomindEngineConfig]] = None,
          chat_template_config: Optional[ChatTemplateConfig] = None,
          server_name: str = '0.0.0.0',
          server_port: int = 23333,
          allow_origins: List[str] = ['*'],
          allow_credentials: bool = True,
          allow_methods: List[str] = ['*'],
          allow_headers: List[str] = ['*'],
          log_level: str = 'ERROR',
          api_keys: Optional[Union[List[str], str]] = None,
          ssl: bool = False,
          proxy_url: Optional[str] = None,
          max_log_len: int = None,
          disable_fastapi_docs: bool = False,
          max_concurrent_requests: Optional[int] = None,
          reasoning_parser: Optional[str] = None,
          tool_call_parser: Optional[str] = None,
          allow_terminate_by_client: bool = False,
          enable_abort_handling: bool = False,
          speculative_config: Optional[SpeculativeConfig] = None,
          **kwargs) -> None:

Import

from lmdeploy.serve.openai.api_server import serve

I/O Contract

Inputs

Name                     Type              Required  Description
model_path               str               Yes       Model path or HuggingFace ID
server_name              str               No        Host IP binding (default: '0.0.0.0')
server_port              int               No        Port number (default: 23333)
backend                  str               No        'turbomind' or 'pytorch' (default: 'turbomind')
backend_config           EngineConfig      No        Engine configuration
api_keys                 List[str] or str  No        Authentication keys
ssl                      bool              No        Enable HTTPS (requires SSL_KEYFILE, SSL_CERTFILE env vars)
max_concurrent_requests  int               No        Request throttling limit
tool_call_parser         str               No        Function calling parser name
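As the ssl row above notes, certificate paths come from environment variables rather than function arguments. A minimal sketch of preparing an HTTPS launch, with placeholder paths:

```python
import os

# Sketch only: per the table above, enabling SSL makes the server read
# certificate paths from the SSL_KEYFILE and SSL_CERTFILE environment
# variables. The paths below are placeholders.
os.environ["SSL_KEYFILE"] = "/path/to/key.pem"
os.environ["SSL_CERTFILE"] = "/path/to/cert.pem"

# With these set, an HTTPS launch would look like:
#   lmdeploy serve api_server <model> --ssl
```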

Outputs

Name         Type             Description
HTTP Server  Running Process  FastAPI/Uvicorn server on host:port with /v1/ endpoints
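Since the server exposes OpenAI-compatible /v1/ endpoints, a quick liveness check can target the model-listing route. A sketch, assuming the OpenAI-convention /v1/models endpoint and the default host and port from the signature above:

```python
from urllib.request import Request

# Sketch: build a request against the model-listing endpoint; the
# Authorization header is only needed when api_keys is configured.
req = Request(
    "http://0.0.0.0:23333/v1/models",
    headers={"Authorization": "Bearer my-secret-key"},
)
# urllib.request.urlopen(req) would return a JSON listing of served
# models once the server is running.
```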

Usage Examples

CLI Launch

# Basic launch
lmdeploy serve api_server internlm/internlm2_5-7b-chat

# With tensor parallelism and custom port
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
    --tp 2 \
    --server-port 8080 \
    --cache-max-entry-count 0.9

# With authentication
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
    --api-keys "key1,key2"

Python Usage

from lmdeploy.serve.openai.api_server import serve
from lmdeploy import TurbomindEngineConfig

serve(
    model_path='internlm/internlm2_5-7b-chat',
    backend_config=TurbomindEngineConfig(tp=2),
    server_port=8080,
    api_keys=['my-secret-key']
)
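Once the server is running, clients talk to it exactly as they would to the OpenAI API. A minimal sketch of building a request body for the /v1/chat/completions endpoint, reusing the model name, port, and API key from the examples above (all placeholders):

```python
import json

# Sketch of a client request body for the server's OpenAI-compatible
# /v1/chat/completions endpoint.
payload = {
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer my-secret-key",  # needed only when api_keys is set
}
# urllib.request.urlopen(
#     urllib.request.Request("http://0.0.0.0:8080/v1/chat/completions",
#                            data=body, headers=headers))
```

Any OpenAI SDK client pointed at the server's base URL should work the same way, since the endpoints follow the OpenAI REST schema.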

Related Pages

Implements Principle

Requires Environment
