Implementation:InternLM Lmdeploy Serve Api Server
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, REST_API |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
A concrete tool from the LMDeploy library for launching an OpenAI-compatible HTTP API server for LLM inference.
Description
The serve() function and its CLI wrapper lmdeploy serve api_server create a FastAPI/Uvicorn HTTP server that exposes LLM inference through OpenAI-compatible REST endpoints. It supports authentication, CORS, SSL, concurrent request limits, function calling, and reasoning output parsing.
Usage
Use this when deploying an LLM as a production HTTP service. Access it via the CLI for simple deployments, or call the serve() function programmatically to integrate it into larger systems.
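Because the endpoints follow the OpenAI REST schema, any OpenAI-compatible client can talk to a running server. A minimal client sketch, assuming the openai Python package, the default port 23333, and an illustrative model name:
from openai import OpenAI

# Point the OpenAI client at the running api_server (default host:port shown).
client = OpenAI(base_url='http://0.0.0.0:23333/v1',
                api_key='YOUR_API_KEY')  # any non-empty string works if the server was started without api_keys

resp = client.chat.completions.create(
    model='internlm/internlm2_5-7b-chat',  # use an id returned by GET /v1/models
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(resp.choices[0].message.content)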
Code Reference
Source Location
- Repository: lmdeploy
- File: lmdeploy/serve/openai/api_server.py
- Lines: L1388-1408 (serve function)
- CLI: lmdeploy/cli/serve.py L200-334
Signature
def serve(model_path: str,
          model_name: Optional[str] = None,
          backend: Literal['turbomind', 'pytorch'] = 'turbomind',
          backend_config: Optional[Union[PytorchEngineConfig, TurbomindEngineConfig]] = None,
          chat_template_config: Optional[ChatTemplateConfig] = None,
          server_name: str = '0.0.0.0',
          server_port: int = 23333,
          allow_origins: List[str] = ['*'],
          allow_credentials: bool = True,
          allow_methods: List[str] = ['*'],
          allow_headers: List[str] = ['*'],
          log_level: str = 'ERROR',
          api_keys: Optional[Union[List[str], str]] = None,
          ssl: bool = False,
          proxy_url: Optional[str] = None,
          max_log_len: int = None,
          disable_fastapi_docs: bool = False,
          max_concurrent_requests: Optional[int] = None,
          reasoning_parser: Optional[str] = None,
          tool_call_parser: Optional[str] = None,
          allow_terminate_by_client: bool = False,
          enable_abort_handling: bool = False,
          speculative_config: Optional[SpeculativeConfig] = None,
          **kwargs) -> None:
Import
from lmdeploy.serve.openai.api_server import serve
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes | Model path or HuggingFace ID |
| server_name | str | No | Host IP binding (default: '0.0.0.0') |
| server_port | int | No | Port number (default: 23333) |
| backend | str | No | 'turbomind' or 'pytorch' (default: 'turbomind') |
| backend_config | EngineConfig | No | Engine configuration |
| api_keys | List[str] or str | No | Authentication keys |
| ssl | bool | No | Enable HTTPS; requires SSL_KEYFILE and SSL_CERTFILE env vars (see the sketch after this table) |
| max_concurrent_requests | int | No | Request throttling limit |
| tool_call_parser | str | No | Function calling parser name |
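The ssl flag does not take certificate paths as arguments; it reads them from the SSL_KEYFILE and SSL_CERTFILE environment variables. A minimal HTTPS launch sketch, assuming placeholder certificate paths:
import os
from lmdeploy.serve.openai.api_server import serve

# Placeholder paths: point these at a real key/certificate pair.
os.environ['SSL_KEYFILE'] = '/etc/ssl/private/server.key'
os.environ['SSL_CERTFILE'] = '/etc/ssl/certs/server.crt'

serve(
    model_path='internlm/internlm2_5-7b-chat',
    ssl=True,                      # serve over HTTPS using the env vars above
    api_keys=['my-secret-key'],    # clients authenticate with a Bearer token
    max_concurrent_requests=64,    # illustrative throttling limit
)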
Outputs
| Name | Type | Description |
|---|---|---|
| HTTP Server | Running Process | FastAPI/Uvicorn server on host:port with /v1/ endpoints |
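Once the process is up, the OpenAI-style endpoints are reachable under /v1/ on the configured host and port. A quick liveness check, assuming the requests package and that the server follows the standard OpenAI model-listing schema:
import requests

# The Authorization header is only needed when the server was started with api_keys.
r = requests.get('http://0.0.0.0:23333/v1/models',
                 headers={'Authorization': 'Bearer my-secret-key'})
r.raise_for_status()
print([m['id'] for m in r.json()['data']])  # model ids to use in completion requests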
Usage Examples
CLI Launch
# Basic launch
lmdeploy serve api_server internlm/internlm2_5-7b-chat

# With tensor parallelism and custom port
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
    --tp 2 \
    --server-port 8080 \
    --cache-max-entry-count 0.9

# With authentication
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
    --api-keys "key1,key2"
Python Usage
from lmdeploy.serve.openai.api_server import serve
from lmdeploy import TurbomindEngineConfig

serve(
    model_path='internlm/internlm2_5-7b-chat',
    backend_config=TurbomindEngineConfig(tp=2),
    server_port=8080,
    api_keys=['my-secret-key']
)
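PyTorch Backend
The same entry point can run the PyTorch engine instead of TurboMind by switching the backend and config class, as the signature above indicates. A sketch, assuming PytorchEngineConfig accepts a tensor-parallel degree like its TurboMind counterpart:
from lmdeploy import PytorchEngineConfig
from lmdeploy.serve.openai.api_server import serve

serve(
    model_path='internlm/internlm2_5-7b-chat',
    backend='pytorch',                         # use the PyTorch engine instead of TurboMind
    backend_config=PytorchEngineConfig(tp=2),  # assumed to mirror TurbomindEngineConfig(tp=2)
    server_port=8080,
)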
Related Pages
Implements Principle
Requires Environment