
Implementation:InternLM Lmdeploy Serve Api Server

From Leeroopedia


Knowledge Sources
Domains LLM_Serving, REST_API
Last Updated 2026-02-07 15:00 GMT

Overview

A concrete tool from the LMDeploy library for launching an OpenAI-compatible HTTP API server for LLM inference.

Description

The serve() function and its CLI wrapper lmdeploy serve api_server create a FastAPI/Uvicorn HTTP server that exposes LLM inference through OpenAI-compatible REST endpoints. It supports authentication, CORS, SSL, concurrent request limits, function calling, and reasoning output parsing.

Usage

Use this when deploying an LLM as a production HTTP service. Use the CLI for simple deployments, or call the serve() function programmatically to integrate the server into a larger system.

Code Reference

Source Location

  • Repository: lmdeploy
  • File: lmdeploy/serve/openai/api_server.py
  • Lines: L1388-1408 (serve function)
  • CLI: lmdeploy/cli/serve.py L200-334

Signature

def serve(model_path: str,
          model_name: Optional[str] = None,
          backend: Literal['turbomind', 'pytorch'] = 'turbomind',
          backend_config: Optional[Union[PytorchEngineConfig, TurbomindEngineConfig]] = None,
          chat_template_config: Optional[ChatTemplateConfig] = None,
          server_name: str = '0.0.0.0',
          server_port: int = 23333,
          allow_origins: List[str] = ['*'],
          allow_credentials: bool = True,
          allow_methods: List[str] = ['*'],
          allow_headers: List[str] = ['*'],
          log_level: str = 'ERROR',
          api_keys: Optional[Union[List[str], str]] = None,
          ssl: bool = False,
          proxy_url: Optional[str] = None,
          max_log_len: int = None,
          disable_fastapi_docs: bool = False,
          max_concurrent_requests: Optional[int] = None,
          reasoning_parser: Optional[str] = None,
          tool_call_parser: Optional[str] = None,
          allow_terminate_by_client: bool = False,
          enable_abort_handling: bool = False,
          speculative_config: Optional[SpeculativeConfig] = None,
          **kwargs) -> None:

Import

from lmdeploy.serve.openai.api_server import serve

I/O Contract

Inputs

Name                     Type              Required  Description
model_path               str               Yes       Model path or HuggingFace ID
server_name              str               No        Host IP binding (default: '0.0.0.0')
server_port              int               No        Port number (default: 23333)
backend                  str               No        'turbomind' or 'pytorch' (default: 'turbomind')
backend_config           EngineConfig      No        Engine configuration
api_keys                 List[str] or str  No        Authentication keys
ssl                      bool              No        Enable HTTPS (requires SSL_KEYFILE, SSL_CERTFILE env vars)
max_concurrent_requests  int               No        Request throttling limit
tool_call_parser         str               No        Function calling parser name
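As the ssl row above notes, certificate paths come from environment variables rather than function arguments. A minimal sketch of preparing an HTTPS launch, with placeholder paths:

```python
import os

# Sketch only: per the table above, enabling SSL makes the server read
# certificate paths from the SSL_KEYFILE and SSL_CERTFILE environment
# variables. The paths below are placeholders.
os.environ["SSL_KEYFILE"] = "/path/to/key.pem"
os.environ["SSL_CERTFILE"] = "/path/to/cert.pem"

# With these set, an HTTPS launch would look like:
#   lmdeploy serve api_server <model> --ssl
```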

Outputs

Name         Type             Description
HTTP Server  Running Process  FastAPI/Uvicorn server on host:port with /v1/ endpoints
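Since the server exposes OpenAI-compatible /v1/ endpoints, a quick liveness check can target the model-listing route. A sketch, assuming the OpenAI-convention /v1/models endpoint and the default host and port from the signature above:

```python
from urllib.request import Request

# Sketch: build a request against the model-listing endpoint; the
# Authorization header is only needed when api_keys is configured.
req = Request(
    "http://0.0.0.0:23333/v1/models",
    headers={"Authorization": "Bearer my-secret-key"},
)
# urllib.request.urlopen(req) would return a JSON listing of served
# models once the server is running.
```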

Usage Examples

CLI Launch

# Basic launch
lmdeploy serve api_server internlm/internlm2_5-7b-chat

# With tensor parallelism and custom port
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
    --tp 2 \
    --server-port 8080 \
    --cache-max-entry-count 0.9

# With authentication
lmdeploy serve api_server internlm/internlm2_5-7b-chat \
    --api-keys "key1,key2"

Python Usage

from lmdeploy.serve.openai.api_server import serve
from lmdeploy import TurbomindEngineConfig

serve(
    model_path='internlm/internlm2_5-7b-chat',
    backend_config=TurbomindEngineConfig(tp=2),
    server_port=8080,
    api_keys=['my-secret-key']
)
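Once the server is running, clients talk to it exactly as they would to the OpenAI API. A minimal sketch of building a request body for the /v1/chat/completions endpoint, reusing the model name, port, and API key from the examples above (all placeholders):

```python
import json

# Sketch of a client request body for the server's OpenAI-compatible
# /v1/chat/completions endpoint.
payload = {
    "model": "internlm/internlm2_5-7b-chat",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.7,
    "stream": False,
}
body = json.dumps(payload).encode("utf-8")
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer my-secret-key",  # needed only when api_keys is set
}
# urllib.request.urlopen(
#     urllib.request.Request("http://0.0.0.0:8080/v1/chat/completions",
#                            data=body, headers=headers))
```

Any OpenAI SDK client pointed at the server's base URL should work the same way, since the endpoints follow the OpenAI REST schema.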

Related Pages

Implements Principle

Requires Environment
