Implementation:InternLM Lmdeploy Profile Restful Api

Knowledge Sources	InternLM_Lmdeploy
Domains	Benchmarking, Performance, API
Last Updated	2026-02-07 15:00 GMT

Overview

An asynchronous benchmarking script adapted from SGLang/vLLM that profiles the serving throughput and latency of RESTful API endpoints across multiple LLM inference backends including lmdeploy, vLLM, SGLang, and TensorRT-LLM.

Description

The profile_restful_api.py script is a serving benchmark tool that measures online inference performance by sending concurrent HTTP requests to an API server and recording per-request latency metrics. It supports multiple backends and API protocols:

Supported backends:

lmdeploy / lmdeploy-chat: OpenAI-compatible completions/chat API (default port 23333)
vllm / vllm-chat: OpenAI-compatible API (default port 8000)
sglang / sglang-native / sglang-oai / sglang-oai-chat: SGLang APIs (default port 30000)
trt: TensorRT-LLM generate_stream endpoint

Key data structures:

RequestFuncInput: Encapsulates request parameters (prompt, API URL, lengths, model, image data, extra body).
RequestFuncOutput: Captures response metrics (generated text, success flag, latency, TTFT, inter-token latencies).
BenchmarkMetrics: Aggregated metrics including throughput, latency percentiles, and token counts.
DatasetRow: Dataset entry with prompt, token lengths, and optional vision data.

Dataset support:

sharegpt: ShareGPT conversational dataset with automatic download
random: Synthetic prompts with configurable input/output lengths
image: Vision benchmarking with generated images at configurable resolutions (360p to 4K)

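Request generation: When a finite request rate is specified, request arrival times follow a Poisson process, so the gaps between consecutive requests are exponentially distributed with a mean of 1/rate seconds. When the rate is infinity, all requests are sent at once.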
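
The pacing described above can be sketched as follows. This is a minimal illustration, not the script's own code: the names sample_intervals and paced_requests are hypothetical, and it assumes that a Poisson arrival process is realized by sleeping for exponentially distributed gaps between requests.

```python
import asyncio
import random
from typing import AsyncIterator, List, Optional, Sequence


def sample_intervals(n: int, request_rate: float,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Return n inter-arrival gaps in seconds (hypothetical helper)."""
    rng = rng or random.Random()
    if request_rate == float("inf"):
        # Infinite rate: no waiting, every request is issued immediately.
        return [0.0] * n
    # For a Poisson process with rate r, gaps are exponential with mean 1/r.
    return [rng.expovariate(request_rate) for _ in range(n)]


async def paced_requests(requests: Sequence[str], request_rate: float,
                         rng: Optional[random.Random] = None) -> AsyncIterator[str]:
    """Yield requests one by one, sleeping for a sampled gap after each."""
    gaps = sample_intervals(len(requests), request_rate, rng)
    for request, gap in zip(requests, gaps):
        yield request
        if gap > 0:
            await asyncio.sleep(gap)
```

A caller would typically iterate with `async for` and start one task per yielded request, so that slow responses do not delay later arrivals.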

Multi-rate benchmarking: The --multi flag enables sweeping across a range of request rates to generate throughput-latency curves.
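
As a rough sketch of how a start,stop,step string such as the default "2,34,2" could be expanded into the list of rates to sweep: the helper name expand_rate_range is hypothetical, and the exclusive upper bound (Python range semantics) is an assumption that should be checked against the script.

```python
from typing import List


def expand_rate_range(spec: str) -> List[int]:
    """Expand 'start,stop,step' into request rates (hypothetical helper)."""
    parts = [int(p) for p in spec.split(",")]
    if len(parts) == 3:
        start, stop, step = parts
        # Assumption: stop is exclusive, as with Python's built-in range().
        return list(range(start, stop, step))
    # Assumption: any other comma-separated list is taken as explicit rates.
    return parts


rates = expand_rate_range("2,34,2")
print(rates[0], rates[-1], len(rates))  # prints: 2 32 16
```

Under these assumptions each rate in the list would be benchmarked in turn, giving one point per rate on the throughput-latency curve.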

Results are saved to CSV and JSONL files with full metric breakdowns.
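
The two output formats serve different purposes: CSV is convenient for one row per request rate, while JSONL can hold nested per-request detail. The sketch below illustrates that split using only the standard library. The function name save_results and the field names used by callers are hypothetical and do not reflect the script's actual column layout.

```python
import csv
import json
from typing import Any, Dict, List


def save_results(summary_rows: List[Dict[str, Any]],
                 detail_records: List[Dict[str, Any]],
                 csv_path: str, jsonl_path: str) -> None:
    """Write per-rate summaries to CSV and detailed records to JSONL."""
    if summary_rows:
        with open(csv_path, "w", newline="") as f:
            # Column order follows the keys of the first summary row.
            writer = csv.DictWriter(f, fieldnames=list(summary_rows[0].keys()))
            writer.writeheader()
            writer.writerows(summary_rows)
    with open(jsonl_path, "w") as f:
        for record in detail_records:
            # One JSON object per line; nested lists (e.g. latencies) are fine.
            f.write(json.dumps(record) + "\n")
```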

Usage

Use this script to benchmark an API server for any of the supported backends (OpenAI-compatible, SGLang, or TensorRT-LLM). The server process must already be running when the benchmark starts.

Code Reference

Source Location

Repository: InternLM_Lmdeploy
File: benchmark/profile_restful_api.py
Lines: 1-1500

Signature

@dataclass
class RequestFuncInput:
    prompt: str
    api_url: str
    prompt_len: int
    output_len: int
    model: str
    image_data: Optional[List[str]]
    extra_request_body: Dict[str, Any]

@dataclass
class RequestFuncOutput:
    generated_text: str = ''
    success: bool = False
    latency: float = 0.0
    ttft: float = 0.0
    itl: List[float] = field(default_factory=list)
    prompt_len: int = 0
    output_len: int = 0
    error: str = ''

async def benchmark(backend, api_url, model_id, tokenizer,
                    input_requests, request_rate, disable_tqdm,
                    extra_request_body): ...

def run_benchmark(args_: argparse.Namespace): ...
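
To illustrate how a list of RequestFuncOutput records might be reduced to aggregate numbers of the kind BenchmarkMetrics holds, here is a minimal sketch. The summarize and percentile helpers, the returned key names, and the nearest-rank percentile method are assumptions made for illustration. The script's own aggregation may differ.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RequestFuncOutput:
    generated_text: str = ''
    success: bool = False
    latency: float = 0.0
    ttft: float = 0.0
    itl: List[float] = field(default_factory=list)
    prompt_len: int = 0
    output_len: int = 0
    error: str = ''


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list (illustrative choice)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(outputs: List[RequestFuncOutput], duration_s: float) -> Dict[str, float]:
    """Aggregate successful requests over a benchmark of duration_s seconds."""
    ok = [o for o in outputs if o.success]
    if not ok or duration_s <= 0:
        raise ValueError("need at least one successful request and a positive duration")
    total_output_tokens = sum(o.output_len for o in ok)
    return {
        "completed": len(ok),
        "request_throughput": len(ok) / duration_s,            # requests/s
        "output_throughput": total_output_tokens / duration_s,  # tokens/s
        "mean_ttft_ms": 1000 * sum(o.ttft for o in ok) / len(ok),
        "p99_latency_ms": 1000 * percentile([o.latency for o in ok], 99),
    }
```

Failed requests are excluded from every statistic here, which keeps latency figures from being skewed by requests that returned early with an error.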

Import

# Standalone script, not typically imported
# Run directly:
# python benchmark/profile_restful_api.py --backend lmdeploy --num-prompts 1000

I/O Contract

Inputs

Name	Type	Required	Description
--backend	str	Yes	Backend type: lmdeploy, lmdeploy-chat, vllm, sglang, trt, etc.
--host	str	No	Server hostname (default: 0.0.0.0)
--port	int	No	Server port (defaults to the selected backend's standard port, e.g. 23333 for lmdeploy)
--dataset-name	str	No	Dataset: sharegpt, random, or image (default: sharegpt)
--dataset-path	str	No	Path to dataset file
--num-prompts	int	No	Number of prompts (default: 1000)
--request-rate	float	No	Requests per second (default: inf)
--random-input-len	int	No	Input token length for random dataset
--random-output-len	int	No	Output token length for random dataset
--multi	flag	No	Enable multi-rate benchmarking
--request-rate-range	str	No	Rate range as start,stop,step (default: 2,34,2)
--model	str	No	Model name/path for API requests
--image-count	int	No	Images per request for image dataset (default: 1)
--image-resolution	str	No	Resolution: 4k, 1080p, 720p, 360p, or HxW (default: 1080p)

Outputs

Name	Type	Description
Console output	text	Formatted benchmark results with throughput and latency stats
CSV file	file	Tabular results with per-rate metrics
JSONL file	file	Detailed results including per-request data

Usage Examples

# Benchmark lmdeploy server with ShareGPT
# python benchmark/profile_restful_api.py \
#     --backend lmdeploy \
#     --host 127.0.0.1 \
#     --port 23333 \
#     --dataset-name sharegpt \
#     --dataset-path /path/to/ShareGPT.json \
#     --num-prompts 1000 \
#     --request-rate 10

# Multi-rate sweep for throughput-latency curve
# python benchmark/profile_restful_api.py \
#     --backend lmdeploy \
#     --dataset-name random \
#     --random-input-len 1024 \
#     --random-output-len 512 \
#     --multi \
#     --request-rate-range "2,34,2"

# Vision model benchmarking
# python benchmark/profile_restful_api.py \
#     --backend lmdeploy-chat \
#     --dataset-name image \
#     --image-count 2 \
#     --image-resolution 720p \
#     --random-input-len 128 \
#     --random-output-len 256

Related Pages

Environment:InternLM_Lmdeploy_CUDA_GPU_Runtime
