Implementation:InternLM Lmdeploy Profile Restful Api
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, Performance, API |
| Last Updated | 2026-02-07 15:00 GMT |
Overview
An asynchronous benchmarking script adapted from SGLang/vLLM that profiles the serving throughput and latency of RESTful API endpoints across multiple LLM inference backends including lmdeploy, vLLM, SGLang, and TensorRT-LLM.
Description
The profile_restful_api.py script is a comprehensive serving benchmark tool that measures online inference performance by sending concurrent HTTP requests to API servers. It supports multiple backends and API protocols:
Supported backends:
- lmdeploy/lmdeploy-chat: OpenAI-compatible completions/chat API on port 23333
- vllm/vllm-chat: OpenAI-compatible API on port 8000
- sglang/sglang-native/sglang-oai/sglang-oai-chat: SGLang APIs on port 30000
- trt: TensorRT-LLM generate_stream endpoint
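To make the per-backend port defaults concrete, here is a minimal sketch of how a base URL could be resolved. The helper name `resolve_base_url` and the `DEFAULT_PORTS` table are hypothetical and not taken from the script; only the port numbers come from the list above. No default is listed for trt, so the sketch requires an explicit port for it.

```python
from typing import Optional

# Port numbers as listed in this document; "trt" has no default listed here.
DEFAULT_PORTS = {
    "lmdeploy": 23333,
    "lmdeploy-chat": 23333,
    "vllm": 8000,
    "vllm-chat": 8000,
    "sglang": 30000,
    "sglang-native": 30000,
    "sglang-oai": 30000,
    "sglang-oai-chat": 30000,
}


def resolve_base_url(backend: str, host: str = "0.0.0.0",
                     port: Optional[int] = None) -> str:
    """Hypothetical helper: build the server base URL for a backend."""
    if port is None:
        if backend not in DEFAULT_PORTS:
            raise ValueError(f"no default port known for backend {backend!r}")
        port = DEFAULT_PORTS[backend]
    return f"http://{host}:{port}"
```

An explicit `--port` value would simply bypass the table, which matches the "auto-detected per backend" behavior described for that flag.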
Key data structures:
- RequestFuncInput: Encapsulates request parameters (prompt, API URL, lengths, model, image data, extra body).
- RequestFuncOutput: Captures response metrics (generated text, success flag, latency, TTFT, inter-token latencies).
- BenchmarkMetrics: Aggregated metrics including throughput, latency percentiles, and token counts.
- DatasetRow: Dataset entry with prompt, token lengths, and optional vision data.
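The step from per-request outputs to aggregated metrics can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: `Output` is a trimmed stand-in for RequestFuncOutput, `summarize` and `percentile` are hypothetical names, and the nearest-rank percentile may differ from whatever method the script actually uses.

```python
import math
import statistics
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Output:
    """Trimmed, hypothetical stand-in for RequestFuncOutput."""
    success: bool
    latency: float   # end-to-end latency in seconds
    ttft: float      # time to first token in seconds
    output_len: int  # number of generated tokens


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (assumption: the script may interpolate)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(outputs: List[Output], duration_s: float) -> Dict[str, float]:
    """Aggregate successful requests into throughput and TTFT statistics."""
    ok = [o for o in outputs if o.success]
    if not ok:
        raise ValueError("no successful requests to summarize")
    ttfts = [o.ttft for o in ok]
    return {
        "completed": len(ok),
        "request_throughput": len(ok) / duration_s,
        "output_throughput": sum(o.output_len for o in ok) / duration_s,
        "mean_ttft_ms": statistics.mean(ttfts) * 1000,
        "median_ttft_ms": statistics.median(ttfts) * 1000,
        "p99_ttft_ms": percentile(ttfts, 99) * 1000,
    }
```

Failed requests are excluded before aggregation so that a zero latency recorded for an errored request does not pull the percentiles down.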
Dataset support:
- sharegpt: ShareGPT conversational dataset with automatic download
- random: Synthetic prompts with configurable input/output lengths
- image: Vision benchmarking with generated images at configurable resolutions (360p to 4K)
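For the image dataset, a resolution string has to be turned into pixel dimensions. The sketch below shows one way to do that; `parse_resolution` and `PRESETS` are hypothetical names, the preset sizes are the standard 16:9 dimensions for those labels, and the custom form is read as height first, following the HxW notation used for the --image-resolution flag.

```python
from typing import Tuple

# Standard 16:9 pixel sizes for the named presets, stored as (width, height).
PRESETS = {
    "4k": (3840, 2160),
    "1080p": (1920, 1080),
    "720p": (1280, 720),
    "360p": (640, 360),
}


def parse_resolution(spec: str) -> Tuple[int, int]:
    """Hypothetical helper: return (width, height) for a preset or 'HxW' string."""
    key = spec.strip().lower()
    if key in PRESETS:
        return PRESETS[key]
    height_text, sep, width_text = key.partition("x")
    if not sep or not height_text.isdigit() or not width_text.isdigit():
        raise ValueError(f"unrecognized resolution: {spec!r}")
    height, width = int(height_text), int(width_text)
    if height == 0 or width == 0:
        raise ValueError(f"resolution must be positive: {spec!r}")
    return width, height
```

For example, `parse_resolution("512x768")` is read as height 512 and width 768 under the HxW assumption.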
Request generation: Uses a Poisson process to space request arrivals when a finite request rate is specified, and sends all requests at once when the rate is infinite (the default).
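The arrival model can be sketched as below. In a Poisson process with rate λ requests per second, the gaps between consecutive arrivals are exponentially distributed with mean 1/λ. The function names `sample_gaps` and `paced` are hypothetical, and the sketch uses the standard library's `random.expovariate` rather than whatever sampler the script itself calls.

```python
import asyncio
import math
import random
from typing import AsyncIterator, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_gaps(num_requests: int, request_rate: float,
                seed: Optional[int] = None) -> List[float]:
    """Return the delay (seconds) to wait after each request is issued."""
    if math.isinf(request_rate):
        return [0.0] * num_requests  # infinite rate: no pacing at all
    rng = random.Random(seed)
    # Inter-arrival times of a Poisson process with rate r are Exp(r).
    return [rng.expovariate(request_rate) for _ in range(num_requests)]


async def paced(requests: Sequence[T], request_rate: float,
                seed: Optional[int] = None) -> AsyncIterator[T]:
    """Yield requests one by one, sleeping for a sampled gap between them."""
    gaps = sample_gaps(len(requests), request_rate, seed)
    for request, gap in zip(requests, gaps):
        yield request
        if gap > 0:
            await asyncio.sleep(gap)
```

A caller would typically write `async for req in paced(reqs, rate)` and start one task per yielded request, so that slow responses do not delay later arrivals.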
Multi-rate benchmarking: The --multi flag enables sweeping across a range of request rates to generate throughput-latency curves.
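A sweep specification such as "2,34,2" has to be expanded into a list of request rates before the benchmark loop runs. The sketch below assumes Python `range` semantics, so the stop value is exclusive; whether the script treats the stop the same way is an assumption, and `parse_rate_range` is a hypothetical name.

```python
from typing import List


def parse_rate_range(spec: str) -> List[int]:
    """Expand 'start,stop,step' into request rates (stop exclusive, by assumption)."""
    parts = [int(p) for p in spec.split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 'start,stop,step', got {spec!r}")
    start, stop, step = parts
    if step <= 0:
        raise ValueError("step must be a positive integer")
    return list(range(start, stop, step))
```

Under this assumption the default "2,34,2" yields the 16 rates 2, 4, ..., 32, and each rate would be benchmarked in turn to produce one point on the throughput-latency curve.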
Results are saved to CSV and JSONL files with full metric breakdowns.
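One way to persist a result record in both formats is sketched below. The helper `append_result`, the file paths, and the column names are illustrative assumptions; the script's actual file naming and schema are not shown in this document.

```python
import csv
import json
from pathlib import Path
from typing import Any, Dict


def append_result(result: Dict[str, Any], csv_path: Path,
                  jsonl_path: Path) -> None:
    """Hypothetical helper: append one benchmark record to JSONL and CSV files."""
    # JSONL: one self-contained JSON object per line.
    with jsonl_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")

    # CSV: write the header only when the file is new or empty.
    write_header = not csv_path.exists() or csv_path.stat().st_size == 0
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(result))
        if write_header:
            writer.writeheader()
        writer.writerow(result)
```

Appending rather than overwriting lets a multi-rate sweep accumulate one row per request rate in the same file; this sketch assumes every record has the same keys.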
Usage
Use this script to benchmark any OpenAI-compatible or SGLang-compatible API server, or a TensorRT-LLM generate_stream endpoint. The target server must already be running.
Code Reference
Source Location
- Repository: InternLM_Lmdeploy
- File: benchmark/profile_restful_api.py
- Lines: 1-1500
Signature
@dataclass
class RequestFuncInput:
    prompt: str
    api_url: str
    prompt_len: int
    output_len: int
    model: str
    image_data: Optional[List[str]]
    extra_request_body: Dict[str, Any]

@dataclass
class RequestFuncOutput:
    generated_text: str = ''
    success: bool = False
    latency: float = 0.0
    ttft: float = 0.0
    itl: List[float] = field(default_factory=list)
    prompt_len: int = 0
    output_len: int = 0
    error: str = ''

async def benchmark(backend, api_url, model_id, tokenizer,
                    input_requests, request_rate, disable_tqdm,
                    extra_request_body): ...

def run_benchmark(args_: argparse.Namespace): ...
Import
# Standalone script, not typically imported
# Run directly:
# python benchmark/profile_restful_api.py --backend lmdeploy --num-prompts 1000
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| --backend | str | Yes | Backend type: lmdeploy, lmdeploy-chat, vllm, sglang, trt, etc. |
| --host | str | No | Server hostname (default: 0.0.0.0) |
| --port | int | No | Server port (auto-detected per backend) |
| --dataset-name | str | No | Dataset: sharegpt, random, or image (default: sharegpt) |
| --dataset-path | str | No | Path to dataset file |
| --num-prompts | int | No | Number of prompts (default: 1000) |
| --request-rate | float | No | Requests per second (default: inf) |
| --random-input-len | int | No | Input token length for random dataset |
| --random-output-len | int | No | Output token length for random dataset |
| --multi | flag | No | Enable multi-rate benchmarking |
| --request-rate-range | str | No | Rate range as start,stop,step (default: 2,34,2) |
| --model | str | No | Model name/path for API requests |
| --image-count | int | No | Images per request for image dataset (default: 1) |
| --image-resolution | str | No | Resolution: 4k, 1080p, 720p, 360p, or HxW (default: 1080p) |
Outputs
| Name | Type | Description |
|---|---|---|
| Console output | text | Formatted benchmark results with throughput and latency stats |
| CSV file | file | Tabular results with per-rate metrics |
| JSONL file | file | Detailed results including per-request data |
Usage Examples
# Benchmark lmdeploy server with ShareGPT
# python benchmark/profile_restful_api.py \
#     --backend lmdeploy \
#     --host 127.0.0.1 \
#     --port 23333 \
#     --dataset-name sharegpt \
#     --dataset-path /path/to/ShareGPT.json \
#     --num-prompts 1000 \
#     --request-rate 10

# Multi-rate sweep for throughput-latency curve
# python benchmark/profile_restful_api.py \
#     --backend lmdeploy \
#     --dataset-name random \
#     --random-input-len 1024 \
#     --random-output-len 512 \
#     --multi \
#     --request-rate-range "2,34,2"

# Vision model benchmarking
# python benchmark/profile_restful_api.py \
#     --backend lmdeploy-chat \
#     --dataset-name image \
#     --image-count 2 \
#     --image-resolution 720p \
#     --random-input-len 128 \
#     --random-output-len 256