Implementation:InternLM Lmdeploy Profile Restful Api

From Leeroopedia
Revision as of 15:15, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/InternLM_Lmdeploy_Profile_Restful_Api.md)


Knowledge Sources
Domains: Benchmarking, Performance, API
Last Updated: 2026-02-07 15:00 GMT

Overview

An asynchronous benchmarking script adapted from SGLang/vLLM that profiles the serving throughput and latency of RESTful API endpoints across multiple LLM inference backends including lmdeploy, vLLM, SGLang, and TensorRT-LLM.

Description

The profile_restful_api.py script is a comprehensive serving benchmark tool that measures online inference performance by sending concurrent HTTP requests to API servers. It supports multiple backends and API protocols:

Supported backends:

  • lmdeploy / lmdeploy-chat: OpenAI-compatible completions/chat API on port 23333
  • vllm / vllm-chat: OpenAI-compatible API on port 8000
  • sglang / sglang-native / sglang-oai / sglang-oai-chat: SGLang APIs on port 30000
  • trt: TensorRT-LLM generate_stream endpoint
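The default ports listed above can be captured in a small lookup table. This is a minimal sketch rather than the script's actual code: the names DEFAULT_PORTS and resolve_port are hypothetical, and since no default port is stated here for trt, the sketch asks for an explicit port in that case.

```python
from typing import Optional

# Default ports taken from the backend list above.
# DEFAULT_PORTS and resolve_port are hypothetical names used for illustration.
DEFAULT_PORTS = {
    "lmdeploy": 23333,
    "lmdeploy-chat": 23333,
    "vllm": 8000,
    "vllm-chat": 8000,
    "sglang": 30000,
    "sglang-native": 30000,
    "sglang-oai": 30000,
    "sglang-oai-chat": 30000,
}


def resolve_port(backend: str, port: Optional[int] = None) -> int:
    """Return the explicit port if one is given, else the backend's default."""
    if port is not None:
        return port
    if backend not in DEFAULT_PORTS:
        # No default is documented for this backend (e.g. trt) in this sketch.
        raise ValueError(f"no default port known for {backend!r}; pass --port")
    return DEFAULT_PORTS[backend]
```

For example, resolve_port("vllm-chat") yields 8000, while resolve_port("sglang", 31000) keeps the explicit value.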

Key data structures:

  • RequestFuncInput: Encapsulates request parameters (prompt, API URL, lengths, model, image data, extra body).
  • RequestFuncOutput: Captures response metrics (generated text, success flag, latency, TTFT, inter-token latencies).
  • BenchmarkMetrics: Aggregated metrics including throughput, latency percentiles, and token counts.
  • DatasetRow: Dataset entry with prompt, token lengths, and optional vision data.
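To illustrate how per-request outputs might roll up into aggregated metrics, here is a minimal sketch. The Output class is a simplified, hypothetical stand-in for RequestFuncOutput, and the summary keys and the interpolation-based percentile are assumptions, not the fields or method of the real BenchmarkMetrics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Output:
    """Hypothetical, reduced stand-in for RequestFuncOutput."""
    success: bool = False
    latency: float = 0.0  # end-to-end latency in seconds
    ttft: float = 0.0     # time to first token in seconds
    output_len: int = 0   # number of generated tokens


def percentile(values: List[float], pct: float) -> float:
    """Linearly interpolated percentile, with pct in [0, 100]."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def summarize(outputs: List[Output], duration_s: float) -> Dict[str, float]:
    """Aggregate successful requests over a benchmark of duration_s seconds."""
    ok = [o for o in outputs if o.success]
    latencies = [o.latency for o in ok]
    ttfts = [o.ttft for o in ok]
    return {
        "completed": len(ok),
        "request_throughput": len(ok) / duration_s,
        "output_throughput": sum(o.output_len for o in ok) / duration_s,
        "mean_ttft_ms": 1000 * sum(ttfts) / len(ttfts) if ttfts else 0.0,
        "median_latency_ms": 1000 * percentile(latencies, 50),
        "p99_latency_ms": 1000 * percentile(latencies, 99),
    }
```

Failed requests are excluded from every statistic in this sketch, so throughput reflects completed work only.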

Dataset support:

  • sharegpt: ShareGPT conversational dataset with automatic download
  • random: Synthetic prompts with configurable input/output lengths
  • image: Vision benchmarking with generated images at configurable resolutions (360p to 4K)

Request generation: When a finite request rate is specified, request arrivals follow a Poisson process, so the gaps between consecutive requests are exponentially distributed with mean 1/rate seconds. When the rate is infinity, all requests are sent at once.
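The arrival pattern described above can be sketched with the standard library alone. The function names sample_intervals and paced are hypothetical, and the real script may draw its intervals from a different random source; the sketch relies only on the fact that a Poisson process with rate r has exponential inter-arrival gaps with mean 1/r.

```python
import asyncio
import math
import random
from typing import AsyncIterator, Iterable, List, Optional, TypeVar

T = TypeVar("T")


def sample_intervals(n: int, request_rate: float,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Draw n inter-arrival gaps (seconds) for the given requests-per-second rate."""
    if math.isinf(request_rate):
        return [0.0] * n  # infinite rate: no pacing at all
    rng = rng or random.Random()
    # expovariate(lambd) has mean 1 / lambd, matching a Poisson arrival process.
    return [rng.expovariate(request_rate) for _ in range(n)]


async def paced(requests: Iterable[T], request_rate: float) -> AsyncIterator[T]:
    """Yield requests one by one, sleeping an exponential gap after each."""
    for req in requests:
        yield req
        if not math.isinf(request_rate):
            await asyncio.sleep(random.expovariate(request_rate))
```

A caller would typically iterate with `async for req in paced(...)` and spawn one task per yielded request, so that slow responses do not delay later arrivals.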

Multi-rate benchmarking: The --multi flag enables sweeping across a range of request rates to generate throughput-latency curves.
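As a rough illustration of how a start,stop,step string could expand into the list of rates to sweep, consider the sketch below. It assumes Python range semantics, meaning the stop value is excluded; whether the script treats stop as inclusive or exclusive is not stated here, and the function name is hypothetical.

```python
from typing import List


def expand_rate_range(spec: str) -> List[int]:
    """Expand 'start,stop,step' into integer request rates (stop excluded)."""
    parts = [int(p) for p in spec.split(",")]
    if len(parts) != 3:
        raise ValueError("expected a string of the form 'start,stop,step'")
    start, stop, step = parts
    # Assumption: exclusive stop, as with Python's built-in range().
    return list(range(start, stop, step))
```

Under that assumption, the default "2,34,2" expands to the 16 rates 2, 4, ..., 32, and one benchmark run would be executed per rate.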

Results are saved to CSV and JSONL files with full metric breakdowns.
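One plausible way to persist a run in both formats is sketched below. The function name, the result keys, and the choice to keep only scalar values in the CSV row are assumptions made for illustration; the script's actual file layout may differ.

```python
import csv
import json
from pathlib import Path
from typing import Any, Dict


def append_result(result: Dict[str, Any], csv_path: str, jsonl_path: str) -> None:
    """Append one benchmark result to a JSONL file and a CSV file."""
    # JSONL: one complete JSON object per line, including nested lists.
    with open(jsonl_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")

    # CSV: keep scalar fields only, since lists do not fit a flat table.
    row = {k: v for k, v in result.items() if isinstance(v, (int, float, str))}
    path = Path(csv_path)
    needs_header = not path.exists() or path.stat().st_size == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(row))
        if needs_header:
            writer.writeheader()
        writer.writerow(row)
```

In a multi-rate sweep, calling this once per request rate produces one CSV row and one JSONL line per rate.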

Usage

Use this script to benchmark any OpenAI-compatible or SGLang-compatible API server. The target server must already be running before the benchmark starts.

Code Reference

Source Location

Signature

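import argparse
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

@dataclass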
class RequestFuncInput:
    prompt: str
    api_url: str
    prompt_len: int
    output_len: int
    model: str
    image_data: Optional[List[str]]
    extra_request_body: Dict[str, Any]

@dataclass
class RequestFuncOutput:
    generated_text: str = ''
    success: bool = False
    latency: float = 0.0
    ttft: float = 0.0
    itl: List[float] = field(default_factory=list)
    prompt_len: int = 0
    output_len: int = 0
    error: str = ''

async def benchmark(backend, api_url, model_id, tokenizer,
                    input_requests, request_rate, disable_tqdm,
                    extra_request_body): ...

def run_benchmark(args_: argparse.Namespace): ...
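The timing fields of RequestFuncOutput (ttft, itl, latency) can be derived from the arrival times of streamed chunks. The helper below is a hypothetical sketch of that arithmetic, assuming one timestamp is recorded per received chunk; it is not taken from the script.

```python
from typing import List, Tuple


def timing_from_chunks(start: float,
                       chunk_times: List[float]) -> Tuple[float, List[float], float]:
    """Return (ttft, itl, latency) from a request start time and chunk timestamps.

    All values share the unit of the inputs (e.g. seconds from time.perf_counter()).
    """
    if not chunk_times:
        return 0.0, [], 0.0
    ttft = chunk_times[0] - start  # time to first token
    # Inter-token latencies: gaps between consecutive chunks.
    itl = [b - a for a, b in zip(chunk_times, chunk_times[1:])]
    latency = chunk_times[-1] - start  # end-to-end latency
    return ttft, itl, latency
```

With a start of 10.0 and chunks at 10.5, 10.75, and 11.25, this gives a TTFT of 0.5, inter-token latencies of 0.25 and 0.5, and a total latency of 1.25.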

Import

# Standalone script, not typically imported
# Run directly:
# python benchmark/profile_restful_api.py --backend lmdeploy --num-prompts 1000

I/O Contract

Inputs

Name                  Type   Required  Description
--backend             str    Yes       Backend type: lmdeploy, lmdeploy-chat, vllm, sglang, trt, etc.
--host                str    No        Server hostname (default: 0.0.0.0)
--port                int    No        Server port (if omitted, the backend's default is used: 23333 for lmdeploy, 8000 for vllm, 30000 for sglang)
--dataset-name        str    No        Dataset: sharegpt, random, or image (default: sharegpt)
--dataset-path        str    No        Path to dataset file
--num-prompts         int    No        Number of prompts (default: 1000)
--request-rate        float  No        Requests per second (default: inf)
--random-input-len    int    No        Input token length for random dataset
--random-output-len   int    No        Output token length for random dataset
--multi               flag   No        Enable multi-rate benchmarking
--request-rate-range  str    No        Rate range as start,stop,step (default: 2,34,2)
--model               str    No        Model name/path for API requests
--image-count         int    No        Images per request for image dataset (default: 1)
--image-resolution    str    No        Resolution: 4k, 1080p, 720p, 360p, or HxW (default: 1080p)
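The --image-resolution values could be turned into pixel dimensions roughly as follows. This is a sketch under stated assumptions: the preset sizes are the conventional 16:9 dimensions for each label, the HxW form is read as height first, and both the PRESETS table and the parse_resolution name are hypothetical.

```python
from typing import Tuple

# Assumed 16:9 pixel sizes (width, height) for the named presets.
PRESETS = {
    "4k": (3840, 2160),
    "1080p": (1920, 1080),
    "720p": (1280, 720),
    "360p": (640, 360),
}


def parse_resolution(value: str) -> Tuple[int, int]:
    """Return (width, height) in pixels for a preset name or an 'HxW' string."""
    key = value.strip().lower()
    if key in PRESETS:
        return PRESETS[key]
    # Custom form: height first, then width, separated by 'x'.
    height, sep, width = key.partition("x")
    if not sep or not height.isdigit() or not width.isdigit():
        raise ValueError(f"unrecognized resolution: {value!r}")
    return int(width), int(height)
```

For instance, "720p" maps to a width of 1280 and a height of 720, while "480x640" is read as height 480 and width 640.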

Outputs

Name            Type  Description
Console output  text  Formatted benchmark results with throughput and latency stats
CSV file        file  Tabular results with per-rate metrics
JSONL file      file  Detailed results including per-request data

Usage Examples

# Benchmark lmdeploy server with ShareGPT
# python benchmark/profile_restful_api.py \
#     --backend lmdeploy \
#     --host 127.0.0.1 \
#     --port 23333 \
#     --dataset-name sharegpt \
#     --dataset-path /path/to/ShareGPT.json \
#     --num-prompts 1000 \
#     --request-rate 10

# Multi-rate sweep for throughput-latency curve
# python benchmark/profile_restful_api.py \
#     --backend lmdeploy \
#     --dataset-name random \
#     --random-input-len 1024 \
#     --random-output-len 512 \
#     --multi \
#     --request-rate-range "2,34,2"

# Vision model benchmarking
# python benchmark/profile_restful_api.py \
#     --backend lmdeploy-chat \
#     --dataset-name image \
#     --image-count 2 \
#     --image-resolution 720p \
#     --random-input-len 128 \
#     --random-output-len 256
