Implementation:Vllm project Vllm Backend Request Func
| Knowledge Sources | |
|---|---|
| Domains | Benchmarking, LLM Serving |
| Last Updated | 2026-02-08 00:00 GMT |
Overview
Provides async HTTP request functions for benchmarking multiple LLM serving backends with standardized input/output interfaces.
Description
This Python module implements async HTTP streaming clients for benchmarking LLM serving backends, including TGI, TensorRT-LLM, DeepSpeed-MII, and OpenAI-compatible APIs (vLLM, lmdeploy, sglang, llama.cpp, scalellm). It defines the RequestFuncInput and RequestFuncOutput dataclasses, which give every backend the same interface. It also handles Server-Sent Events (SSE) parsing and token counting, and records three latency metrics: time to first token (TTFT), time per output token (TPOT), and inter-token latency (ITL). The module is designed to run without vLLM installed, so benchmarks can be executed independently.
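To make the SSE parsing and latency metrics concrete, here is a minimal sketch of how `data:` lines could be decoded and how TTFT and ITL could be derived from chunk arrival times. It is not the module's code: `parse_sse_line` and `compute_latencies` are hypothetical helper names, the chunk shape is assumed to follow the OpenAI completions streaming format, and timestamps are supplied by hand so the example runs without a server.

```python
from __future__ import annotations

import json


def parse_sse_line(line: str) -> dict | None:
    """Return the JSON payload of an SSE `data:` line, or None for
    blank lines, non-data lines, and the `[DONE]` sentinel."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if not body or body == "[DONE]":
        return None
    return json.loads(body)


def compute_latencies(start: float, arrivals: list[float]) -> tuple[float, list[float]]:
    """Given the request start time and per-chunk arrival times (seconds),
    return (ttft, itl): time to the first chunk, then gaps between chunks."""
    if not arrivals:
        return 0.0, []
    ttft = arrivals[0] - start
    itl = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return ttft, itl


# Simulated stream of (arrival time, raw SSE line) pairs; values are made up.
stream = [
    (0.25, 'data: {"choices": [{"text": "Par"}]}'),
    (0.30, 'data: {"choices": [{"text": "is"}]}'),
    (0.35, "data: [DONE]"),
]

text, arrivals = "", []
for ts, raw in stream:
    chunk = parse_sse_line(raw)
    if chunk is None:
        continue  # skip sentinel and non-data lines
    text += chunk["choices"][0]["text"]
    arrivals.append(ts)

ttft, itl = compute_latencies(0.0, arrivals)
```

In a real client the arrival times would come from a monotonic clock such as `time.perf_counter()` read as each chunk is received.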
Usage
This module is imported by vLLM benchmark scripts (such as benchmark_serving.py and benchmark_serving_structured_output.py) to send async requests to various LLM serving backends. Users select a backend via the ASYNC_REQUEST_FUNCS dictionary and pass RequestFuncInput objects to the corresponding async function.
Code Reference
Source Location
- Repository: vllm
- File: benchmarks/backend_request_func.py
- Lines: 1-657
Signature
@dataclass
class RequestFuncInput:
    prompt: str
    api_url: str
    prompt_len: int
    output_len: int
    model: str
    model_name: str | None = None
    logprobs: int | None = None
    extra_body: dict | None = None
    multi_modal_content: dict | list[dict] | None = None
    ignore_eos: bool = False
    language: str | None = None
    request_id: str | None = None

@dataclass
class RequestFuncOutput:
    generated_text: str = ""
    success: bool = False
    latency: float = 0.0
    output_tokens: int = 0
    ttft: float = 0.0
    itl: list[float] = field(default_factory=list)
    tpot: float = 0.0
    prompt_len: int = 0
    error: str = ""

async def async_request_tgi(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_trt_llm(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_deepspeed_mii(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_openai_completions(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_openai_chat_completions(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_openai_audio(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput

def get_model(pretrained_model_name_or_path: str) -> str
def get_tokenizer(pretrained_model_name_or_path: str, tokenizer_mode: str = "auto", trust_remote_code: bool = False, **kwargs) -> PreTrainedTokenizer | PreTrainedTokenizerFast

ASYNC_REQUEST_FUNCS: dict[str, Callable]
Import
from backend_request_func import (
    ASYNC_REQUEST_FUNCS,
    RequestFuncInput,
    RequestFuncOutput,
    get_tokenizer,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| prompt | str | Yes | The text prompt to send to the LLM serving backend |
| api_url | str | Yes | The full API endpoint URL for the backend |
| prompt_len | int | Yes | Length of the prompt in tokens |
| output_len | int | Yes | Maximum number of tokens to generate |
| model | str | Yes | Model identifier string |
| model_name | str | No | Model name sent in the request payload when set (for example, a served model name); otherwise model is used |
| logprobs | int | No | Number of log probabilities to return |
| extra_body | dict | No | Additional request body parameters |
| multi_modal_content | dict or list[dict] | No | Multi-modal content (images, audio) for supported backends |
| ignore_eos | bool | No | Whether to ignore end-of-sequence token (default: False) |
| language | str | No | Optional language identifier for the request (default: None) |
| request_id | str | No | Custom request ID sent via x-request-id header |
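The sketch below illustrates how these input fields might map onto an OpenAI-style streaming completions request. It is an approximation under stated assumptions, not the module's implementation: `RequestInputSketch` is a local stand-in for RequestFuncInput, `build_completions_payload` is a hypothetical helper, the payload keys follow the public OpenAI completions API, and passing `ignore_eos` in the body assumes a backend that accepts that extension.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RequestInputSketch:
    """Local stand-in carrying a subset of the RequestFuncInput fields."""
    prompt: str
    output_len: int
    model: str
    model_name: str | None = None
    logprobs: int | None = None
    extra_body: dict | None = None
    ignore_eos: bool = False
    request_id: str | None = None


def build_completions_payload(req: RequestInputSketch) -> tuple[dict, dict]:
    """Return (payload, headers) for a streaming completions request."""
    payload = {
        # Assumption: model_name, when set, takes precedence over model.
        "model": req.model_name or req.model,
        "prompt": req.prompt,
        "max_tokens": req.output_len,
        "logprobs": req.logprobs,
        "stream": True,
    }
    if req.ignore_eos:
        # Assumption: the backend accepts this non-standard key.
        payload["ignore_eos"] = True
    if req.extra_body:
        # Caller-supplied keys are merged last, so they can override defaults.
        payload.update(req.extra_body)

    headers = {}
    if req.request_id:
        headers["x-request-id"] = req.request_id
    return payload, headers


req = RequestInputSketch(
    prompt="hi",
    output_len=8,
    model="m",
    extra_body={"temperature": 0.0},
    request_id="abc",
)
payload, headers = build_completions_payload(req)
```

Merging `extra_body` last is a design choice in this sketch: it lets a benchmark override any default key without changing the helper.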
Outputs
| Name | Type | Description |
|---|---|---|
| generated_text | str | The text generated by the model |
| success | bool | Whether the request completed successfully |
| latency | float | Total request latency in seconds |
| output_tokens | int | Number of output tokens generated |
| ttft | float | Time to first token in seconds |
| itl | list[float] | List of inter-token latencies in seconds |
| tpot | float | Average time per output token in seconds |
| prompt_len | int | Length of the input prompt in tokens |
| error | str | Error message if the request failed |
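A benchmark typically collects many outputs and reduces them to summary statistics. The following sketch shows one way that could look; `OutputSketch` is a local stand-in for a subset of RequestFuncOutput, `summarize` is a hypothetical helper, and the sample values are invented for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from statistics import mean


@dataclass
class OutputSketch:
    """Local stand-in carrying a subset of the RequestFuncOutput fields."""
    success: bool = False
    latency: float = 0.0
    output_tokens: int = 0
    ttft: float = 0.0
    itl: list[float] = field(default_factory=list)
    error: str = ""


def summarize(outputs: list[OutputSketch]) -> dict:
    """Aggregate per-request results; failed requests only count as failures."""
    ok = [o for o in outputs if o.success]
    all_itl = [gap for o in ok for gap in o.itl]
    return {
        "completed": len(ok),
        "failed": len(outputs) - len(ok),
        "total_output_tokens": sum(o.output_tokens for o in ok),
        "mean_ttft_s": mean(o.ttft for o in ok) if ok else 0.0,
        "mean_itl_s": mean(all_itl) if all_itl else 0.0,
    }


# Invented sample data: two successful requests and one failure.
sample = [
    OutputSketch(success=True, latency=1.0, output_tokens=3, ttft=0.2, itl=[0.1, 0.3]),
    OutputSketch(success=True, latency=1.5, output_tokens=5, ttft=0.4, itl=[0.2]),
    OutputSketch(success=False, error="timeout"),
]
summary = summarize(sample)
```

Excluding failed requests from the latency averages keeps their default `0.0` values from skewing the results.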
Usage Examples
import asyncio

from backend_request_func import (
    ASYNC_REQUEST_FUNCS,
    RequestFuncInput,
)

# Create a request
request = RequestFuncInput(
    prompt="What is the capital of France?",
    api_url="http://localhost:8000/v1/completions",
    prompt_len=10,
    output_len=128,
    model="meta-llama/Llama-2-7b-hf",
)

# Select the vLLM backend (OpenAI-compatible completions)
request_func = ASYNC_REQUEST_FUNCS["vllm"]

# Send the request (requires a server listening at api_url)
output = asyncio.run(request_func(request))

# Check success before reading metrics; error holds the failure message
if output.success:
    print(f"Generated: {output.generated_text}")
    print(f"TTFT: {output.ttft:.3f}s, Latency: {output.latency:.3f}s")
else:
    print(f"Request failed: {output.error}")
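Because the request functions are coroutines, several requests can be issued concurrently. The sketch below shows the general pattern with `asyncio.gather`; `fake_request_func` and `FakeOutput` are hypothetical stand-ins so the example runs without a server, and a real benchmark would call an entry of ASYNC_REQUEST_FUNCS with RequestFuncInput objects instead.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeOutput:
    """Stand-in for RequestFuncOutput with only two fields."""
    generated_text: str = ""
    success: bool = False


async def fake_request_func(prompt: str) -> FakeOutput:
    # Hypothetical stand-in for a backend request function:
    # simulate network latency, then echo the prompt in upper case.
    await asyncio.sleep(0.01)
    return FakeOutput(generated_text=prompt.upper(), success=True)


async def run_all(prompts: list[str]) -> list[FakeOutput]:
    # Start every request, then wait for all of them;
    # gather returns results in the same order as the inputs.
    tasks = [fake_request_func(p) for p in prompts]
    return await asyncio.gather(*tasks)


results = asyncio.run(run_all(["a", "b", "c"]))
```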