Implementation:Vllm project Vllm Backend Request Func

Knowledge Sources	vllm
Domains	Benchmarking, LLM Serving
Last Updated	2026-02-08 00:00 GMT

Overview

Provides async HTTP request functions for benchmarking multiple LLM serving backends with standardized input/output interfaces.

Description

This Python module implements async HTTP streaming clients for benchmarking LLM serving backends, including TGI, TensorRT-LLM, DeepSpeed-MII, and OpenAI-compatible APIs (vLLM, lmdeploy, sglang, llama.cpp, scalellm). It defines the RequestFuncInput and RequestFuncOutput dataclasses as a common interface for all backends, parses server-sent events (SSE), counts output tokens, and records per-request timing: time to first token (TTFT), inter-token latency (ITL), and time per output token (TPOT). The module is designed to run without vLLM installed, so benchmarks can be executed independently of a vLLM installation.

Usage

This module is imported by vLLM benchmark scripts (such as benchmark_serving.py and benchmark_serving_structured_output.py) to send async requests to various LLM serving backends. Users select a backend via the ASYNC_REQUEST_FUNCS dictionary and pass RequestFuncInput objects to the corresponding async function.
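The dictionary-based selection described above can be sketched without a running server. Everything in this block is a stand-in: StubInput, StubOutput, stub_backend, and STUB_REQUEST_FUNCS are hypothetical names that only mimic the shape of RequestFuncInput, RequestFuncOutput, and ASYNC_REQUEST_FUNCS. Issuing the requests concurrently with asyncio.gather is likewise an assumption about how a caller might drive the functions.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class StubInput:  # hypothetical stand-in for RequestFuncInput
    prompt: str
    output_len: int


@dataclass
class StubOutput:  # hypothetical stand-in for RequestFuncOutput
    generated_text: str = ""
    success: bool = False


async def stub_backend(request: StubInput, pbar=None) -> StubOutput:
    """Pretend backend: upper-cases the prompt instead of calling a server."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return StubOutput(generated_text=request.prompt.upper(), success=True)


# Hypothetical registry with the same shape as ASYNC_REQUEST_FUNCS.
STUB_REQUEST_FUNCS = {"stub": stub_backend}


async def run_all(backend: str, requests: list[StubInput]) -> list[StubOutput]:
    """Look up the backend by name and await all requests concurrently."""
    if backend not in STUB_REQUEST_FUNCS:
        raise ValueError(f"Unknown backend: {backend!r}")
    request_func = STUB_REQUEST_FUNCS[backend]
    # gather returns results in the same order as the awaitables it was given.
    return await asyncio.gather(*(request_func(r) for r in requests))


outputs = asyncio.run(run_all("stub", [StubInput("hello", 8), StubInput("world", 8)]))
print([o.generated_text for o in outputs])
```

Failing fast on an unknown backend name keeps a typo from surfacing later as a bare KeyError in the middle of a benchmark run.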

Code Reference

Source Location

Repository: vllm
File: benchmarks/backend_request_func.py
Lines: 1-657

Signature

@dataclass
class RequestFuncInput:
    prompt: str
    api_url: str
    prompt_len: int
    output_len: int
    model: str
    model_name: str | None = None
    logprobs: int | None = None
    extra_body: dict | None = None
    multi_modal_content: dict | list[dict] | None = None
    ignore_eos: bool = False
    language: str | None = None
    request_id: str | None = None

@dataclass
class RequestFuncOutput:
    generated_text: str = ""
    success: bool = False
    latency: float = 0.0
    output_tokens: int = 0
    ttft: float = 0.0
    itl: list[float] = field(default_factory=list)
    tpot: float = 0.0
    prompt_len: int = 0
    error: str = ""

async def async_request_tgi(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_trt_llm(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_deepspeed_mii(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_openai_completions(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_openai_chat_completions(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_openai_audio(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput

def get_model(pretrained_model_name_or_path: str) -> str
def get_tokenizer(pretrained_model_name_or_path: str, tokenizer_mode: str = "auto", trust_remote_code: bool = False, **kwargs) -> PreTrainedTokenizer | PreTrainedTokenizerFast

ASYNC_REQUEST_FUNCS: dict[str, Callable]

Import

from backend_request_func import (
    ASYNC_REQUEST_FUNCS,
    RequestFuncInput,
    RequestFuncOutput,
    get_tokenizer,
)

I/O Contract

Inputs

Name	Type	Required	Description
prompt	str	Yes	The text prompt to send to the LLM serving backend
api_url	str	Yes	The full API endpoint URL for the backend
prompt_len	int	Yes	Length of the prompt in tokens
output_len	int	Yes	Maximum number of tokens to generate
model	str	Yes	Model identifier string
model_name	str	No	Optional display name for the model
logprobs	int	No	Number of log probabilities to return
extra_body	dict	No	Additional request body parameters
multi_modal_content	dict or list[dict]	No	Multi-modal content (images, audio) for supported backends
ignore_eos	bool	No	Whether to ignore end-of-sequence token (default: False)
language	str	No	Optional language identifier for the request (default: None)
request_id	str	No	Custom request ID sent via x-request-id header

Outputs

Name	Type	Description
generated_text	str	The text generated by the model
success	bool	Whether the request completed successfully
latency	float	Total request latency in seconds
output_tokens	int	Number of output tokens generated
ttft	float	Time to first token in seconds
itl	list[float]	List of inter-token latencies in seconds
tpot	float	Average time per output token in seconds
prompt_len	int	Length of the input prompt in tokens
error	str	Error message if the request failed
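A caller will usually reduce many outputs to summary statistics. The sketch below shows one way to do that under stated assumptions: OutputRecord is a hypothetical mirror of a few RequestFuncOutput fields so the block runs on its own, summarize is an illustrative helper, and the chosen statistics are not necessarily the ones the vLLM benchmark scripts report.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class OutputRecord:  # hypothetical mirror of a subset of RequestFuncOutput
    success: bool = False
    latency: float = 0.0
    output_tokens: int = 0
    ttft: float = 0.0
    itl: list[float] = field(default_factory=list)
    error: str = ""


def summarize(outputs: list[OutputRecord]) -> dict:
    """Aggregate timing fields over successful requests only."""
    ok = [o for o in outputs if o.success]
    # Pool every inter-token gap so long responses are weighted by their length.
    all_itl = [gap for o in ok for gap in o.itl]
    return {
        "completed": len(ok),
        "failed": len(outputs) - len(ok),
        "total_output_tokens": sum(o.output_tokens for o in ok),
        "mean_ttft_s": mean(o.ttft for o in ok) if ok else 0.0,
        "mean_itl_s": mean(all_itl) if all_itl else 0.0,
    }


records = [
    OutputRecord(success=True, latency=1.0, output_tokens=3, ttft=0.5, itl=[0.25, 0.25]),
    OutputRecord(success=True, latency=2.0, output_tokens=5, ttft=1.5, itl=[0.125] * 4),
    OutputRecord(success=False, error="timeout"),
]
summary = summarize(records)
print(summary)
```

Failed requests are counted but excluded from the timing averages, since their ttft and itl fields keep their default values.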

Usage Examples

import asyncio
from backend_request_func import (
    ASYNC_REQUEST_FUNCS,
    RequestFuncInput,
)

# Create a request
request = RequestFuncInput(
    prompt="What is the capital of France?",
    api_url="http://localhost:8000/v1/completions",
    prompt_len=10,
    output_len=128,
    model="meta-llama/Llama-2-7b-hf",
)

# Select the vLLM backend (OpenAI-compatible completions)
request_func = ASYNC_REQUEST_FUNCS["vllm"]

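# Send the request (the optional pbar argument defaults to None)
output = asyncio.run(request_func(request))

# Check success before reading the timing fields; error holds the failure message
if output.success:
    print(f"Generated: {output.generated_text}")
    print(f"TTFT: {output.ttft:.3f}s, Latency: {output.latency:.3f}s")
else:
    print(f"Request failed: {output.error}")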

Related Pages

Environment:Vllm_project_Vllm_Benchmarks
