Implementation:Vllm project Vllm Backend Request Func

From Leeroopedia


Knowledge Sources
Domains: Benchmarking, LLM Serving
Last Updated: 2026-02-08 00:00 GMT

Overview

Provides async HTTP request functions for benchmarking multiple LLM serving backends with standardized input/output interfaces.

Description

This Python module implements asynchronous HTTP streaming clients for benchmarking LLM serving backends, including TGI, TensorRT-LLM, DeepSpeed-MII, and OpenAI-compatible APIs (vLLM, lmdeploy, sglang, llama.cpp, scalellm). It defines the RequestFuncInput and RequestFuncOutput dataclasses, which give every backend a common interface, and it handles server-sent events (SSE) parsing, token counting, and latency metrics: time to first token (TTFT), time per output token (TPOT), and inter-token latency (ITL). The module is designed to run without vLLM installed, so benchmarks can be executed independently.
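
As a rough illustration of the SSE parsing and latency bookkeeping described above, the sketch below decodes `data:` lines from a stream and derives TTFT and ITL from chunk arrival times. The helper names `parse_sse_line` and `latency_from_timestamps` and the payload shape are assumptions made for this example; they are not taken from the module itself.

```python
import json
from typing import Optional


def parse_sse_line(raw: bytes) -> Optional[dict]:
    """Decode one SSE line; return its JSON payload, or None for non-data lines and [DONE]."""
    line = raw.decode("utf-8").strip()
    if not line.startswith("data:"):
        return None
    body = line[len("data:"):].strip()
    if not body or body == "[DONE]":
        return None
    return json.loads(body)


def latency_from_timestamps(start: float, arrivals: list[float]) -> tuple[float, list[float]]:
    """Return (ttft, itl) from the request start time and per-chunk arrival times."""
    if not arrivals:
        return 0.0, []
    # TTFT: delay between sending the request and receiving the first chunk.
    ttft = arrivals[0] - start
    # ITL: gap between each pair of consecutive chunks.
    itl = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    return ttft, itl


chunk = parse_sse_line(b'data: {"choices": [{"text": "Paris"}]}\n')
ttft, itl = latency_from_timestamps(0.0, [0.25, 0.5, 1.0])
```

In a real client the arrival times would typically come from `time.perf_counter()` calls made as each chunk is read from the response stream.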

Usage

This module is imported by vLLM benchmark scripts (such as benchmark_serving.py and benchmark_serving_structured_output.py) to send async requests to various LLM serving backends. Users select a backend via the ASYNC_REQUEST_FUNCS dictionary and pass RequestFuncInput objects to the corresponding async function.
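
The dictionary-based selection described above can be sketched with a stand-in registry. Everything here (`FAKE_REQUEST_FUNCS`, `select_request_func`, and the two fake coroutines) is hypothetical and only mirrors the lookup pattern; the real ASYNC_REQUEST_FUNCS maps backend names to the functions listed under Signature.

```python
import asyncio
from typing import Awaitable, Callable


async def fake_openai_completions(prompt: str) -> str:
    """Stand-in for an OpenAI-compatible request function."""
    return f"completions:{prompt}"


async def fake_tgi(prompt: str) -> str:
    """Stand-in for a TGI request function."""
    return f"tgi:{prompt}"


# Hypothetical registry: several backend names may share one request function.
FAKE_REQUEST_FUNCS: dict[str, Callable[[str], Awaitable[str]]] = {
    "vllm": fake_openai_completions,
    "lmdeploy": fake_openai_completions,
    "tgi": fake_tgi,
}


def select_request_func(backend: str) -> Callable[[str], Awaitable[str]]:
    """Look up a request function, failing with the list of valid names."""
    if backend not in FAKE_REQUEST_FUNCS:
        valid = ", ".join(sorted(FAKE_REQUEST_FUNCS))
        raise ValueError(f"Unknown backend {backend!r}; expected one of: {valid}")
    return FAKE_REQUEST_FUNCS[backend]


result = asyncio.run(select_request_func("vllm")("hi"))
```

Validating the backend name before dispatch gives a clearer message than a bare KeyError when a benchmark is launched with a misspelled backend.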

Code Reference

Source Location

Signature

@dataclass
class RequestFuncInput:
    prompt: str
    api_url: str
    prompt_len: int
    output_len: int
    model: str
    model_name: str | None = None
    logprobs: int | None = None
    extra_body: dict | None = None
    multi_modal_content: dict | list[dict] | None = None
    ignore_eos: bool = False
    language: str | None = None
    request_id: str | None = None

@dataclass
class RequestFuncOutput:
    generated_text: str = ""
    success: bool = False
    latency: float = 0.0
    output_tokens: int = 0
    ttft: float = 0.0
    itl: list[float] = field(default_factory=list)
    tpot: float = 0.0
    prompt_len: int = 0
    error: str = ""

async def async_request_tgi(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_trt_llm(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_deepspeed_mii(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_openai_completions(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_openai_chat_completions(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput
async def async_request_openai_audio(request_func_input: RequestFuncInput, pbar: tqdm | None = None) -> RequestFuncOutput

def get_model(pretrained_model_name_or_path: str) -> str
def get_tokenizer(pretrained_model_name_or_path: str, tokenizer_mode: str = "auto", trust_remote_code: bool = False, **kwargs) -> PreTrainedTokenizer | PreTrainedTokenizerFast

ASYNC_REQUEST_FUNCS: dict[str, Callable]

Import

from backend_request_func import (
    ASYNC_REQUEST_FUNCS,
    RequestFuncInput,
    RequestFuncOutput,
    get_tokenizer,
)

I/O Contract

Inputs

Name                 Type                Required  Description
prompt               str                 Yes       The text prompt to send to the LLM serving backend
api_url              str                 Yes       The full API endpoint URL for the backend
prompt_len           int                 Yes       Length of the prompt in tokens
output_len           int                 Yes       Maximum number of tokens to generate
model                str                 Yes       Model identifier string
model_name           str                 No        Optional display name for the model
logprobs             int                 No        Number of log probabilities to return
extra_body           dict                No        Additional request body parameters
multi_modal_content  dict or list[dict]  No        Multi-modal content (images, audio) for supported backends
ignore_eos           bool                No        Whether to ignore the end-of-sequence token (default: False)
language             str                 No        Optional language value (default: None)
request_id           str                 No        Custom request ID sent via the x-request-id header

Outputs

Name            Type         Description
generated_text  str          The text generated by the model
success         bool         Whether the request completed successfully
latency         float        Total request latency in seconds
output_tokens   int          Number of output tokens generated
ttft            float        Time to first token in seconds
itl             list[float]  List of inter-token latencies in seconds
tpot            float        Average time per output token in seconds
prompt_len      int          Length of the input prompt in tokens
error           str          Error message if the request failed
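
To show how the output fields might be consumed, here is a minimal sketch that aggregates a batch of results. The `Output` class is a stand-in that mirrors only some RequestFuncOutput fields, `summarize` is a hypothetical helper, and the TPOT formula (decode time divided by the number of tokens after the first) is an assumed definition rather than one stated on this page.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Output:
    """Stand-in mirroring a few RequestFuncOutput fields."""
    success: bool = False
    latency: float = 0.0
    output_tokens: int = 0
    ttft: float = 0.0
    itl: list[float] = field(default_factory=list)
    error: str = ""


def summarize(outputs: list[Output]) -> dict[str, float]:
    """Aggregate successful requests; failed ones are excluded from the means."""
    ok = [o for o in outputs if o.success]
    if not ok:
        return {"completed": 0, "mean_ttft": 0.0, "mean_tpot": 0.0}
    # Assumed TPOT definition: time after the first token, spread over the remaining tokens.
    tpots = [
        (o.latency - o.ttft) / (o.output_tokens - 1)
        for o in ok
        if o.output_tokens > 1
    ]
    return {
        "completed": len(ok),
        "mean_ttft": mean(o.ttft for o in ok),
        "mean_tpot": mean(tpots) if tpots else 0.0,
    }


outputs = [
    Output(success=True, latency=1.0, output_tokens=5, ttft=0.5),
    Output(success=True, latency=2.0, output_tokens=3, ttft=1.0),
    Output(success=False, error="timeout"),
]
summary = summarize(outputs)
```

Requests that produced a single token are skipped in the TPOT mean because the assumed formula would otherwise divide by zero.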

Usage Examples

import asyncio
from backend_request_func import (
    ASYNC_REQUEST_FUNCS,
    RequestFuncInput,
)

# Create a request
request = RequestFuncInput(
    prompt="What is the capital of France?",
    api_url="http://localhost:8000/v1/completions",
    prompt_len=10,
    output_len=128,
    model="meta-llama/Llama-2-7b-hf",
)

# Select the vLLM backend (OpenAI-compatible completions)
request_func = ASYNC_REQUEST_FUNCS["vllm"]

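# Send the request (requires a server listening at api_url)
output = asyncio.run(request_func(request))

# Check success before reading metrics; failed requests carry an error message
if output.success:
    print(f"Generated: {output.generated_text}")
    print(f"TTFT: {output.ttft:.3f}s, Latency: {output.latency:.3f}s")
else:
    print(f"Request failed: {output.error}")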
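
Benchmarks usually issue many requests at once rather than one. The sketch below shows one way to fan out requests with a concurrency cap, using a fake request function so it runs without a server. `FakeInput`, `FakeOutput`, `fake_request_func`, and `run_all` are all hypothetical stand-ins; only the `(request_func_input, pbar=None)` call shape is taken from the signatures above.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeInput:
    """Stand-in for RequestFuncInput with only the fields used here."""
    prompt: str
    output_len: int


@dataclass
class FakeOutput:
    """Stand-in for RequestFuncOutput with only the fields used here."""
    generated_text: str = ""
    success: bool = False


async def fake_request_func(request_func_input: FakeInput, pbar=None) -> FakeOutput:
    """Hypothetical backend: upper-cases the prompt instead of calling a server."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FakeOutput(generated_text=request_func_input.prompt.upper(), success=True)


async def run_all(requests: list[FakeInput], max_concurrency: int) -> list[FakeOutput]:
    """Send all requests concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(req: FakeInput) -> FakeOutput:
        async with semaphore:
            return await fake_request_func(req)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(limited(r) for r in requests))


requests = [FakeInput(prompt=p, output_len=16) for p in ("a", "b", "c")]
results = asyncio.run(run_all(requests, max_concurrency=2))
```

Swapping `fake_request_func` for an entry from ASYNC_REQUEST_FUNCS, and the stand-in dataclasses for the real ones, would be the natural next step against a live server.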
