Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Sail sg LongSpec VLLM Request Generator

From Leeroopedia
Knowledge Sources
Domains NLP, Inference, API
Last Updated 2026-02-14 05:00 GMT

Overview

Concrete tool for sending inference requests to a vLLM HTTP server with automatic retry and context-length overflow handling.

Description

The vllm.py module provides the VLLMRequestGenerator class and supporting functions for interacting with a vLLM inference server via HTTP. It supports three API modes: raw prompt endpoint, OpenAI-compatible completions endpoint, and OpenAI-compatible chat completions endpoint. The class includes automatic retry logic that reduces max_tokens when context length errors occur (up to 10 retries), and handles both single and multi-sample (n>1) responses.

Usage

Import this class when you need to send prompts to a vLLM server for inference during evaluation. Used as the generation backend in the post-processing evaluation pipeline.

Code Reference

Source Location

Signature

def post_http_request(
    api_url: str,
    n: int = 1,
    max_tokens: int = 16,
    temperature: float = 0.0,
    use_beam_search: bool = False,
    stream: bool = False,
    stop: List[str] = ["</s>"],
    **kwargs,
) -> requests.Response:
    """Send HTTP POST request to vLLM server."""

class VLLMRequestGenerator:
    def __init__(
        self,
        api_url: str,
        n: int = 1,
        max_tokens: int = 1024,
        use_beam_search: bool = False,
        stream: bool = False,
        temperature: float = 0.0,
        stop: Union[List[str], ListConfig] = ["</s>"],
        **kwargs,
    ):
        """HTTP client for vLLM inference with retry logic."""

    def __call__(self, prompt: str) -> Union[str, List[str]]:
        """Send prompt and return generated text (or list for n>1)."""

Import

from data.vllm import VLLMRequestGenerator

I/O Contract

Inputs

Name Type Required Description
api_url str Yes vLLM server endpoint URL
prompt str Yes Text prompt for generation
n int No Number of completions to generate (default 1)
max_tokens int No Maximum tokens to generate (default 1024)
temperature float No Sampling temperature (default 0.0)
stop List[str] No Stop sequences (default [""])

Outputs

Name Type Description
response str or List[str] Generated text (single string if n=1, list if n>1)

Usage Examples

from data.vllm import VLLMRequestGenerator

# Initialize client
generator = VLLMRequestGenerator(
    api_url="http://localhost:8000/v1/completions",
    max_tokens=512,
    temperature=0.0,
    stop=["</s>", "\n\n"],
)

# Generate completion
response = generator("What is 2 + 2? The answer is:")
print(response)  # " 4"

# Multi-sample generation
generator_multi = VLLMRequestGenerator(
    api_url="http://localhost:8000/v1/completions",
    n=5,
    temperature=0.7,
)
responses = generator_multi("Solve: x^2 = 4")
# responses = [" x=2", " x=-2", ...]

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment