Implementation:Sail sg LongSpec VLLM Request Generator

Knowledge Sources	Sail_sg_LongSpec
Domains	NLP, Inference, API
Last Updated	2026-02-14 05:00 GMT

Overview

Concrete tool for sending inference requests to a vLLM HTTP server with automatic retry and context-length overflow handling.

Description

The vllm.py module provides the VLLMRequestGenerator class and supporting functions for interacting with a vLLM inference server via HTTP. It supports three API modes: raw prompt endpoint, OpenAI-compatible completions endpoint, and OpenAI-compatible chat completions endpoint. The class includes automatic retry logic that reduces max_tokens when context length errors occur (up to 10 retries), and handles both single and multi-sample (n>1) responses.

Usage

Import this class when you need to send prompts to a vLLM server for inference during evaluation. Used as the generation backend in the post-processing evaluation pipeline.

Code Reference

Source Location

Repository: Sail_sg_LongSpec
File: longspec/train/data/vllm.py
Lines: 1-141

Signature

def post_http_request(
    api_url: str,
    n: int = 1,
    max_tokens: int = 16,
    temperature: float = 0.0,
    use_beam_search: bool = False,
    stream: bool = False,
    stop: List[str] = ["</s>"],
    **kwargs,
) -> requests.Response:
    """Send HTTP POST request to vLLM server."""

class VLLMRequestGenerator:
    def __init__(
        self,
        api_url: str,
        n: int = 1,
        max_tokens: int = 1024,
        use_beam_search: bool = False,
        stream: bool = False,
        temperature: float = 0.0,
        stop: Union[List[str], ListConfig] = ["</s>"],
        **kwargs,
    ):
        """HTTP client for vLLM inference with retry logic."""

    def __call__(self, prompt: str) -> Union[str, List[str]]:
        """Send prompt and return generated text (or list for n>1)."""

Import

from data.vllm import VLLMRequestGenerator

I/O Contract

Inputs

Name	Type	Required	Description
api_url	str	Yes	vLLM server endpoint URL
prompt	str	Yes	Text prompt for generation
n	int	No	Number of completions to generate (default 1)
max_tokens	int	No	Maximum tokens to generate (default 1024)
temperature	float	No	Sampling temperature (default 0.0)
stop	List[str]	No	Stop sequences (default [""])

Outputs

Name	Type	Description
response	str or List[str]	Generated text (single string if n=1, list if n>1)

Usage Examples

from data.vllm import VLLMRequestGenerator

# Initialize client
generator = VLLMRequestGenerator(
    api_url="http://localhost:8000/v1/completions",
    max_tokens=512,
    temperature=0.0,
    stop=["</s>", "\n\n"],
)

# Generate completion
response = generator("What is 2 + 2? The answer is:")
print(response)  # " 4"

# Multi-sample generation
generator_multi = VLLMRequestGenerator(
    api_url="http://localhost:8000/v1/completions",
    n=5,
    temperature=0.7,
)
responses = generator_multi("Solve: x^2 = 4")
# responses = [" x=2", " x=-2", ...]

Related Pages

Environment:Sail_sg_LongSpec_Training_Environment

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment