Implementation:Sail sg LongSpec VLLM Request Generator
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference, API |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Concrete tool for sending inference requests to a vLLM HTTP server with automatic retry and context-length overflow handling.
Description
The vllm.py module provides the VLLMRequestGenerator class and supporting functions for interacting with a vLLM inference server over HTTP. It supports three API modes: the raw prompt endpoint, the OpenAI-compatible completions endpoint, and the OpenAI-compatible chat completions endpoint. When the server reports a context-length error, the class automatically retries with a reduced max_tokens (up to 10 retries). It handles both single-sample and multi-sample (n>1) responses.
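The retry behavior described above can be sketched as follows. This is a minimal illustration, not the module's actual code: `ContextLengthError`, `generate_with_retry`, `fake_send`, and the halving policy are all hypothetical, since the source only states that max_tokens is reduced and that at most 10 retries are made.

```python
from typing import Callable


class ContextLengthError(Exception):
    """Hypothetical stand-in for a server-side context-length rejection."""


def generate_with_retry(
    send: Callable[[str, int], str],
    prompt: str,
    max_tokens: int = 1024,
    max_retries: int = 10,
) -> str:
    """Call `send`, shrinking the token budget after each context-length failure."""
    budget = max_tokens
    # One initial attempt plus up to `max_retries` retries.
    for _ in range(max_retries + 1):
        try:
            return send(prompt, budget)
        except ContextLengthError:
            # Assumed policy: halve the budget. The real module may compute
            # the new value differently (for example from the error message).
            budget = max(1, budget // 2)
    raise RuntimeError(f"still over the context limit after {max_retries} retries")


def fake_send(prompt: str, max_tokens: int) -> str:
    # Pretend the context window is 300 tokens and each word costs one token.
    if len(prompt.split()) + max_tokens > 300:
        raise ContextLengthError("prompt + max_tokens exceeds the context window")
    return f"ok(max_tokens={max_tokens})"


print(generate_with_retry(fake_send, "a b c d", max_tokens=1024))
# prints: ok(max_tokens=256)
```

Passing the sender in as a callable keeps the sketch runnable without a server; in the real class the equivalent call would be the HTTP POST.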
Usage
Import this class when you need to send prompts to a vLLM server for inference during evaluation. It serves as the generation backend in the post-processing evaluation pipeline.
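Because the generator is a plain callable that maps one prompt to one result, an evaluation loop can drive it over a list of prompts. The sketch below is one possible way to do that, not the pipeline's documented method: `run_prompts` and `stub_generator` are hypothetical names, the thread pool is an assumption, and the stub replaces VLLMRequestGenerator so the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Union

Generation = Union[str, List[str]]


def run_prompts(
    generator: Callable[[str], Generation],
    prompts: List[str],
    workers: int = 4,
) -> List[Generation]:
    """Apply a callable generator to every prompt, preserving input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(generator, prompts))


def stub_generator(prompt: str) -> str:
    # Stand-in for a VLLMRequestGenerator instance (assumption for this sketch).
    return f"echo: {prompt}"


outputs = run_prompts(stub_generator, ["q1", "q2", "q3"])
print(outputs)
# prints: ['echo: q1', 'echo: q2', 'echo: q3']
```

Threads suit this workload because each call spends most of its time waiting on the HTTP response rather than computing.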
Code Reference
Source Location
- Repository: Sail_sg_LongSpec
- File: longspec/train/data/vllm.py
- Lines: 1-141
Signature
def post_http_request(
    api_url: str,
    n: int = 1,
    max_tokens: int = 16,
    temperature: float = 0.0,
    use_beam_search: bool = False,
    stream: bool = False,
    stop: List[str] = ["</s>"],
    **kwargs,
) -> requests.Response:
    """Send HTTP POST request to vLLM server."""

class VLLMRequestGenerator:
    def __init__(
        self,
        api_url: str,
        n: int = 1,
        max_tokens: int = 1024,
        use_beam_search: bool = False,
        stream: bool = False,
        temperature: float = 0.0,
        stop: Union[List[str], ListConfig] = ["</s>"],
        **kwargs,
    ):
        """HTTP client for vLLM inference with retry logic."""

    def __call__(self, prompt: str) -> Union[str, List[str]]:
        """Send prompt and return generated text (or list for n>1)."""
Import
from data.vllm import VLLMRequestGenerator
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| api_url | str | Yes | vLLM server endpoint URL |
| prompt | str | Yes | Text prompt for generation |
| n | int | No | Number of completions to generate (default 1) |
| max_tokens | int | No | Maximum tokens to generate (default 1024) |
| temperature | float | No | Sampling temperature (default 0.0) |
| stop | List[str] or ListConfig | No | Stop sequences (default ["</s>"]) |
| use_beam_search | bool | No | Beam search flag from the constructor signature (default False) |
| stream | bool | No | Streaming flag from the constructor signature (default False) |
Outputs
| Name | Type | Description |
|---|---|---|
| response | str or List[str] | Generated text (single string if n=1, list if n>1) |
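The output contract above can be illustrated with a short sketch that extracts generated text from the three response shapes and then collapses a single sample to a bare string. The function names `extract_texts` and `collapse` are hypothetical. The `{"text": [...]}` shape for the raw prompt endpoint is an assumption based on vLLM's demo API server, and the `choices` shapes follow the OpenAI response format; the module's actual parsing may differ.

```python
from typing import Any, Dict, List, Union


def extract_texts(body: Dict[str, Any]) -> List[str]:
    """Pull the generated strings out of a decoded JSON response body."""
    if "text" in body:
        # Assumed raw prompt endpoint shape: {"text": ["...", ...]}
        return list(body["text"])
    texts = []
    for choice in body["choices"]:
        if "message" in choice:
            # OpenAI-style chat completion: text lives under message.content
            texts.append(choice["message"]["content"])
        else:
            # OpenAI-style completion: text sits directly on the choice
            texts.append(choice["text"])
    return texts


def collapse(texts: List[str]) -> Union[str, List[str]]:
    """Return a bare string for one sample, the full list otherwise."""
    return texts[0] if len(texts) == 1 else texts


completion_body = {"choices": [{"text": " 4"}]}
print(repr(collapse(extract_texts(completion_body))))
# prints: ' 4'
```

Callers that want a uniform type can wrap a string result in a list themselves, since the return type depends on the number of samples.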
Usage Examples
from data.vllm import VLLMRequestGenerator

# Initialize the client
generator = VLLMRequestGenerator(
    api_url="http://localhost:8000/v1/completions",
    max_tokens=512,
    temperature=0.0,
    stop=["</s>", "\n\n"],
)

# Generate a single completion (returns a str because n defaults to 1)
response = generator("What is 2 + 2? The answer is:")
print(response)  # e.g. " 4" (actual text depends on the served model)

# Multi-sample generation (returns a list because n > 1)
generator_multi = VLLMRequestGenerator(
    api_url="http://localhost:8000/v1/completions",
    n=5,
    temperature=0.7,
)
responses = generator_multi("Solve: x^2 = 4")
# e.g. responses == [" x=2", " x=-2", ...]