Principle:Sail sg LongSpec VLLM Inference Client

Knowledge Sources	Sail_sg_LongSpec
Domains	NLP, Inference, API
Last Updated	2026-02-14 05:00 GMT

Overview

Architectural principle for interacting with vLLM inference servers via HTTP with automatic retry logic and context-length overflow handling.

Description

The vLLM Inference Client principle describes the pattern of using an HTTP client to send prompts to a vLLM server for batch inference during evaluation. The client must handle three API modes (raw prompt, OpenAI completions, OpenAI chat completions) and gracefully handle context-length overflow errors by progressively reducing max_tokens. This enables evaluation pipelines to use vLLM-hosted models without tight coupling to the model loading code.

Usage

Apply this principle when the evaluation pipeline needs to query a separately-hosted vLLM inference server rather than loading the model locally. This is common in distributed evaluation setups where the model server and evaluation orchestrator run on different machines.

Theoretical Basis

The retry pattern follows exponential backoff on context overflow:

# Abstract algorithm (NOT real implementation)
def request_with_retry(prompt, max_tokens, max_retry):
    for attempt in range(max_retry):
        response = http_post(prompt, max_tokens)
        if response.ok:
            return parse(response)
        elif "context length" in response.error:
            max_tokens -= 100  # Reduce and retry
        else:
            return empty_response

Key design decisions:

Multi-API support: Single client handles raw, completions, and chat endpoints
Graceful degradation: Reduce max_tokens instead of failing on context overflow
Multi-sample: Support n>1 for self-consistency evaluation

Related Pages

Implementation:Sail_sg_LongSpec_VLLM_Request_Generator

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment