Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Sail sg LongSpec VLLM Inference Client

From Leeroopedia
Knowledge Sources
Domains NLP, Inference, API
Last Updated 2026-02-14 05:00 GMT

Overview

Architectural principle for interacting with vLLM inference servers via HTTP with automatic retry logic and context-length overflow handling.

Description

The vLLM Inference Client principle describes the pattern of using an HTTP client to send prompts to a vLLM server for batch inference during evaluation. The client must handle three API modes (raw prompt, OpenAI completions, OpenAI chat completions) and gracefully handle context-length overflow errors by progressively reducing max_tokens. This enables evaluation pipelines to use vLLM-hosted models without tight coupling to the model loading code.

Usage

Apply this principle when the evaluation pipeline needs to query a separately-hosted vLLM inference server rather than loading the model locally. This is common in distributed evaluation setups where the model server and evaluation orchestrator run on different machines.

Theoretical Basis

The retry pattern follows exponential backoff on context overflow:

# Abstract algorithm (NOT real implementation)
def request_with_retry(prompt, max_tokens, max_retry):
    for attempt in range(max_retry):
        response = http_post(prompt, max_tokens)
        if response.ok:
            return parse(response)
        elif "context length" in response.error:
            max_tokens -= 100  # Reduce and retry
        else:
            return empty_response

Key design decisions:

  1. Multi-API support: Single client handles raw, completions, and chat endpoints
  2. Graceful degradation: Reduce max_tokens instead of failing on context overflow
  3. Multi-sample: Support n>1 for self-consistency evaluation

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment