Principle:Sail sg LongSpec VLLM Inference Client
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference, API |
| Last Updated | 2026-02-14 05:00 GMT |
Overview
Architectural principle for interacting with vLLM inference servers via HTTP with automatic retry logic and context-length overflow handling.
Description
The vLLM Inference Client principle describes the pattern of using an HTTP client to send prompts to a vLLM server for batch inference during evaluation. The client must handle three API modes (raw prompt, OpenAI completions, OpenAI chat completions) and gracefully handle context-length overflow errors by progressively reducing max_tokens. This enables evaluation pipelines to use vLLM-hosted models without tight coupling to the model loading code.
Usage
Apply this principle when the evaluation pipeline needs to query a separately-hosted vLLM inference server rather than loading the model locally. This is common in distributed evaluation setups where the model server and evaluation orchestrator run on different machines.
Theoretical Basis
The retry pattern follows exponential backoff on context overflow:
# Abstract algorithm (NOT real implementation)
def request_with_retry(prompt, max_tokens, max_retry):
for attempt in range(max_retry):
response = http_post(prompt, max_tokens)
if response.ok:
return parse(response)
elif "context length" in response.error:
max_tokens -= 100 # Reduce and retry
else:
return empty_response
Key design decisions:
- Multi-API support: Single client handles raw, completions, and chat endpoints
- Graceful degradation: Reduce max_tokens instead of failing on context overflow
- Multi-sample: Support n>1 for self-consistency evaluation