Principle: OpenBMB UltraFeedback Inference Execution
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference, Model_Serving |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A multi-backend inference strategy that generates text completions using API calls, HuggingFace pipelines, or vLLM batch inference depending on the model type.
Description
Inference Execution is the core generation step of the UltraFeedback pipeline. Once the model is loaded and the prompt is formatted, the pipeline dispatches inference to one of three backends:
- API Backend: For GPT-4 and GPT-3.5-turbo, inference goes through openai.ChatCompletion.create with temperature=1.0, max_tokens=1024, and top_p=1.0. The API_Caller wrapper handles retries.
- HuggingFace Pipeline Backend: For local models loaded via HuggingFace, the pipeline is called with num_return_sequences=1, return_full_text=False, handle_long_generation="hole", temperature=1.0, and max_new_tokens=1024. A StoppingCriteria list enforces model-specific stop tokens. Post-processing strips newlines and truncates at quadruple newline boundaries.
- vLLM Batch Backend: The vLLM backend uses SamplingParams with temperature=1, top_p=1, max_tokens=1024, and model-specific stop strings. It performs batch inference across all prompts at once using generator.generate(prompts, sampling_params), which is significantly faster than sequential HF pipeline calls.
All backends use temperature=1.0 (not 0) because the goal is to produce diverse completions for preference annotation, not deterministic outputs.
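The HF pipeline's post-processing step described above can be sketched as follows. The helper name `postprocess` and the exact truncation rule are assumptions inferred from the description (strip surrounding newlines, cut at the first quadruple-newline boundary), not the pipeline's literal implementation:

```python
def postprocess(text: str) -> str:
    """Clean a raw HF pipeline completion (sketch; assumed behavior).

    - Truncate at the first quadruple-newline boundary, which typically
      marks runaway generation past the intended answer.
    - Strip leading/trailing newlines and surrounding whitespace.
    """
    # Cut everything after the first "\n\n\n\n" boundary
    text = text.split("\n\n\n\n", 1)[0]
    return text.strip("\n").strip()
```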
Usage
Use this principle when generating diverse completions from multiple models for preference dataset construction. The high temperature setting is deliberate: it produces varied outputs that create meaningful preference signals during annotation.
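The rationale can be illustrated with a toy softmax over next-token logits; the logit values here are made up for demonstration and are not from any real model:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]  # hypothetical next-token logits
hot = softmax_with_temperature(logits, 1.0)   # diverse: mass spread across tokens
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic: top token dominates
```

At temperature 1.0, lower-ranked tokens retain non-trivial probability mass, which is what produces varied completions across models and therefore meaningful preference pairs.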
Theoretical Basis
The inference strategy makes a fundamental trade-off between throughput and flexibility:
- The HF pipeline processes one example at a time (sequential), supporting per-example stop criteria
- The vLLM backend processes all prompts in a batch (parallel), achieving much higher GPU utilization
- The API backend is network-bound and rate-limited
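Because the API backend is network-bound and rate-limited, the API_Caller wrapper retries failed requests. A minimal retry pattern is sketched below; the backoff schedule, defaults, and exception handling are assumptions, as the actual wrapper's parameters are not documented here:

```python
import time

def with_retries(call, max_retries=5, base_delay=1.0):
    """Retry a network call with exponential backoff (sketch).

    `call` is any zero-argument function; `max_retries` and
    `base_delay` are hypothetical defaults, not UltraFeedback's.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Exponential backoff: base, 2*base, 4*base, ...
            time.sleep(base_delay * (2 ** attempt))
```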
Pseudo-code Logic:
```python
# Abstract algorithm: dispatch generation to the configured backend
def generate(backend, prompts, params):
    if backend == "api":
        # One network call per prompt; the API_Caller wrapper retries on failure
        return [api_caller(params.system_prompt, p) for p in prompts]
    elif backend == "huggingface":
        # Sequential: one pipeline call per prompt, with per-example stop criteria
        results = []
        for p in prompts:
            out = pipeline(p, max_new_tokens=1024, temperature=1.0,
                           num_return_sequences=1, return_full_text=False)
            results.append(postprocess(out[0]["generated_text"]))
        return results
    elif backend == "vllm":
        # Parallel: a single batched call over all prompts
        sampling_params = SamplingParams(temperature=1.0, top_p=1.0,
                                         max_tokens=1024, stop=stop_tokens)
        outputs = llm.generate(prompts, sampling_params)
        return [o.outputs[0].text.strip() for o in outputs]
```