
Principle:OpenBMB UltraFeedback Inference Execution

From Leeroopedia


Knowledge Sources
Domains NLP, Inference, Model_Serving
Last Updated 2023-10-02 00:00 GMT

Overview

A multi-backend inference strategy that generates text completions using API calls, HuggingFace pipelines, or vLLM batch inference depending on the model type.

Description

Inference Execution is the core generation step of the UltraFeedback pipeline. Once the model is loaded and the prompt is formatted, the pipeline dispatches inference to one of three backends:

  1. API Backend: For GPT-4 and GPT-3.5-turbo, inference goes through openai.ChatCompletion.create with temperature=1.0, max_tokens=1024, and top_p=1.0. The API_Caller wrapper handles retries.
  2. HuggingFace Pipeline Backend: For local models loaded via HuggingFace, the pipeline is called with num_return_sequences=1, return_full_text=False, handle_long_generation="hole", temperature=1.0, and max_new_tokens=1024. A StoppingCriteria list enforces model-specific stop tokens. Post-processing strips newlines and truncates at quadruple newline boundaries.
  3. vLLM Batch Backend: The vLLM backend uses SamplingParams with temperature=1, top_p=1, max_tokens=1024, and model-specific stop strings. It performs batch inference across all prompts at once using generator.generate(prompts, sampling_params), which is significantly faster than sequential HF pipeline calls.
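The post-processing step in the HuggingFace branch (stripping newlines and truncating at quadruple-newline boundaries) can be sketched as below. The helper name `postprocess_hf_output` is illustrative, not taken from the UltraFeedback source; the quadruple-newline cut-off is a common heuristic for discarding degenerate continuations after the model has finished its answer.

```python
def postprocess_hf_output(generated_text: str) -> str:
    """Clean a completion returned by a HuggingFace text-generation pipeline.

    Strips surrounding newlines, then truncates at the first run of four
    consecutive newlines, treating it as an end-of-answer boundary.
    (Illustrative sketch; the exact UltraFeedback rules may differ.)
    """
    text = generated_text.strip("\n")
    # Keep only the content before a quadruple-newline boundary, if present.
    if "\n\n\n\n" in text:
        text = text.split("\n\n\n\n")[0]
    return text.strip()


print(postprocess_hf_output("Answer body.\n\n\n\nRepeated filler text"))  # -> Answer body.
```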

All backends use temperature=1.0 (not 0) because the goal is to produce diverse completions for preference annotation, not deterministic outputs.
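The effect of this setting follows from how temperature rescales the model's logits before sampling. A minimal sketch (illustrative toy logits, not real model outputs):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature.

    temperature=1.0 reproduces the model's raw distribution; as the
    temperature approaches 0, probability mass collapses onto the
    argmax token and sampling becomes effectively deterministic.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
p_diverse = softmax_with_temperature(logits, 1.0)   # spread-out mass -> varied samples
p_greedy = softmax_with_temperature(logits, 0.1)    # near-deterministic
print(max(p_diverse), max(p_greedy))
```

Because UltraFeedback wants several distinguishable completions per prompt to annotate, it keeps the spread-out temperature=1.0 distribution rather than sharpening it.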

Usage

Use this principle when generating diverse completions from multiple models for preference dataset construction. The high temperature setting is deliberate: it produces varied outputs that create meaningful preference signals during annotation.

Theoretical Basis

The inference strategy makes a fundamental trade-off between throughput and flexibility:

  • The HF pipeline processes one example at a time (sequential), supporting per-example stop criteria
  • The vLLM backend processes all prompts in a batch (parallel), achieving much higher GPU utilization
  • The API backend is network-bound and rate-limited
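Because the API backend is network-bound and rate-limited, the API_Caller wrapper retries failed calls. The exact retry policy is not documented on this page; a common pattern for this situation is exponential backoff, sketched here with hypothetical names:

```python
import time

def call_with_retries(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry a network-bound call with exponential backoff.

    Illustrative sketch only: the real API_Caller's policy (retry count,
    delay schedule, which errors it catches) is not specified here.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # exhausted retries; surface the last error
            # Wait 1s, 2s, 4s, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is injectable so the backoff can be disabled in tests; in production it defaults to `time.sleep`.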

Pseudo-code Logic:

# Abstract algorithm (pseudo-code)
def generate(backend, prompt_or_prompts, params):
    if backend == "api":
        # prompt_or_prompts is a single (system_prompt, user_prompt) pair;
        # api_caller wraps openai.ChatCompletion.create with retries
        system_prompt, user_prompt = prompt_or_prompts
        return api_caller(system_prompt, user_prompt)
    elif backend == "huggingface":
        # prompt_or_prompts is a single prompt string, processed sequentially
        result = pipeline(prompt_or_prompts, max_new_tokens=1024, temperature=1.0, ...)
        return postprocess(result[0]["generated_text"])
    elif backend == "vllm":
        # prompt_or_prompts is a list of prompts, generated in one batch
        sampling_params = SamplingParams(temperature=1.0, top_p=1.0, max_tokens=1024, stop=stop_tokens)
        results = llm.generate(prompt_or_prompts, sampling_params)
        return [r.outputs[0].text.strip() for r in results]

Related Pages

Implemented By
