Principle: OpenBMB UltraFeedback Inference Execution
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference, Model_Serving |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A multi-backend inference strategy that generates text completions using API calls, HuggingFace pipelines, or vLLM batch inference depending on the model type.
Description
Inference Execution is the core generation step of the UltraFeedback pipeline. Once the model is loaded and the prompt is formatted, the pipeline dispatches inference to one of three backends:
- API Backend: For GPT-4 and GPT-3.5-turbo, inference goes through openai.ChatCompletion.create with temperature=1.0, max_tokens=1024, and top_p=1.0. The API_Caller wrapper handles retries.
- HuggingFace Pipeline Backend: For local models loaded via HuggingFace, the pipeline is called with num_return_sequences=1, return_full_text=False, handle_long_generation="hole", temperature=1.0, and max_new_tokens=1024. A StoppingCriteria list enforces model-specific stop tokens. Post-processing strips newlines and truncates at quadruple newline boundaries.
- vLLM Batch Backend: The vLLM backend uses SamplingParams with temperature=1, top_p=1, max_tokens=1024, and model-specific stop strings. It performs batch inference across all prompts at once using generator.generate(prompts, sampling_params), which is significantly faster than sequential HF pipeline calls.
All backends use temperature=1.0 (not 0) because the goal is to produce diverse completions for preference annotation, not deterministic outputs.
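The HF pipeline's post-processing step described above can be sketched as follows. The helper name `postprocess` and the exact truncation rule are assumptions inferred from the description (strip surrounding newlines, cut at the first quadruple-newline boundary), not the pipeline's literal implementation:

```python
def postprocess(text: str) -> str:
    """Clean a raw HF pipeline completion (sketch; assumed behavior).

    - Truncate at the first quadruple-newline boundary, which typically
      marks runaway generation past the intended answer.
    - Strip leading/trailing newlines and surrounding whitespace.
    """
    # Cut everything after the first "\n\n\n\n" boundary
    text = text.split("\n\n\n\n", 1)[0]
    return text.strip("\n").strip()
```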
Usage
Use this principle when generating diverse completions from multiple models for preference dataset construction. The high temperature setting is deliberate: it produces varied outputs that create meaningful preference signals during annotation.
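The rationale can be illustrated with a toy softmax over next-token logits; the logit values here are made up for demonstration and are not from any real model:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to sampling probabilities at a given temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]  # hypothetical next-token logits
hot = softmax_with_temperature(logits, 1.0)   # diverse: mass spread across tokens
cold = softmax_with_temperature(logits, 0.1)  # near-deterministic: top token dominates
```

At temperature 1.0, lower-ranked tokens retain non-trivial probability mass, which is what produces varied completions across models and therefore meaningful preference pairs.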
Theoretical Basis
The inference strategy makes a fundamental trade-off between throughput and flexibility:
- The HF pipeline processes one example at a time (sequential), supporting per-example stop criteria
- The vLLM backend processes all prompts in a batch (parallel), achieving much higher GPU utilization
- The API backend is network-bound and rate-limited
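Because the API backend is network-bound and rate-limited, the API_Caller wrapper retries failed requests. A minimal retry pattern is sketched below; the backoff schedule, defaults, and exception handling are assumptions, as the actual wrapper's parameters are not documented here:

```python
import time

def with_retries(call, max_retries=5, base_delay=1.0):
    """Retry a network call with exponential backoff (sketch).

    `call` is any zero-argument function; `max_retries` and
    `base_delay` are hypothetical defaults, not UltraFeedback's.
    """
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            # Exponential backoff: base, 2*base, 4*base, ...
            time.sleep(base_delay * (2 ** attempt))
```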
Pseudo-code Logic:
```python
# Abstract algorithm: dispatch generation to the configured backend
def generate(backend, prompts, params):
    if backend == "api":
        # One network call per prompt; the API_Caller wrapper retries on failure
        return [api_caller(params.system_prompt, p) for p in prompts]
    elif backend == "huggingface":
        # Sequential: one pipeline call per prompt, with per-example stop criteria
        results = []
        for p in prompts:
            out = pipeline(p, max_new_tokens=1024, temperature=1.0,
                           num_return_sequences=1, return_full_text=False)
            results.append(postprocess(out[0]["generated_text"]))
        return results
    elif backend == "vllm":
        # Parallel: a single batched call over all prompts
        sampling_params = SamplingParams(temperature=1.0, top_p=1.0,
                                         max_tokens=1024, stop=stop_tokens)
        outputs = llm.generate(prompts, sampling_params)
        return [o.outputs[0].text.strip() for o in outputs]
```