Implementation:OpenBMB UltraFeedback Multi Backend Inference
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A concrete tool for generating text completions across the OpenAI API, HuggingFace pipeline, and vLLM backends within the UltraFeedback comparison-data generation pipeline.
Description
The inference execution is split across two modules:
main.py (HuggingFace backend): The instruction_completion function (L157-222) handles both API and local inference. For API models, it calls generator(system_prompt, user_prompt) which invokes API_Caller.__call__ → openai.ChatCompletion.create. For local models, it calls the HuggingFace pipeline with generation parameters and applies post-processing (strip newlines, split on quadruple newlines).
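A minimal sketch of this dispatch, assuming an API_Caller instance for API models and a transformers text-generation pipeline otherwise (variable names such as principle_prompt, instruction, and prompt follow the signatures below; the exact post-processing in main.py may differ):

# Illustrative sketch of the per-example dispatch in main.py (not a verbatim copy)
if isinstance(generator, API_Caller):
    # API models: the principle prompt goes into the system role
    response = generator(system_prompt=principle_prompt, user_prompt=instruction)
else:
    # Local models: HuggingFace text-generation pipeline call
    outputs = generator(prompt, num_return_sequences=1, return_full_text=False,
                        handle_long_generation="hole", temperature=1.0, top_p=1.0,
                        max_new_tokens=1024, do_sample=True,
                        stopping_criteria=stopping_criteria)
    # Post-processing: strip newlines, then keep the text before a quadruple newline
    response = outputs[0]["generated_text"].strip("\n").split("\n\n\n\n")[0].strip()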
main_vllm.py (vLLM backend): The instruction_completion function (L161-190) takes an entire dataset, constructs SamplingParams with model-specific stop tokens, and runs generator.generate(dataset["prompt"], sampling_params) for batch inference. Responses are stripped and cleaned of tokens, then merged back into the dataset.
Usage
The HF backend is called via dataset.map(instruction_completion) for sequential per-example processing. The vLLM backend is called as instruction_completion(dataset) for batch processing of the entire dataset at once.
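The two entry points differ only in granularity; for example:

# HF backend: one completion per example, appended inside the map callback
dataset = dataset.map(instruction_completion, desc=f"{model_type} on {subset}")

# vLLM backend: the whole dataset is passed in and returned with responses merged
dataset = instruction_completion(dataset)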
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/comparison_data_generation/main.py (Lines 99-118 for API_Caller.__call__, Lines 207-213 for HF inference)
- File: src/comparison_data_generation/main_vllm.py (Lines 161-190 for vLLM batch inference)
Signature
# HuggingFace backend: instruction_completion (main.py:L157-222)
@torch.no_grad()
def instruction_completion(example: Dict) -> Dict:
    """Generates a completion for a single example using the loaded generator.

    For API models: generator(system_prompt=principle_prompt, user_prompt=instruction)
    For local models: generator(prompt, num_return_sequences=1, return_full_text=False,
                                handle_long_generation="hole", temperature=1.0, top_p=1.0,
                                max_new_tokens=1024, do_sample=True,
                                stopping_criteria=stopping_criteria)

    Appends the result to example["completions"] as a dict with keys:
    model, principle, custom_system_prompt, response.
    """
    ...

# vLLM backend: instruction_completion (main_vllm.py:L161-190)
@torch.no_grad()
def instruction_completion(dataset: datasets.Dataset) -> datasets.Dataset:
    """Batch inference over the entire dataset using vLLM.

    Constructs SamplingParams(temperature=1, top_p=1, max_tokens=1024, stop=stop),
    calls generator.generate(dataset["prompt"], sampling_params), and merges the
    responses back into the dataset's completions.
    """
    ...

# API_Caller.__call__ (main.py:L99-118)
def __call__(self, system_prompt: str, user_prompt: str) -> str:
    """Calls openai.ChatCompletion.create with temperature=1, max_tokens=1024, top_p=1.
    Retries up to 20 times on failure."""
    ...
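A hedged sketch of the retry behaviour described above, assuming the legacy openai<1.0 ChatCompletion interface, that the model name is stored on the caller, and an empty-string fallback after exhausting retries (the real main.py may structure this differently):

# Illustrative sketch of API_Caller.__call__ (not a verbatim copy of main.py)
def __call__(self, system_prompt: str, user_prompt: str) -> str:
    for _ in range(20):  # retry up to 20 times on failure
        try:
            result = openai.ChatCompletion.create(
                model=self.model,  # assumption: model name stored on the caller
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
                temperature=1,
                max_tokens=1024,
                top_p=1,
            )
            return result["choices"][0]["message"]["content"]
        except Exception:
            continue
    return ""  # assumption: fallback after all retries fail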
Import
# HuggingFace backend
from typing import Dict
import torch
import openai
from transformers import pipeline
# vLLM backend
import datasets
import torch
from vllm import LLM, SamplingParams
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| example / dataset | Dict / datasets.Dataset | Yes | Single example (HF) or full dataset (vLLM) with prompt field |
| generator | Union[API_Caller, pipeline, LLM] | Yes | Loaded model (global variable) |
| model_type | str | Yes | Global variable: model identifier for dispatch logic |
| stopping_criteria | StoppingCriteriaList | No | HF backend only: model-specific stop token criteria (see the sketch after this table) |
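For the HF backend, the stop-token criteria could look like the following sketch. This is an illustration only: the repository's actual StoppingCriteria subclass and stop tokens are assumptions, and tokenizer is assumed to be the loaded tokenizer.

import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnTokens(StoppingCriteria):
    """Stops generation once the last generated token is a designated stop token."""
    def __init__(self, stop_token_ids):
        self.stop_token_ids = set(stop_token_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        return input_ids[0, -1].item() in self.stop_token_ids

stopping_criteria = StoppingCriteriaList([StopOnTokens([tokenizer.eos_token_id])])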
Outputs
| Name | Type | Description |
|---|---|---|
| example["completions"] | List[Dict] | Appended with dict: {model, principle, custom_system_prompt, response} |
| response | str | Generated text, stripped and cleaned of stop tokens |
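Each appended entry carries the keys listed above; a sketch of its shape (the value expressions are assumptions about the surrounding variables):

# Shape of one appended completion entry (keys per the Outputs table; values illustrative)
example["completions"].append({
    "model": model_type,
    "principle": principle,                 # assumed variable holding the sampled principle
    "custom_system_prompt": system_prompt,  # assumed variable holding the system prompt used
    "response": response,
})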
Usage Examples
HuggingFace Backend (Sequential)
# Called via dataset.map
dataset = dataset.map(instruction_completion, desc=f"{model_type} on {subset}")
# Each example gets one completion appended to example["completions"]
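The generator global referenced here is loaded once before the map. A minimal sketch, assuming API models are identified by name and using a hypothetical model_name_or_path (the actual loading logic in main.py may differ):

import torch
from transformers import pipeline

if model_type in {"gpt-3.5-turbo", "gpt-4"}:  # assumption: API model names
    generator = API_Caller(model_type)        # assumption: constructor takes the model name
else:
    generator = pipeline("text-generation",
                         model=model_name_or_path,  # hypothetical local model path/name
                         torch_dtype=torch.bfloat16, device_map="auto")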
vLLM Backend (Batch)
from vllm import SamplingParams

# Batch inference over the full dataset; `stop` is the model-specific stop-token list
sampling_params = SamplingParams(temperature=1, top_p=1, max_tokens=1024, stop=stop)
responses = generator.generate(dataset["prompt"], sampling_params)
responses = [r.outputs[0].text.strip().rstrip("</s>").strip() for r in responses]

# Merge the responses back into the last completion entry of each example
dataset = dataset.add_column("response", responses)
dataset = dataset.map(lambda x: {
    "completions": x["completions"][:-1] + [
        dict(x["completions"][-1], **{"response": x["response"]})
    ]
})
dataset = dataset.remove_columns(["prompt", "response"])
Related Pages
Implements Principle
Requires Environment
- Environment:OpenBMB_UltraFeedback_Python_GPU_Environment
- Environment:OpenBMB_UltraFeedback_vLLM_Multi_GPU_Environment
- Environment:OpenBMB_UltraFeedback_OpenAI_API_Environment