Implementation:OpenBMB UltraFeedback Load Generator
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
Concrete tool for loading language-model generators via a factory function that dispatches to an OpenAI API caller, a HuggingFace pipeline, or a vLLM backend based on the model type.
Description
The load_generator function is defined in both main.py (HuggingFace backend) and main_vllm.py (vLLM backend). The HuggingFace version handles three cases: API model types return an API_Caller instance; StarChat and MPT/Falcon models are loaded with pipeline and appropriate dtype/trust_remote_code settings; and LLaMA-family models are loaded explicitly via LlamaForCausalLM.from_pretrained plus LlamaTokenizer. The vLLM version uses a single vllm.LLM constructor with tensor parallelism across all visible GPUs.
The API_Caller class (main.py:L94-118) wraps openai.ChatCompletion.create with retry logic (up to 20 retries) and provides a __call__ interface matching the pipeline pattern.
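The body of API_Caller.__call__ is elided in the signature below; a minimal sketch of the retry pattern it describes is shown here. This is a generic illustration, not the repository's code: the names call_with_retries, fn, max_retries, and delay are hypothetical, and the real implementation wraps openai.ChatCompletion.create directly.

```python
import time

def call_with_retries(fn, max_retries=20, delay=1.0):
    """Sketch of the retry pattern API_Caller.__call__ uses around
    openai.ChatCompletion.create: retry on any exception, up to
    max_retries attempts, re-raising the last error on exhaustion."""
    last_err = None
    for _ in range(max_retries):
        try:
            return fn()  # in the real code: the OpenAI API call
        except Exception as err:  # the real code catches API errors
            last_err = err
            time.sleep(delay)  # brief pause before retrying
    raise last_err
```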
Usage
Call load_generator(model_type) at the start of a generation run. The returned object is used as generator(prompt, ...) for local models or generator(system_prompt, user_prompt) for API models.
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/comparison_data_generation/main.py (Lines 94-118 for API_Caller, Lines 135-151 for load_generator)
- File: src/comparison_data_generation/main_vllm.py (Lines 87-95 for load_generator)
Signature
```python
# HuggingFace backend (main.py)
class API_Caller:
    def __init__(self, model: str):
        self.model = model

    def __call__(self, system_prompt: str, user_prompt: str) -> str:
        """Calls openai.ChatCompletion.create with retry logic (up to 20 attempts).
        Returns the generated content string."""
        ...

def load_generator(model_type: str):
    """Factory function returning a generator object.

    Args:
        model_type: Model identifier (e.g., 'gpt-4', 'ultralm-13b', 'starchat')
    Returns:
        API_Caller for GPT models, a HuggingFace pipeline otherwise
    """
    if model_type in ["gpt-4", "gpt-3.5-turbo"]:
        return API_Caller(model_type)
    else:
        ckpt = model_path[model_type]
        if model_type == "starchat":
            return pipeline("text-generation", model=ckpt, tokenizer=ckpt,
                            torch_dtype=torch.bfloat16, device_map="auto")
        elif model_type in ["mpt-30b-chat", "falcon-40b-instruct"]:
            return pipeline(model=ckpt, tokenizer=ckpt, device_map="auto",
                            trust_remote_code=True)
        else:  # llama-series
            model = LlamaForCausalLM.from_pretrained(ckpt, device_map="auto")
            tokenizer = LlamaTokenizer.from_pretrained(ckpt)
            return pipeline("text-generation", model=model, tokenizer=tokenizer)
```

```python
# vLLM backend (main_vllm.py)
def load_generator(model_type: str):
    """Factory function returning a vLLM LLM instance.

    Args:
        model_type: Model identifier
    Returns:
        vllm.LLM instance with tensor parallelism across all visible GPUs
    """
    ckpt = model_path[model_type]
    dtype = "bfloat16" if model_type in ["starchat", "mpt-30b-chat", "falcon-40b-instruct"] else "auto"
    return LLM(ckpt, gpu_memory_utilization=0.95, swap_space=1,
               tensor_parallel_size=torch.cuda.device_count(),
               trust_remote_code=True, dtype=dtype)
```
Import
```python
# HuggingFace backend
import torch
import openai
from transformers import pipeline, LlamaForCausalLM, LlamaTokenizer
```

```python
# vLLM backend
import torch
from vllm import LLM
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_type | str | Yes | Model identifier key (e.g., "gpt-4", "ultralm-13b", "starchat", "falcon-40b-instruct") |
| model_path | Dict[str, str] | Yes | Module-level dict mapping model_type to HuggingFace checkpoint path |
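To illustrate the expected shape of the module-level model_path dict, here is a hypothetical example; the checkpoint paths shown are placeholders, not the repository's actual values.

```python
# Hypothetical model_path mapping; the real checkpoint paths
# live in the UltraFeedback source, not shown here.
model_path = {
    "ultralm-13b": "/path/to/ultralm-13b",            # placeholder path
    "starchat": "/path/to/starchat",                  # placeholder path
    "falcon-40b-instruct": "/path/to/falcon-40b",     # placeholder path
}

# load_generator resolves the checkpoint with a plain dict lookup:
ckpt = model_path["ultralm-13b"]
```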
Outputs
| Name | Type | Description |
|---|---|---|
| generator | Union[API_Caller, pipeline, LLM] | Callable generator: API_Caller for GPT models, HF pipeline for local HF models, vLLM LLM for vLLM backend |
Usage Examples
HuggingFace Backend
```python
from main import load_generator

# Load a LLaMA-family model; returns a HuggingFace pipeline
generator = load_generator("ultralm-13b")

# Generate a completion
response = generator(
    prompt,
    num_return_sequences=1,
    return_full_text=False,
    temperature=1.0,
    top_p=1.0,
    max_new_tokens=1024,
    do_sample=True,
)
text = response[0]["generated_text"]
```
vLLM Backend
```python
from main_vllm import load_generator
from vllm import SamplingParams

# Load model with tensor parallelism; returns a vllm.LLM instance
generator = load_generator("ultralm-13b")

# Batch inference over a list of prompts
sampling_params = SamplingParams(temperature=1, top_p=1, max_tokens=1024)
responses = generator.generate(prompts, sampling_params)
```
API Backend
```python
from main import load_generator

# Load API caller for GPT-4; returns an API_Caller instance
generator = load_generator("gpt-4")

# Generate via the OpenAI API
response = generator(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
)
```
Related Pages
Implements Principle
Requires Environment
- Environment:OpenBMB_UltraFeedback_Python_GPU_Environment
- Environment:OpenBMB_UltraFeedback_vLLM_Multi_GPU_Environment
- Environment:OpenBMB_UltraFeedback_OpenAI_API_Environment