Implementation:OpenBMB UltraFeedback Load Generator
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
Concrete tool for loading language-model generators via a factory function that dispatches to an OpenAI API caller, a HuggingFace pipeline, or a vLLM backend based on the model type.
Description
The load_generator function is defined in both main.py (HuggingFace backend) and main_vllm.py (vLLM backend). The HuggingFace version handles three cases: API model types return an API_Caller instance; StarChat and MPT/Falcon models are loaded with pipeline and appropriate dtype/trust_remote_code settings; and LLaMA-family models are loaded explicitly via LlamaForCausalLM.from_pretrained plus LlamaTokenizer. The vLLM version uses a single vllm.LLM constructor with tensor parallelism across all visible GPUs.
The API_Caller class (main.py:L94-118) wraps openai.ChatCompletion.create with retry logic (up to 20 retries) and provides a __call__ interface matching the pipeline pattern.
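The body of API_Caller.__call__ is elided in the signature below; a minimal sketch of the retry pattern it describes is shown here. This is a generic illustration, not the repository's code: the names call_with_retries, fn, max_retries, and delay are hypothetical, and the real implementation wraps openai.ChatCompletion.create directly.

```python
import time

def call_with_retries(fn, max_retries=20, delay=1.0):
    """Sketch of the retry pattern API_Caller.__call__ uses around
    openai.ChatCompletion.create: retry on any exception, up to
    max_retries attempts, re-raising the last error on exhaustion."""
    last_err = None
    for _ in range(max_retries):
        try:
            return fn()  # in the real code: the OpenAI API call
        except Exception as err:  # the real code catches API errors
            last_err = err
            time.sleep(delay)  # brief pause before retrying
    raise last_err
```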
Usage
Call load_generator(model_type) at the start of a generation run. The returned object is used as generator(prompt, ...) for local models or generator(system_prompt, user_prompt) for API models.
Code Reference
Source Location
- Repository: UltraFeedback
- File: src/comparison_data_generation/main.py (Lines 94-118 for API_Caller, Lines 135-151 for load_generator)
- File: src/comparison_data_generation/main_vllm.py (Lines 87-95 for load_generator)
Signature
```python
# HuggingFace backend (main.py)
class API_Caller:
    def __init__(self, model: str):
        self.model = model

    def __call__(self, system_prompt: str, user_prompt: str) -> str:
        """Calls openai.ChatCompletion.create with retry logic (up to 20 attempts).
        Returns the generated content string."""
        ...

def load_generator(model_type: str):
    """Factory function returning a generator object.

    Args:
        model_type: Model identifier (e.g., 'gpt-4', 'ultralm-13b', 'starchat')
    Returns:
        API_Caller for GPT models, a HuggingFace pipeline otherwise
    """
    if model_type in ["gpt-4", "gpt-3.5-turbo"]:
        return API_Caller(model_type)
    else:
        ckpt = model_path[model_type]
        if model_type == "starchat":
            return pipeline("text-generation", model=ckpt, tokenizer=ckpt,
                            torch_dtype=torch.bfloat16, device_map="auto")
        elif model_type in ["mpt-30b-chat", "falcon-40b-instruct"]:
            return pipeline(model=ckpt, tokenizer=ckpt, device_map="auto",
                            trust_remote_code=True)
        else:  # llama-series
            model = LlamaForCausalLM.from_pretrained(ckpt, device_map="auto")
            tokenizer = LlamaTokenizer.from_pretrained(ckpt)
            return pipeline("text-generation", model=model, tokenizer=tokenizer)
```

```python
# vLLM backend (main_vllm.py)
def load_generator(model_type: str):
    """Factory function returning a vLLM LLM instance.

    Args:
        model_type: Model identifier
    Returns:
        vllm.LLM instance with tensor parallelism across all visible GPUs
    """
    ckpt = model_path[model_type]
    dtype = "bfloat16" if model_type in ["starchat", "mpt-30b-chat", "falcon-40b-instruct"] else "auto"
    return LLM(ckpt, gpu_memory_utilization=0.95, swap_space=1,
               tensor_parallel_size=torch.cuda.device_count(),
               trust_remote_code=True, dtype=dtype)
```
Import
```python
# HuggingFace backend
import torch
import openai
from transformers import pipeline, LlamaForCausalLM, LlamaTokenizer
```

```python
# vLLM backend
import torch
from vllm import LLM
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_type | str | Yes | Model identifier key (e.g., "gpt-4", "ultralm-13b", "starchat", "falcon-40b-instruct") |
| model_path | Dict[str, str] | Yes | Module-level dict mapping model_type to HuggingFace checkpoint path |
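To illustrate the expected shape of the module-level model_path dict, here is a hypothetical example; the checkpoint paths shown are placeholders, not the repository's actual values.

```python
# Hypothetical model_path mapping; the real checkpoint paths
# live in the UltraFeedback source, not shown here.
model_path = {
    "ultralm-13b": "/path/to/ultralm-13b",            # placeholder path
    "starchat": "/path/to/starchat",                  # placeholder path
    "falcon-40b-instruct": "/path/to/falcon-40b",     # placeholder path
}

# load_generator resolves the checkpoint with a plain dict lookup:
ckpt = model_path["ultralm-13b"]
```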
Outputs
| Name | Type | Description |
|---|---|---|
| generator | Union[API_Caller, pipeline, LLM] | Callable generator: API_Caller for GPT models, HF pipeline for local HF models, vLLM LLM for vLLM backend |
Usage Examples
HuggingFace Backend
```python
from main import load_generator

# Load a LLaMA-family model; returns a HuggingFace pipeline
generator = load_generator("ultralm-13b")

# Generate a completion
response = generator(
    prompt,
    num_return_sequences=1,
    return_full_text=False,
    temperature=1.0,
    top_p=1.0,
    max_new_tokens=1024,
    do_sample=True,
)
text = response[0]["generated_text"]
```
vLLM Backend
```python
from main_vllm import load_generator
from vllm import SamplingParams

# Load model with tensor parallelism; returns a vllm.LLM instance
generator = load_generator("ultralm-13b")

# Batch inference over a list of prompts
sampling_params = SamplingParams(temperature=1, top_p=1, max_tokens=1024)
responses = generator.generate(prompts, sampling_params)
```
API Backend
```python
from main import load_generator

# Load API caller for GPT-4; returns an API_Caller instance
generator = load_generator("gpt-4")

# Generate via the OpenAI API
response = generator(
    system_prompt="You are a helpful assistant.",
    user_prompt="Explain quantum computing.",
)
```
Related Pages
Implements Principle
Requires Environment
- Environment:OpenBMB_UltraFeedback_Python_GPU_Environment
- Environment:OpenBMB_UltraFeedback_vLLM_Multi_GPU_Environment
- Environment:OpenBMB_UltraFeedback_OpenAI_API_Environment