Principle: OpenBMB UltraFeedback Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Inference, Model_Serving |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A multi-backend model loading strategy that dynamically selects the appropriate inference engine (API, HuggingFace Pipeline, or vLLM) based on model type.
Description
Model Loading in the UltraFeedback pipeline is implemented as a dispatch pattern that routes model loading to one of three backends depending on the model identifier:
- API Backend: Commercial models (GPT-4, GPT-3.5-turbo) are wrapped in an API caller class that communicates via the OpenAI ChatCompletion API. No local model weights are loaded.
- HuggingFace Pipeline Backend: Open-source models are loaded using transformers.pipeline with model-specific configurations. LLaMA-family models use LlamaForCausalLM.from_pretrained with a separate LlamaTokenizer, while other architectures (StarChat, MPT, Falcon) use auto-detection with trust_remote_code=True.
- vLLM Backend: An alternative high-throughput backend using vllm.LLM with tensor parallelism across all available GPUs and 95% GPU memory utilization.
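The per-backend choices above can be sketched as a pure-Python selection function. This is an illustrative sketch, not the pipeline's actual code: the helper name `loading_plan` and the returned dictionary shape are assumptions, and no real weights are loaded (the function only reports which classes and kwargs the dispatch would use).

```python
def loading_plan(model_type: str, backend: str = "huggingface", num_gpus: int = 1) -> dict:
    """Describe how a given model identifier would be loaded (sketch only)."""
    API_MODELS = {"gpt-4", "gpt-3.5-turbo"}  # commercial, API-only models
    if model_type in API_MODELS:
        # API backend: no local weights, just an OpenAI ChatCompletion wrapper
        return {"backend": "api", "loads_weights": False}
    if backend == "vllm":
        # vLLM: tensor parallelism across all GPUs, 95% GPU memory utilization
        return {"backend": "vllm",
                "kwargs": {"tensor_parallel_size": num_gpus,
                           "gpu_memory_utilization": 0.95}}
    if "llama" in model_type.lower():
        # LLaMA family: explicit model/tokenizer classes
        return {"backend": "huggingface",
                "model_class": "LlamaForCausalLM",
                "tokenizer_class": "LlamaTokenizer"}
    # StarChat, MPT, Falcon, etc.: auto-detection with remote code allowed
    return {"backend": "huggingface",
            "model_class": "AutoModelForCausalLM",
            "kwargs": {"trust_remote_code": True}}
```

Keeping the decision logic separate from the actual loading calls makes the dispatch testable without GPUs or network access.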
The key design insight is that a single generation pipeline must handle 17 models spanning fundamentally different architectures and access patterns, requiring a unified interface that abstracts away backend differences.
Usage
Use this principle when building multi-model inference pipelines where models span commercial APIs and various open-source architectures. The dispatch pattern allows a single CLI argument (model_type) to control which backend and loading configuration is used.
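A minimal sketch of how a single CLI argument can drive the dispatch, assuming hypothetical names (`choose_backend`, `--use_vllm`) that are not from the original pipeline:

```python
import argparse

API_MODELS = {"gpt-4", "gpt-3.5-turbo"}

def choose_backend(model_type: str, prefer_vllm: bool = False) -> str:
    """Map a model identifier to the backend that should load it."""
    if model_type in API_MODELS:
        return "api"  # commercial models never load local weights
    return "vllm" if prefer_vllm else "huggingface"

def parse_args(argv=None):
    p = argparse.ArgumentParser()
    p.add_argument("--model_type", required=True)
    p.add_argument("--use_vllm", action="store_true")
    return p.parse_args(argv)
```

The caller never names a backend class directly; the string argument alone determines the construction path.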
Theoretical Basis
The loading strategy follows a factory pattern where the model_type string determines which construction path is taken. This is necessary because:
- API models require network configuration, not GPU memory
- Different model architectures need different tokenizer and model class combinations
- Tensor parallelism configuration varies by backend (HF uses device_map="auto", vLLM uses explicit tensor_parallel_size)
Pseudo-code Logic:
    # Abstract algorithm: backend must be passed in (or read from config),
    # since API models short-circuit before the backend check
    def load_generator(model_type: str, backend: str) -> Generator:
        if model_type in API_MODELS:
            return API_Caller(model_type)  # no local weights
        elif backend == "huggingface":
            model, tokenizer = load_hf_model(model_type)
            return pipeline("text-generation", model=model, tokenizer=tokenizer)
        elif backend == "vllm":
            return LLM(model_type, tensor_parallel_size=num_gpus)
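The abstract algorithm above can be made concrete with stub backend classes. This is a runnable miniature of the factory pattern, not the pipeline's real code: the class names are illustrative, and each stub stands in for the real wrapper (OpenAI ChatCompletion caller, transformers.pipeline, or vllm.LLM) while exposing the same unified `generate()` interface.

```python
API_MODELS = {"gpt-4", "gpt-3.5-turbo"}

class APICaller:
    """Stub for the OpenAI API wrapper (no local weights)."""
    def __init__(self, model_type):
        self.model_type = model_type
    def generate(self, prompt):
        return f"[api:{self.model_type}] {prompt}"

class HFPipeline:
    """Stub for a transformers.pipeline('text-generation') wrapper."""
    def __init__(self, model_type):
        self.model_type = model_type
    def generate(self, prompt):
        return f"[hf:{self.model_type}] {prompt}"

class VLLMEngine:
    """Stub for vllm.LLM with explicit tensor parallelism."""
    def __init__(self, model_type, tensor_parallel_size=1):
        self.model_type = model_type
        self.tp = tensor_parallel_size
    def generate(self, prompt):
        return f"[vllm:{self.model_type}] {prompt}"

def load_generator(model_type, backend="huggingface", num_gpus=1):
    """Dispatch on model_type first, then on the requested backend."""
    if model_type in API_MODELS:
        return APICaller(model_type)
    if backend == "vllm":
        return VLLMEngine(model_type, tensor_parallel_size=num_gpus)
    return HFPipeline(model_type)
```

Because all three classes share the `generate()` signature, downstream generation code is identical regardless of which backend was selected.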