
Principle:OpenBMB UltraFeedback Model Loading

From Leeroopedia


Knowledge Sources
Domains NLP, Inference, Model_Serving
Last Updated 2023-10-02 00:00 GMT

Overview

A multi-backend model loading strategy that dynamically selects the appropriate inference engine (API, HuggingFace Pipeline, or vLLM) based on model type.

Description

Model Loading in the UltraFeedback pipeline is implemented as a dispatch pattern that routes model loading to one of three backends depending on the model identifier:

  1. API Backend: Commercial models (GPT-4, GPT-3.5-turbo) are wrapped in an API caller class that communicates via the OpenAI ChatCompletion API. No local model weights are loaded.
  2. HuggingFace Pipeline Backend: Open-source models are loaded using transformers.pipeline with model-specific configurations. LLaMA-family models use LlamaForCausalLM.from_pretrained with a separate LlamaTokenizer, while other architectures (StarChat, MPT, Falcon) use auto-detection with trust_remote_code=True.
  3. vLLM Backend: An alternative high-throughput backend using vllm.LLM with tensor parallelism across all available GPUs and 95% GPU memory utilization.
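The HuggingFace branch above can be sketched as follows. This is a minimal illustration of the per-architecture dispatch, not the project's exact code: `load_hf_model` and `needs_remote_code` are hypothetical helper names, and no checkpoint is actually downloaded here.

```python
def needs_remote_code(model_type: str) -> bool:
    """StarChat, MPT, and Falcon ship custom modeling code, so
    auto-detection must be allowed to execute it; LLaMA-family
    models use the dedicated transformers classes instead."""
    return "llama" not in model_type.lower()

def load_hf_model(model_type: str, checkpoint: str):
    # Import deferred so the dispatch logic above stays lightweight.
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              LlamaForCausalLM, LlamaTokenizer, pipeline)

    if not needs_remote_code(model_type):
        # LLaMA family: explicit model class plus separate tokenizer.
        tokenizer = LlamaTokenizer.from_pretrained(checkpoint)
        model = LlamaForCausalLM.from_pretrained(checkpoint, device_map="auto")
    else:
        # Other architectures: auto-detection with trust_remote_code=True.
        tokenizer = AutoTokenizer.from_pretrained(checkpoint,
                                                  trust_remote_code=True)
        model = AutoModelForCausalLM.from_pretrained(
            checkpoint, device_map="auto", trust_remote_code=True)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)
```

The substring check on the model identifier is the simplest form of the dispatch; a production pipeline might instead keep an explicit mapping from model name to loader configuration.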

The key design insight is that a single generation pipeline must handle 17 models spanning fundamentally different architectures and access patterns, requiring a unified interface that abstracts away backend differences.

Usage

Use this principle when building multi-model inference pipelines where models span commercial APIs and various open-source architectures. The dispatch pattern allows a single CLI argument (model_type) to control which backend and loading configuration is used.
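A minimal sketch of that single-argument dispatch, assuming an illustrative `API_MODELS` set and a `use_vllm` switch (both placeholders, not the project's exact names):

```python
# Commercial models routed to the API backend; list is illustrative.
API_MODELS = {"gpt-4", "gpt-3.5-turbo"}

def select_backend(model_type: str, use_vllm: bool = False) -> str:
    """Map a single model_type CLI argument to a backend name."""
    if model_type in API_MODELS:
        return "api"  # no local weights, network config only
    return "vllm" if use_vllm else "huggingface"
```

Keeping the decision in one small function means new models only require extending the lookup, not touching the generation loop.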

Theoretical Basis

The loading strategy follows a factory pattern where the model_type string determines which construction path is taken. This is necessary because:

  • API models require network configuration, not GPU memory
  • Different model architectures need different tokenizer and model class combinations
  • Tensor parallelism configuration varies by backend (HF uses device_map="auto", vLLM uses explicit tensor_parallel_size)
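The contrast in the last bullet can be made concrete with the keyword arguments each backend receives. This is a hedged sketch: the helper names are hypothetical, and the values mirror the description above (all available GPUs, 95% memory utilization).

```python
def vllm_kwargs(checkpoint: str, num_gpus: int) -> dict:
    """vLLM takes explicit parallelism and memory settings."""
    return {
        "model": checkpoint,
        "tensor_parallel_size": num_gpus,      # shard across all GPUs
        "gpu_memory_utilization": 0.95,        # 95% per the description
    }

def hf_kwargs(checkpoint: str) -> dict:
    """HuggingFace instead shards automatically via device_map."""
    return {
        "pretrained_model_name_or_path": checkpoint,
        "device_map": "auto",
    }
```

In practice `num_gpus` would come from something like `torch.cuda.device_count()`; it is passed explicitly here to keep the sketch self-contained.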

Pseudo-code Logic:

# Abstract algorithm
def load_generator(model_type: str, backend: str = "huggingface") -> Generator:
    if model_type in API_MODELS:
        return API_Caller(model_type)  # no local weights loaded
    elif backend == "huggingface":
        model, tokenizer = load_hf_model(model_type)
        return pipeline("text-generation", model=model, tokenizer=tokenizer)
    elif backend == "vllm":
        return LLM(checkpoint, tensor_parallel_size=num_gpus)

Related Pages

Implemented By
