
Environment: OpenBMB UltraFeedback Python GPU Environment

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Deep_Learning, NLP
Last Updated: 2026-02-08 06:00 GMT

Overview

Linux environment with CUDA-capable GPU, Python 3.8+, PyTorch, and HuggingFace Transformers 4.31.0 for local model inference.

Description

This environment provides the GPU-accelerated context required for running local LLM inference using the HuggingFace Transformers pipeline. It is used by the completion generation workflow in main.py, which loads models such as LLaMA, Vicuna, Alpaca, WizardLM, StarChat, MPT, and Falcon via the `pipeline()` API with `device_map="auto"`. The environment requires CUDA-capable NVIDIA GPUs with sufficient VRAM to host 7B to 65B parameter models. Models are loaded with `torch.bfloat16` precision where applicable, and `trust_remote_code=True` is used for MPT and Falcon models.

Usage

Use this environment for any Completion Generation workflow that runs local model inference via the HuggingFace Transformers backend. It is the mandatory prerequisite for running the Load_Generator and Multi_Backend_Inference implementations when using non-API model types. The vLLM backend has its own separate environment.

System Requirements

| Category | Requirement | Notes |
|----------|-------------|-------|
| OS | Linux (Ubuntu recommended) | CUDA toolkit requires Linux for multi-GPU setups |
| Hardware | NVIDIA GPU with CUDA support | Minimum 16GB VRAM for 7B models; 40GB+ for 13B-65B models |
| Hardware | Multiple GPUs recommended | `device_map="auto"` distributes layers across available GPUs |
| Disk | 100GB+ SSD | Model weights for a 13B model are ~26GB; larger models require more |
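
The VRAM and disk figures above follow from simple arithmetic: parameter count times bytes per parameter. A back-of-envelope sketch (weights only, excluding activations and KV cache; `weight_footprint_gb` is an illustrative helper, not from the repository):

```python
def weight_footprint_gb(n_params_billion: float, bytes_per_param: int = 2) -> float:
    # 2 bytes/param for fp16/bf16, 4 for fp32; result in decimal GB.
    return n_params_billion * 1e9 * bytes_per_param / 1e9

print(weight_footprint_gb(13))  # ~26GB for a 13B model, matching the table above
```

Real memory use during inference is higher, since activations and the KV cache grow with batch size and sequence length.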

Dependencies

System Packages

  • `cuda-toolkit` (CUDA runtime, compatible with PyTorch build)
  • `git-lfs` (for downloading large model files from HuggingFace Hub)

Python Packages

  • `torch` (with CUDA support)
  • `transformers==4.31.0` (pinned in run.sh)
  • `tokenizers==0.13.3` (pinned in run.sh)
  • `deepspeed==0.10.0` (pinned in run.sh)
  • `accelerate` (latest; installed with the `-U` flag)
  • `datasets`
  • `pandas`
  • `numpy`
  • `tqdm`

Credentials

The following environment variables or configurations may be needed:

  • `HF_TOKEN`: HuggingFace API token (if downloading gated models like LLaMA-2 from the Hub).
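
A minimal sketch of reading the token from the environment, assuming you export `HF_TOKEN` in your shell before running; `get_hf_token` is an illustrative helper, not part of the repository:

```python
import os
from typing import Optional  # Optional[...] keeps this compatible with Python 3.8+

def get_hf_token() -> Optional[str]:
    # Gated checkpoints (e.g. LLaMA-2) cannot be downloaded without this token.
    return os.environ.get("HF_TOKEN")

if get_hf_token() is None:
    print("HF_TOKEN not set; gated models on the Hub will be inaccessible")
```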

Quick Install

# Install pinned dependencies (as specified in run.sh)
pip install transformers==4.31.0
pip install tokenizers==0.13.3
pip install deepspeed==0.10.0
pip install accelerate -U

# Additional runtime dependencies
pip install torch datasets pandas numpy tqdm
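
After installing, the pins can be sanity-checked from Python. A sketch using the standard-library `importlib.metadata` (available since Python 3.8); `check_pins` is an illustrative helper, not part of the repository:

```python
from importlib.metadata import version, PackageNotFoundError

# Pins from run.sh; torch and accelerate are intentionally left unpinned there.
PINS = {"transformers": "4.31.0", "tokenizers": "0.13.3", "deepspeed": "0.10.0"}

def check_pins(pins, get_version=version):
    """Return {package: (expected, installed)} for every mismatch or missing package."""
    mismatches = {}
    for pkg, expected in pins.items():
        try:
            installed = get_version(pkg)
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[pkg] = (expected, installed)
    return mismatches

if __name__ == "__main__":
    for pkg, (want, got) in check_pins(PINS).items():
        print(f"{pkg}: expected {want}, found {got}")
```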

Code Evidence

Pinned dependency installation from `run.sh:1-4`:

pip install transformers==4.31.0
pip install tokenizers==0.13.3
pip install deepspeed==0.10.0
pip install accelerate -U

GPU model loading with `device_map="auto"` from `main.py:142-149`:

if model_type == "starchat":
    generator = pipeline("text-generation", model=ckpt, tokenizer=ckpt, torch_dtype=torch.bfloat16, device_map="auto")
else: # llama-series
    if model_type in ["mpt-30b-chat", "falcon-40b-instruct"]:
        generator = pipeline(model=ckpt, tokenizer=ckpt, device_map="auto", trust_remote_code=True)
    else:
        model = LlamaForCausalLM.from_pretrained(ckpt, device_map="auto")
        tokenizer = LlamaTokenizer.from_pretrained(ckpt)
        generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

CUDA seed setting from `main.py:18-23`:

def set_seed(seed):
    print("set seed:", seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
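
The helper above seeds the Python, NumPy, and CUDA RNGs so repeated generation runs are comparable. The effect of the pure-Python part can be demonstrated in isolation (the NumPy and torch calls are omitted here so the sketch runs anywhere; `seeded_draws` is illustrative, not from the repository):

```python
import random

def seeded_draws(seed, n=3):
    random.seed(seed)  # mirrors the random.seed(seed) call in set_seed
    return [random.randint(0, 99) for _ in range(n)]

# Re-seeding with the same value replays the identical sequence,
# which is what makes repeated runs comparable.
assert seeded_draws(42) == seeded_draws(42)
```

Note that seeding alone does not guarantee bitwise-identical GPU results; cuDNN settings such as `torch.backends.cudnn.deterministic = True` are additionally required for that, and `set_seed` does not set them.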

Common Errors

| Error Message | Cause | Solution |
|---------------|-------|----------|
| `CUDA out of memory` | Model too large for available VRAM | Use a smaller model or ensure `device_map="auto"` distributes across multiple GPUs |
| `ImportError: LlamaForCausalLM` | Transformers version too old | Install `transformers==4.31.0` as specified in run.sh |
| `trust_remote_code=True required` | MPT or Falcon models need custom code execution | Already handled in code; ensure you trust the model source |

Compatibility Notes

  • bfloat16: Only StarChat explicitly uses `torch.bfloat16` dtype; other models use the default dtype.
  • device_map="auto": Requires `accelerate` package installed. Automatically distributes model layers across available GPUs and CPU RAM if needed.
  • LLaMA models: Use `LlamaForCausalLM` and `LlamaTokenizer` directly rather than the generic `pipeline()` with auto-detection.
  • trust_remote_code: Required for MPT-30B-chat and Falcon-40B-instruct models which have custom architectures.
