# Environment: OpenBMB UltraFeedback Python GPU Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning, NLP |
| Last Updated | 2026-02-08 06:00 GMT |
## Overview
Linux environment with CUDA-capable GPU, Python 3.8+, PyTorch, and HuggingFace Transformers 4.31.0 for local model inference.
## Description
This environment provides the GPU-accelerated context required for running local LLM inference using the HuggingFace Transformers pipeline. It is used by the completion generation workflow in main.py, which loads models such as LLaMA, Vicuna, Alpaca, WizardLM, StarChat, MPT, and Falcon via the `pipeline()` API with `device_map="auto"`. The environment requires CUDA-capable NVIDIA GPUs with sufficient VRAM to host 7B to 65B parameter models. Models are loaded with `torch.bfloat16` precision where applicable, and `trust_remote_code=True` is used for MPT and Falcon models.
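The backend selection described above can be summarized as a small dispatch over `model_type`. The helper below is a hypothetical sketch (the function name and the returned dict shape are ours, not part of `main.py`); it mirrors which keyword arguments each model family receives:

```python
def pipeline_kwargs(model_type, ckpt):
    """Hypothetical sketch of the backend dispatch in main.py.

    Returns the keyword arguments that would be passed to
    transformers.pipeline() (or the LLaMA-specific loader) for
    each local model family.
    """
    if model_type == "starchat":
        # StarChat is the only family explicitly loaded in bfloat16.
        return {"task": "text-generation", "model": ckpt, "tokenizer": ckpt,
                "torch_dtype": "bfloat16", "device_map": "auto"}
    if model_type in ("mpt-30b-chat", "falcon-40b-instruct"):
        # MPT and Falcon ship custom architectures, hence trust_remote_code.
        return {"model": ckpt, "tokenizer": ckpt,
                "device_map": "auto", "trust_remote_code": True}
    # All other (LLaMA-series) models go through
    # LlamaForCausalLM/LlamaTokenizer rather than pipeline auto-detection.
    return {"loader": "LlamaForCausalLM", "model": ckpt,
            "device_map": "auto"}
```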
## Usage
Use this environment for any Completion Generation workflow that runs local model inference via the HuggingFace Transformers backend. It is the mandatory prerequisite for running the Load_Generator and Multi_Backend_Inference implementations when using non-API model types. The vLLM backend has its own separate environment.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | CUDA toolkit requires Linux for multi-GPU setups |
| Hardware | NVIDIA GPU with CUDA support | Minimum 16GB VRAM for 7B models; 40GB+ for 13B-65B models |
| Hardware | Multiple GPUs recommended | `device_map="auto"` distributes layers across available GPUs |
| Disk | 100GB+ SSD | Model weights for 13B model are ~26GB; larger models require more |
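The disk figures in the table follow from simple arithmetic: at 2 bytes per parameter (fp16/bf16 checkpoints), weights alone for a 13B model come to roughly 26GB. A back-of-envelope helper, assuming 2 bytes per parameter and ignoring activations, KV cache, and framework overhead:

```python
def weight_size_gb(n_params_billion, bytes_per_param=2):
    """Rough size of model weights alone, in GB.

    bytes_per_param=2 corresponds to fp16/bf16 checkpoints; actual VRAM
    use is higher once activations and the KV cache are accounted for.
    """
    return n_params_billion * 1e9 * bytes_per_param / 1e9
```

For example, `weight_size_gb(13)` gives 26.0 GB, matching the table; a 65B model at the same precision would need roughly 130GB of weights, which is why `device_map="auto"` spreading layers across multiple GPUs matters at the top of the size range.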
## Dependencies
### System Packages
- `cuda-toolkit` (CUDA runtime, compatible with PyTorch build)
- `git-lfs` (for downloading large model files from HuggingFace Hub)
### Python Packages
- `torch` (with CUDA support)
- `transformers` == 4.31.0 (pinned in run.sh)
- `tokenizers` == 0.13.3 (pinned in run.sh)
- `deepspeed` == 0.10.0 (pinned in run.sh)
- `accelerate` (latest, installed with -U flag)
- `datasets`
- `pandas`
- `numpy`
- `tqdm`
## Credentials
The following environment variables or configurations may be needed:
- `HF_TOKEN`: HuggingFace API token (if downloading gated models like LLaMA-2 from the Hub).
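A minimal sketch of reading the token from the environment so it can be passed to a loader explicitly (the `hf_token` helper is ours; in transformers 4.31.0 the token is passed via the `use_auth_token` argument to `from_pretrained`):

```python
import os

def hf_token():
    """Return the HuggingFace token from the environment, or None.

    Sketch only: pass the result as use_auth_token=... to
    from_pretrained() when downloading gated models such as LLaMA-2.
    """
    return os.environ.get("HF_TOKEN")
```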
## Quick Install

```bash
# Install pinned dependencies (as specified in run.sh)
pip install transformers==4.31.0
pip install tokenizers==0.13.3
pip install deepspeed==0.10.0
pip install accelerate -U

# Additional runtime dependencies
pip install torch datasets pandas numpy tqdm
```
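To confirm the pins took effect, a small check against installed package metadata can help; this is a hypothetical helper (not part of the repository), using only the standard library:

```python
from importlib import metadata
from typing import Optional

def pin_matches(package: str, pinned: str) -> Optional[bool]:
    """True if the installed version equals the pin, None if not installed.

    Example: pin_matches("transformers", "4.31.0") after running the
    Quick Install commands above.
    """
    try:
        return metadata.version(package) == pinned
    except metadata.PackageNotFoundError:
        return None
```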
## Code Evidence

Pinned dependency installation from `run.sh:1-4`:

```bash
pip install transformers==4.31.0
pip install tokenizers==0.13.3
pip install deepspeed==0.10.0
pip install accelerate -U
```
GPU model loading with `device_map="auto"` from `main.py:142-149`:

```python
if model_type == "starchat":
    generator = pipeline("text-generation", model=ckpt, tokenizer=ckpt, torch_dtype=torch.bfloat16, device_map="auto")
else:  # llama-series
    if model_type in ["mpt-30b-chat", "falcon-40b-instruct"]:
        generator = pipeline(model=ckpt, tokenizer=ckpt, device_map="auto", trust_remote_code=True)
    else:
        model = LlamaForCausalLM.from_pretrained(ckpt, device_map="auto")
        tokenizer = LlamaTokenizer.from_pretrained(ckpt)
        generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```
CUDA seed setting from `main.py:18-23`:

```python
def set_seed(seed):
    print("set seed:", seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Model too large for available VRAM | Use a smaller model or ensure `device_map="auto"` distributes across multiple GPUs |
| `ImportError: cannot import name 'LlamaForCausalLM'` | Transformers version too old | Install `transformers==4.31.0` as specified in run.sh |
| `trust_remote_code=True required` | MPT or Falcon models need custom code execution | Already handled in code; ensure you trust the model source |
## Compatibility Notes
- bfloat16: Only StarChat explicitly uses `torch.bfloat16` dtype; other models use the default dtype.
- device_map="auto": Requires `accelerate` package installed. Automatically distributes model layers across available GPUs and CPU RAM if needed.
- LLaMA models: Use `LlamaForCausalLM` and `LlamaTokenizer` directly rather than the generic `pipeline()` with auto-detection.
- trust_remote_code: Required for MPT-30B-chat and Falcon-40B-instruct models which have custom architectures.