# Environment: OpenBMB UltraFeedback Python GPU Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Deep_Learning, NLP |
| Last Updated | 2026-02-08 06:00 GMT |
## Overview
Linux environment with CUDA-capable GPU, Python 3.8+, PyTorch, and HuggingFace Transformers 4.31.0 for local model inference.
## Description
This environment provides the GPU-accelerated context required for running local LLM inference using the HuggingFace Transformers pipeline. It is used by the completion generation workflow in main.py, which loads models such as LLaMA, Vicuna, Alpaca, WizardLM, StarChat, MPT, and Falcon via the `pipeline()` API with `device_map="auto"`. The environment requires CUDA-capable NVIDIA GPUs with sufficient VRAM to host 7B to 65B parameter models. Models are loaded with `torch.bfloat16` precision where applicable, and `trust_remote_code=True` is used for MPT and Falcon models.
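The backend selection described above can be summarized as a small dispatch over `model_type`. The helper below is a hypothetical sketch (the function name and the returned dict shape are ours, not part of `main.py`); it mirrors which keyword arguments each model family receives:

```python
def pipeline_kwargs(model_type, ckpt):
    """Hypothetical sketch of the backend dispatch in main.py.

    Returns the keyword arguments that would be passed to
    transformers.pipeline() (or the LLaMA-specific loader) for
    each local model family.
    """
    if model_type == "starchat":
        # StarChat is the only family explicitly loaded in bfloat16.
        return {"task": "text-generation", "model": ckpt, "tokenizer": ckpt,
                "torch_dtype": "bfloat16", "device_map": "auto"}
    if model_type in ("mpt-30b-chat", "falcon-40b-instruct"):
        # MPT and Falcon ship custom architectures, hence trust_remote_code.
        return {"model": ckpt, "tokenizer": ckpt,
                "device_map": "auto", "trust_remote_code": True}
    # All other (LLaMA-series) models go through
    # LlamaForCausalLM/LlamaTokenizer rather than pipeline auto-detection.
    return {"loader": "LlamaForCausalLM", "model": ckpt,
            "device_map": "auto"}
```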
## Usage
Use this environment for any Completion Generation workflow that runs local model inference via the HuggingFace Transformers backend. It is the mandatory prerequisite for running the Load_Generator and Multi_Backend_Inference implementations when using non-API model types. The vLLM backend has its own separate environment.
## System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux (Ubuntu recommended) | CUDA toolkit requires Linux for multi-GPU setups |
| Hardware | NVIDIA GPU with CUDA support | Minimum 16GB VRAM for 7B models; 40GB+ for 13B-65B models |
| Hardware | Multiple GPUs recommended | `device_map="auto"` distributes layers across available GPUs |
| Disk | 100GB+ SSD | Model weights for 13B model are ~26GB; larger models require more |
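The disk figures in the table follow from simple arithmetic: at 2 bytes per parameter (fp16/bf16 checkpoints), weights alone for a 13B model come to roughly 26GB. A back-of-envelope helper, assuming 2 bytes per parameter and ignoring activations, KV cache, and framework overhead:

```python
def weight_size_gb(n_params_billion, bytes_per_param=2):
    """Rough size of model weights alone, in GB.

    bytes_per_param=2 corresponds to fp16/bf16 checkpoints; actual VRAM
    use is higher once activations and the KV cache are accounted for.
    """
    return n_params_billion * 1e9 * bytes_per_param / 1e9
```

For example, `weight_size_gb(13)` gives 26.0 GB, matching the table; a 65B model at the same precision would need roughly 130GB of weights, which is why `device_map="auto"` spreading layers across multiple GPUs matters at the top of the size range.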
## Dependencies
### System Packages
- `cuda-toolkit` (CUDA runtime, compatible with PyTorch build)
- `git-lfs` (for downloading large model files from HuggingFace Hub)
### Python Packages
- `torch` (with CUDA support)
- `transformers` == 4.31.0 (pinned in run.sh)
- `tokenizers` == 0.13.3 (pinned in run.sh)
- `deepspeed` == 0.10.0 (pinned in run.sh)
- `accelerate` (latest, installed with -U flag)
- `datasets`
- `pandas`
- `numpy`
- `tqdm`
## Credentials
The following environment variables or configurations may be needed:
- `HF_TOKEN`: HuggingFace API token (if downloading gated models like LLaMA-2 from the Hub).
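A minimal sketch of reading the token from the environment so it can be passed to a loader explicitly (the `hf_token` helper is ours; in transformers 4.31.0 the token is passed via the `use_auth_token` argument to `from_pretrained`):

```python
import os

def hf_token():
    """Return the HuggingFace token from the environment, or None.

    Sketch only: pass the result as use_auth_token=... to
    from_pretrained() when downloading gated models such as LLaMA-2.
    """
    return os.environ.get("HF_TOKEN")
```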
## Quick Install

```bash
# Install pinned dependencies (as specified in run.sh)
pip install transformers==4.31.0
pip install tokenizers==0.13.3
pip install deepspeed==0.10.0
pip install accelerate -U

# Additional runtime dependencies
pip install torch datasets pandas numpy tqdm
```
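To confirm the pins took effect, a small check against installed package metadata can help; this is a hypothetical helper (not part of the repository), using only the standard library:

```python
from importlib import metadata
from typing import Optional

def pin_matches(package: str, pinned: str) -> Optional[bool]:
    """True if the installed version equals the pin, None if not installed.

    Example: pin_matches("transformers", "4.31.0") after running the
    Quick Install commands above.
    """
    try:
        return metadata.version(package) == pinned
    except metadata.PackageNotFoundError:
        return None
```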
## Code Evidence

Pinned dependency installation from `run.sh:1-4`:

```bash
pip install transformers==4.31.0
pip install tokenizers==0.13.3
pip install deepspeed==0.10.0
pip install accelerate -U
```
GPU model loading with `device_map="auto"` from `main.py:142-149`:

```python
if model_type == "starchat":
    generator = pipeline("text-generation", model=ckpt, tokenizer=ckpt, torch_dtype=torch.bfloat16, device_map="auto")
else:  # llama-series
    if model_type in ["mpt-30b-chat", "falcon-40b-instruct"]:
        generator = pipeline(model=ckpt, tokenizer=ckpt, device_map="auto", trust_remote_code=True)
    else:
        model = LlamaForCausalLM.from_pretrained(ckpt, device_map="auto")
        tokenizer = LlamaTokenizer.from_pretrained(ckpt)
        generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
```
CUDA seed setting from `main.py:18-23`:

```python
def set_seed(seed):
    print("set seed:", seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
```
## Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `CUDA out of memory` | Model too large for available VRAM | Use a smaller model or ensure `device_map="auto"` distributes across multiple GPUs |
| `ImportError: cannot import name 'LlamaForCausalLM'` | Transformers version too old | Install `transformers==4.31.0` as specified in run.sh |
| `trust_remote_code=True required` | MPT or Falcon models need custom code execution | Already handled in code; ensure you trust the model source |
## Compatibility Notes
- bfloat16: Only StarChat explicitly uses `torch.bfloat16` dtype; other models use the default dtype.
- device_map="auto": Requires `accelerate` package installed. Automatically distributes model layers across available GPUs and CPU RAM if needed.
- LLaMA models: Use `LlamaForCausalLM` and `LlamaTokenizer` directly rather than the generic `pipeline()` with auto-detection.
- trust_remote_code: Required for MPT-30B-chat and Falcon-40B-instruct models which have custom architectures.