Principle: OpenBMB UltraFeedback Environment Setup
| Knowledge Sources | |
|---|---|
| Domains | DevOps, ML_Infrastructure |
| Last Updated | 2023-10-02 00:00 GMT |
Overview
A dependency management and environment configuration strategy for setting up the two inference backends (HuggingFace and vLLM) used in the UltraFeedback generation pipeline.
Description
Environment Setup covers the installation and configuration of Python packages and environment variables required to run the UltraFeedback completion generation pipeline. Two distinct environment configurations exist:
HuggingFace Backend:
- Pinned versions: transformers==4.31.0, tokenizers==0.13.3, deepspeed==0.10.0
- Latest accelerate (installed with the -U upgrade flag)
- Sequential single-model inference
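A minimal install sketch for this configuration (assuming a pip-based environment; the exact commands are not specified in the source):

```shell
# HF backend: pin transformers/tokenizers/deepspeed for reproducibility,
# but take the latest accelerate release.
pip install transformers==4.31.0 tokenizers==0.13.3 deepspeed==0.10.0
pip install -U accelerate
```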
vLLM Backend:
- Latest versions of transformers, tokenizers, deepspeed, accelerate, and vllm
- Environment variables:
  - NCCL_IGNORE_DISABLED_P2P=1 (for multi-GPU NCCL communication)
  - RAY_memory_monitor_refresh_ms=0 (disables Ray memory monitoring)
  - CUDA_LAUNCH_BLOCKING=1 (synchronous CUDA execution for debugging)
- Tensor-parallel multi-GPU inference
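The corresponding install sketch (again assuming pip; left unpinned so each package resolves to its latest release):

```shell
# vLLM backend: latest versions of the full stack.
pip install -U transformers tokenizers deepspeed accelerate vllm
```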
The key difference is that the HF backend uses pinned dependency versions for reproducibility, while the vLLM backend uses latest versions to benefit from ongoing vLLM performance improvements.
Usage
Choose the HF backend environment for single-GPU sequential inference with reproducible dependency versions. Choose the vLLM backend environment for multi-GPU batched inference with higher throughput.
Theoretical Basis
The two environments represent a trade-off between reproducibility (pinned versions) and performance (latest optimizations). The vLLM backend requires additional environment variables because:
- NCCL_IGNORE_DISABLED_P2P=1: Works around NCCL initialization failures on GPU topologies where peer-to-peer communication is disabled
- RAY_memory_monitor_refresh_ms=0: Disables the memory monitor in Ray (vLLM's distributed runtime) so it does not kill worker processes under memory pressure
- CUDA_LAUNCH_BLOCKING=1: Forces synchronous CUDA kernel launches so errors are reported at the offending call, at the cost of throughput
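Because NCCL and Ray read these variables at initialization time, they must be exported before the pipeline process starts. A minimal launch-shell sketch:

```shell
# Set the workaround variables in the shell that will launch the pipeline.
export NCCL_IGNORE_DISABLED_P2P=1       # tolerate GPU topologies with P2P disabled
export RAY_memory_monitor_refresh_ms=0  # keep Ray's memory monitor from killing workers
export CUDA_LAUNCH_BLOCKING=1           # synchronous launches for precise error reporting
# ...then launch the generation script from this shell.
```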