
Principle:OpenBMB UltraFeedback Environment Setup

From Leeroopedia


Knowledge Sources
Domains DevOps, ML_Infrastructure
Last Updated 2023-10-02 00:00 GMT

Overview

A dependency management and environment configuration strategy for setting up the two inference backends (HuggingFace and vLLM) used in the UltraFeedback generation pipeline.

Description

Environment Setup covers the installation and configuration of Python packages and environment variables required to run the UltraFeedback completion generation pipeline. Two distinct environment configurations exist:

HuggingFace Backend:

  • Pinned versions: transformers==4.31.0, tokenizers==0.13.3, deepspeed==0.10.0
  • Latest accelerate (-U flag)
  • Sequential single-model inference
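The pinned HF environment can be reproduced with pip commands along these lines (a sketch based on the version list above; verify against the UltraFeedback repository's own requirements before use):

```shell
# HF backend: pinned versions for reproducibility.
pip install transformers==4.31.0 tokenizers==0.13.3 deepspeed==0.10.0
# accelerate is intentionally unpinned; -U upgrades to the latest release.
pip install -U accelerate
```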

vLLM Backend:

  • Latest versions of transformers, tokenizers, deepspeed, accelerate, and vllm
  • NCCL_IGNORE_DISABLED_P2P=1 (enables multi-GPU NCCL communication on topologies where peer-to-peer is disabled)
  • RAY_memory_monitor_refresh_ms=0 (disables Ray memory monitoring)
  • CUDA_LAUNCH_BLOCKING=1 (synchronous CUDA execution for debugging)
  • Tensor-parallel multi-GPU inference
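A corresponding setup for the vLLM backend, left unpinned so pip resolves the latest releases (a sketch; the package names are those listed above):

```shell
# vLLM backend: latest versions, to benefit from ongoing vLLM performance work.
pip install -U transformers tokenizers deepspeed accelerate vllm
```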

The key difference is that the HF backend uses pinned dependency versions for reproducibility, while the vLLM backend uses latest versions to benefit from ongoing vLLM performance improvements.

Usage

Choose the HF backend environment for single-GPU sequential inference with reproducible dependency versions. Choose the vLLM backend environment for multi-GPU batched inference with higher throughput.

Theoretical Basis

The two environments represent a trade-off between reproducibility (pinned versions) and performance (latest optimizations). The vLLM backend requires additional environment variables because:

  • NCCL_IGNORE_DISABLED_P2P=1: Works around peer-to-peer communication issues on certain GPU topologies
  • RAY_memory_monitor_refresh_ms=0: Disables the memory monitor in Ray (vLLM's distributed runtime) so it does not kill inference workers under memory pressure
  • CUDA_LAUNCH_BLOCKING=1: Ensures synchronous CUDA execution for deterministic error reporting
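In a launch script, the three variables above would be exported before starting the vLLM process, for example:

```shell
# Set before launching the vLLM generation process.
export NCCL_IGNORE_DISABLED_P2P=1        # tolerate GPU topologies with P2P disabled
export RAY_memory_monitor_refresh_ms=0   # disable Ray's memory monitor
export CUDA_LAUNCH_BLOCKING=1            # synchronous launches: errors surface at the faulting call
```

Note that CUDA_LAUNCH_BLOCKING=1 serializes kernel launches and costs throughput, so it is typically worth unsetting once a failure has been diagnosed.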

Related Pages

Implemented By
