# Heuristic: OpenBMB UltraFeedback GPU Memory Utilization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure, Deep_Learning |
| Last Updated | 2026-02-08 06:00 GMT |
## Overview
Memory optimization techniques using `gpu_memory_utilization=0.95`, `device_map="auto"`, and `swap_space=1` to maximize GPU throughput for LLM inference.
## Description
The UltraFeedback completion-generation pipelines use two complementary memory management strategies. For the HuggingFace backend, `device_map="auto"` automatically distributes model layers across all available GPUs and overflows the remainder to CPU RAM. For the vLLM backend, `gpu_memory_utilization=0.95` lets vLLM claim 95% of GPU memory for model weights, activations, and the KV cache, leaving only 5% as headroom. Additionally, `swap_space=1` reserves 1 GiB of CPU memory per GPU as swap space for KV-cache blocks that vLLM preempts under memory pressure. These aggressive settings maximize batch size and throughput at the cost of a reduced stability margin.
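To make these fractions concrete, the split implied by a utilization setting can be computed directly. The helper and the 40 GB figure below are illustrative, not part of the pipeline:

```python
def vllm_memory_split(total_vram_gb: float, gpu_memory_utilization: float) -> tuple[float, float]:
    """Return (GB reserved for vLLM, GB left as headroom) for a utilization fraction."""
    reserved = total_vram_gb * gpu_memory_utilization
    headroom = total_vram_gb - reserved
    return reserved, headroom

# On a 40 GB A100, gpu_memory_utilization=0.95 leaves only 2 GB free:
reserved, headroom = vllm_memory_split(40.0, 0.95)
print(f"reserved={reserved:.1f} GB, headroom={headroom:.1f} GB")
```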
## Usage
Use this heuristic when running large-model inference and you need to maximize throughput. The 0.95 utilization setting is appropriate for dedicated inference servers where no other GPU workloads are running. Reduce it to 0.7-0.8 for shared GPU environments or when encountering OOM errors.
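One way to encode the "reduce on OOM" advice is a simple fallback ladder. The helper, the candidate values, and the simulated loader below are illustrative sketches, not UltraFeedback code:

```python
def load_with_fallback(load_fn, utilizations=(0.95, 0.85, 0.70)):
    """Try progressively lower gpu_memory_utilization values until load_fn succeeds.

    load_fn is any callable taking a utilization fraction; here MemoryError
    stands in for a CUDA OOM raised when the setting is too aggressive.
    """
    last_err = None
    for util in utilizations:
        try:
            return load_fn(util), util
        except MemoryError as err:
            last_err = err  # remember the failure and step down
    raise RuntimeError("all utilization settings failed") from last_err

# Simulated loader: pretend anything above 0.8 OOMs on this shared box.
def fake_loader(util):
    if util > 0.8:
        raise MemoryError(f"CUDA OOM at utilization={util}")
    return f"model@{util}"

model, util = load_with_fallback(fake_loader)
print(model, util)  # model@0.7 0.7
```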
## The Insight (Rule of Thumb)
- Action: Set `gpu_memory_utilization=0.95` for vLLM and `device_map="auto"` for HuggingFace Transformers.
- Value: 0.95 (95% of GPU VRAM allocated to vLLM); `swap_space=1` (1GB CPU offload buffer).
- Trade-off: Higher utilization means larger batch sizes and better throughput, but the remaining 5% (~2 GB on a 40 GB GPU) leaves little room for memory spikes, so OOM errors become more likely during long sequences or large batches.
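The batch-size effect follows from KV-cache arithmetic: each cached token costs 2 (K and V) × layers × kv_heads × head_dim × dtype_bytes bytes, so the memory budget directly bounds how many tokens can be in flight. The model shape and fp16 sizes below are illustrative assumptions, not UltraFeedback code:

```python
def kv_cache_tokens(budget_gb: float, num_layers: int, num_kv_heads: int,
                    head_dim: int, dtype_bytes: int = 2) -> int:
    """Tokens that fit in budget_gb of KV cache (2x for the K and V tensors)."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(budget_gb * 1024**3 // bytes_per_token)

# Illustrative 13B-class shape (40 layers, 40 KV heads, head_dim 128, fp16).
# The weights (~26 GB in fp16) come out of the same reserved pool, so on a
# 40 GB GPU the cache budget is roughly 38 - 26 = 12 GB at 0.95 utilization
# versus 36 - 26 = 10 GB at 0.90 -- a ~20% difference in concurrent tokens.
print(kv_cache_tokens(12.0, 40, 40, 128))  # 15728
print(kv_cache_tokens(10.0, 40, 40, 128))  # 13107
```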
## Reasoning
vLLM uses PagedAttention to manage GPU memory as virtual-memory-style pages, which makes high utilization safe in practice. The 0.95 setting is near the practical maximum; vLLM's own default is 0.90. The UltraFeedback project pushes it to 0.95 because completion generation uses fixed-length outputs (`max_tokens=1024`) with predictable memory patterns. In the HuggingFace backend, `device_map="auto"` leverages the `accelerate` library to automatically shard large models across GPUs, making it possible to load 30B-65B parameter models on multi-GPU setups without manual layer assignment.
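Conceptually, `accelerate`'s automatic placement assigns layers to devices in order, spilling to the next GPU and finally to CPU as each fills up. A toy sketch of that idea (the device names, sizes, and helper are invented for illustration; the real algorithm also accounts for tied weights and non-splittable modules):

```python
def greedy_device_map(layer_sizes_gb: dict, device_capacity_gb: dict) -> dict:
    """Assign layers to devices in order; spill to 'cpu' when all GPUs are full."""
    device_map, free = {}, dict(device_capacity_gb)
    devices = list(device_capacity_gb)
    idx = 0
    for layer, size in layer_sizes_gb.items():
        while idx < len(devices) and free[devices[idx]] < size:
            idx += 1  # this GPU is full; move on to the next one
        if idx < len(devices):
            device_map[layer] = devices[idx]
            free[devices[idx]] -= size
        else:
            device_map[layer] = "cpu"  # overflow to CPU RAM
    return device_map

# Four 2 GB layers across two 3 GB GPUs: one layer fits per GPU, the rest spill.
layers = {f"model.layers.{i}": 2.0 for i in range(4)}
print(greedy_device_map(layers, {"cuda:0": 3.0, "cuda:1": 3.0}))
```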
## Code Evidence
vLLM memory configuration from `main_vllm.py:91-92`:
```python
gpu_memory_utilization = 0.95
model = LLM(
    ckpt,
    gpu_memory_utilization=gpu_memory_utilization,
    swap_space=1,
    tensor_parallel_size=torch.cuda.device_count(),
    trust_remote_code=True,
    dtype=dtype,
)
```
HuggingFace `device_map="auto"` from `main.py:142-148`:
```python
if model_type == "starchat":
    generator = pipeline(
        "text-generation", model=ckpt, tokenizer=ckpt,
        torch_dtype=torch.bfloat16, device_map="auto",
    )
else:
    if model_type in ["mpt-30b-chat", "falcon-40b-instruct"]:
        generator = pipeline(
            model=ckpt, tokenizer=ckpt, device_map="auto",
            trust_remote_code=True,
        )
    else:
        model = LlamaForCausalLM.from_pretrained(ckpt, device_map="auto")
```