# Heuristic: OpenBMB UltraFeedback GPU Memory Utilization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure, Deep_Learning |
| Last Updated | 2026-02-08 06:00 GMT |
## Overview
Memory optimization techniques using `gpu_memory_utilization=0.95`, `device_map="auto"`, and `swap_space=1` to maximize GPU throughput for LLM inference.
## Description
The UltraFeedback completion-generation pipelines use two complementary memory management strategies. For the HuggingFace backend, `device_map="auto"` automatically distributes model layers across all available GPUs and overflows the remainder to CPU RAM. For the vLLM backend, `gpu_memory_utilization=0.95` lets vLLM claim 95% of GPU memory for model weights, activations, and the KV cache, leaving only 5% as headroom. Additionally, `swap_space=1` reserves 1 GiB of CPU memory per GPU as swap space for KV-cache blocks that vLLM preempts under memory pressure. These aggressive settings maximize batch size and throughput at the cost of a reduced stability margin.
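To make these fractions concrete, the split implied by a utilization setting can be computed directly. The helper and the 40 GB figure below are illustrative, not part of the pipeline:

```python
def vllm_memory_split(total_vram_gb: float, gpu_memory_utilization: float) -> tuple[float, float]:
    """Return (GB reserved for vLLM, GB left as headroom) for a utilization fraction."""
    reserved = total_vram_gb * gpu_memory_utilization
    headroom = total_vram_gb - reserved
    return reserved, headroom

# On a 40 GB A100, gpu_memory_utilization=0.95 leaves only 2 GB free:
reserved, headroom = vllm_memory_split(40.0, 0.95)
print(f"reserved={reserved:.1f} GB, headroom={headroom:.1f} GB")
```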
## Usage
Use this heuristic when running large-model inference and you need to maximize throughput. The 0.95 utilization setting is appropriate for dedicated inference servers where no other GPU workloads are running. Reduce it to 0.7-0.8 for shared GPU environments or when encountering OOM errors.
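One way to encode the "reduce on OOM" advice is a simple fallback ladder. The helper, the candidate values, and the simulated loader below are illustrative sketches, not UltraFeedback code:

```python
def load_with_fallback(load_fn, utilizations=(0.95, 0.85, 0.70)):
    """Try progressively lower gpu_memory_utilization values until load_fn succeeds.

    load_fn is any callable taking a utilization fraction; here MemoryError
    stands in for a CUDA OOM raised when the setting is too aggressive.
    """
    last_err = None
    for util in utilizations:
        try:
            return load_fn(util), util
        except MemoryError as err:
            last_err = err  # remember the failure and step down
    raise RuntimeError("all utilization settings failed") from last_err

# Simulated loader: pretend anything above 0.8 OOMs on this shared box.
def fake_loader(util):
    if util > 0.8:
        raise MemoryError(f"CUDA OOM at utilization={util}")
    return f"model@{util}"

model, util = load_with_fallback(fake_loader)
print(model, util)  # model@0.7 0.7
```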
## The Insight (Rule of Thumb)
- Action: Set `gpu_memory_utilization=0.95` for vLLM and `device_map="auto"` for HuggingFace Transformers.
- Value: 0.95 (95% of GPU VRAM allocated to vLLM); `swap_space=1` (1GB CPU offload buffer).
- Trade-off: Higher utilization means larger batch sizes and better throughput, but the remaining 5% (~2 GB on a 40 GB GPU) leaves little room for memory spikes, so OOM errors become more likely during long sequences or large batches.
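The batch-size effect follows from KV-cache arithmetic: each cached token costs 2 (K and V) × layers × kv_heads × head_dim × dtype_bytes bytes, so the memory budget directly bounds how many tokens can be in flight. The model shape and fp16 sizes below are illustrative assumptions, not UltraFeedback code:

```python
def kv_cache_tokens(budget_gb: float, num_layers: int, num_kv_heads: int,
                    head_dim: int, dtype_bytes: int = 2) -> int:
    """Tokens that fit in budget_gb of KV cache (2x for the K and V tensors)."""
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return int(budget_gb * 1024**3 // bytes_per_token)

# Illustrative 13B-class shape (40 layers, 40 KV heads, head_dim 128, fp16).
# The weights (~26 GB in fp16) come out of the same reserved pool, so on a
# 40 GB GPU the cache budget is roughly 38 - 26 = 12 GB at 0.95 utilization
# versus 36 - 26 = 10 GB at 0.90 -- a ~20% difference in concurrent tokens.
print(kv_cache_tokens(12.0, 40, 40, 128))  # 15728
print(kv_cache_tokens(10.0, 40, 40, 128))  # 13107
```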
## Reasoning
vLLM uses PagedAttention to manage GPU memory as virtual-memory-style pages, which makes high utilization safe in practice. The 0.95 setting is near the practical maximum; vLLM's own default is 0.90. The UltraFeedback project pushes it to 0.95 because completion generation uses fixed-length outputs (`max_tokens=1024`) with predictable memory patterns. In the HuggingFace backend, `device_map="auto"` leverages the `accelerate` library to automatically shard large models across GPUs, making it possible to load 30B-65B parameter models on multi-GPU setups without manual layer assignment.
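Conceptually, `accelerate`'s automatic placement assigns layers to devices in order, spilling to the next GPU and finally to CPU as each fills up. A toy sketch of that idea (the device names, sizes, and helper are invented for illustration; the real algorithm also accounts for tied weights and non-splittable modules):

```python
def greedy_device_map(layer_sizes_gb: dict, device_capacity_gb: dict) -> dict:
    """Assign layers to devices in order; spill to 'cpu' when all GPUs are full."""
    device_map, free = {}, dict(device_capacity_gb)
    devices = list(device_capacity_gb)
    idx = 0
    for layer, size in layer_sizes_gb.items():
        while idx < len(devices) and free[devices[idx]] < size:
            idx += 1  # this GPU is full; move on to the next one
        if idx < len(devices):
            device_map[layer] = devices[idx]
            free[devices[idx]] -= size
        else:
            device_map[layer] = "cpu"  # overflow to CPU RAM
    return device_map

# Four 2 GB layers across two 3 GB GPUs: one layer fits per GPU, the rest spill.
layers = {f"model.layers.{i}": 2.0 for i in range(4)}
print(greedy_device_map(layers, {"cuda:0": 3.0, "cuda:1": 3.0}))
```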
## Code Evidence
vLLM memory configuration from `main_vllm.py:91-92`:
```python
gpu_memory_utilization = 0.95
model = LLM(
    ckpt,
    gpu_memory_utilization=gpu_memory_utilization,
    swap_space=1,
    tensor_parallel_size=torch.cuda.device_count(),
    trust_remote_code=True,
    dtype=dtype,
)
```
HuggingFace `device_map="auto"` from `main.py:142-148`:
```python
if model_type == "starchat":
    generator = pipeline(
        "text-generation", model=ckpt, tokenizer=ckpt,
        torch_dtype=torch.bfloat16, device_map="auto",
    )
else:
    if model_type in ["mpt-30b-chat", "falcon-40b-instruct"]:
        generator = pipeline(
            model=ckpt, tokenizer=ckpt, device_map="auto",
            trust_remote_code=True,
        )
    else:
        model = LlamaForCausalLM.from_pretrained(ckpt, device_map="auto")
```