Implementation:NVIDIA DALI GPU Affinity
| Knowledge Sources | |
|---|---|
| Domains | Vision, Training |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Provides GPU-to-CPU affinity management utilities for optimizing multi-GPU deep learning training performance on NVIDIA DGX systems.
Description
This module implements CPU affinity management for multi-GPU training workloads. It queries NVML (NVIDIA Management Library) to discover which CPU cores are attached to each GPU, then restricts each training process to cores that match the physical CPU-GPU connectivity. This matters for achieving high and stable performance on multi-socket systems such as NVIDIA DGX A100 and DGX-1.
The module provides six affinity modes via the AffinityMode enum:
- none: no affinity is set.
- socket: all cores from the CPU socket connected to the GPU.
- socket_single: the first core from the connected socket; several GPUs may end up sharing the same core.
- socket_single_unique: a single core per GPU, unique to that GPU.
- socket_unique_interleaved: a unique subset of the socket's cores per GPU, assigned in interleaved order.
- socket_unique_contiguous: a unique subset of the socket's cores per GPU, assigned as a contiguous block (the recommended default).
The implementation handles hyperthreading siblings by grouping cores through the Linux sysfs topology interface at /sys/devices/system/cpu/.
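The difference between the interleaved and contiguous assignments can be illustrated with a minimal sketch. The function name `split_socket_cores`, its parameters, and the short mode labels are hypothetical and chosen for illustration; the sketch assumes the GPUs sharing a socket each receive an equal share of that socket's cores, and it is not the module's actual code.

```python
def split_socket_cores(socket_cores, num_gpus_on_socket, slot, mode):
    """Return the cores for the GPU at position `slot` on a shared socket.

    socket_cores: ordered list of core ids attached to the socket.
    num_gpus_on_socket: how many GPUs share those cores.
    slot: 0-based position of this GPU among the GPUs on the socket.
    mode: "contiguous" or "interleaved" (hypothetical labels for this sketch).
    """
    if not 0 <= slot < num_gpus_on_socket:
        raise ValueError("slot must be in [0, num_gpus_on_socket)")
    # Equal share per GPU; leftover cores are dropped so every GPU
    # receives the same number of cores.
    per_gpu = len(socket_cores) // num_gpus_on_socket
    usable = socket_cores[: per_gpu * num_gpus_on_socket]
    if mode == "contiguous":
        # One consecutive block of cores per GPU.
        return usable[slot * per_gpu:(slot + 1) * per_gpu]
    if mode == "interleaved":
        # Every num_gpus_on_socket-th core, starting at this GPU's slot.
        return usable[slot::num_gpus_on_socket]
    raise ValueError(f"unknown mode: {mode}")


cores = list(range(8))
print(split_socket_cores(cores, 2, 0, "contiguous"))   # [0, 1, 2, 3]
print(split_socket_cores(cores, 2, 1, "contiguous"))   # [4, 5, 6, 7]
print(split_socket_cores(cores, 2, 0, "interleaved"))  # [0, 2, 4, 6]
print(split_socket_cores(cores, 2, 1, "interleaved"))  # [1, 3, 5, 7]
```

In both cases no core is shared between GPUs; only the ordering of the assignment differs.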
The main entry point is the set_affinity function, which accepts a GPU index, process count, affinity mode, core selection mode (all_logical or single_logical), and a balanced flag. The Device helper class wraps NVML calls for querying a GPU's name, UUID, and CPU affinity bitmask. The module is designed for the multi-process single-device training pattern used by torch.nn.parallel.DistributedDataParallel.
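To make the idea of a CPU affinity bitmask concrete, here is a small sketch of decoding one into core indices. It assumes the mask arrives as a sequence of 64-bit integer words in which bit i of word w stands for CPU w * 64 + i; the helper name `decode_affinity_words` is hypothetical, and no NVML call is made, so the snippet runs without a GPU.

```python
def decode_affinity_words(words, bits_per_word=64):
    """Turn a sequence of bitmask words into an ascending list of CPU indices.

    Assumption for this sketch: word 0 holds the lowest-numbered CPUs and
    bit 0 of each word is its lowest-numbered CPU.
    """
    cores = []
    for word_index, word in enumerate(words):
        for bit in range(bits_per_word):
            if (word >> bit) & 1:
                cores.append(word_index * bits_per_word + bit)
    return cores


# Cores 0-3 are set in the first word, core 64 in the second.
print(decode_affinity_words([0b1111, 0b1]))  # [0, 1, 2, 3, 64]
```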
Usage
Use this module at the start of each training process in a multi-GPU distributed training setup to pin the process to the CPU cores physically connected to its assigned GPU. This is particularly important on DGX A100 where only half the CPU cores have direct GPU access. Call set_affinity with the local GPU rank before initializing the training workload.
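At the operating-system level, pinning a process on Linux comes down to a scheduler affinity call. The sketch below shows that step in isolation using Python's standard `os.sched_setaffinity` and `os.sched_getaffinity`. The wrapper name `pin_current_process` is hypothetical, and whether the module uses exactly this call is an assumption here.

```python
import os


def pin_current_process(cores):
    """Restrict the calling process to `cores` and return the resulting set.

    Returns None on platforms that do not expose the Linux affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        # Non-Linux platforms: leave affinity untouched.
        return None
    os.sched_setaffinity(0, cores)  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    current = os.sched_getaffinity(0)
    # Re-applying the current mask changes nothing but exercises the call.
    print(pin_current_process(current) == current)
```

Threads and worker processes started afterwards inherit the mask, which is one reason to apply affinity before the training workload is initialized.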
Code Reference
Source Location
- Repository: NVIDIA_DALI
- File: docs/examples/use_cases/pytorch/efficientnet/image_classification/gpu_affinity.py
- Lines: 1-417
Signature
class AffinityMode(Enum):
    none = auto()
    socket = auto()
    socket_single = auto()
    socket_single_unique = auto()
    socket_unique_interleaved = auto()
    socket_unique_contiguous = auto()

class Device:
    def __init__(self, device_idx): ...
    def get_name(self): ...
    def get_uuid(self): ...
    def get_cpu_affinity(self): ...

def set_affinity(gpu_id, nproc_per_node=None, *,
                 mode=AffinityMode.socket_unique_contiguous,
                 cores="all_logical", balanced=True): ...

def get_socket_affinities(nproc_per_node, exclude_unavailable_cores=True): ...

def set_socket_affinity(gpu_id, nproc_per_node, cores): ...

def set_socket_unique_affinity(gpu_id, nproc_per_node, cores, mode, balanced=True): ...
Import
import gpu_affinity
I/O Contract
Inputs (set_affinity)
| Name | Type | Required | Description |
|---|---|---|---|
| gpu_id | int | Yes | Integer index of the GPU (0 to nproc_per_node - 1). |
| nproc_per_node | int | No | Number of training processes per node. Default: None, in which case the count is auto-detected via NVML. |
| mode | str or AffinityMode | No | Affinity mode to use. Default: socket_unique_contiguous. |
| cores | str | No | Core selection: "all_logical" (includes hyperthreading) or "single_logical". Default: "all_logical". |
| balanced | bool | No | Whether to assign an equal number of physical cores to each process. Default: True. |
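The cores option can be pictured with a short sketch of sibling grouping. It assumes sibling lists use the Linux cpulist text format (for example "0,64" or "0-3,8"), as found in files such as /sys/devices/system/cpu/cpu0/topology/thread_siblings_list, and that "single_logical" means keeping one logical core per physical core. The helper names `parse_cpu_list` and `select_cores` are hypothetical, and the sample strings are made up rather than read from sysfs.

```python
def parse_cpu_list(text):
    """Parse a Linux cpulist string such as '0,64' or '0-3,8' into a sorted list."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def select_cores(sibling_groups, cores):
    """Flatten sibling groups according to the `cores` option."""
    if cores == "all_logical":
        # Keep every hyperthread of every physical core.
        return [c for group in sibling_groups for c in group]
    if cores == "single_logical":
        # Keep only the first hyperthread of every physical core.
        return [group[0] for group in sibling_groups]
    raise ValueError(f"unknown cores option: {cores}")


groups = [parse_cpu_list(s) for s in ("0,64", "1,65")]
print(select_cores(groups, "all_logical"))     # [0, 64, 1, 65]
print(select_cores(groups, "single_logical"))  # [0, 1]
```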
Outputs (set_affinity)
| Name | Type | Description |
|---|---|---|
| affinity | set[int] | Set of logical CPU core indices on which the process is eligible to run after affinity is applied. |
Usage Examples
Setting affinity in distributed training
import argparse
import os

import gpu_affinity
import torch


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--local_rank', type=int,
        default=os.getenv('LOCAL_RANK', 0),
    )
    args = parser.parse_args()
    nproc_per_node = torch.cuda.device_count()

    # Pin this process before any training work starts.
    affinity = gpu_affinity.set_affinity(args.local_rank, nproc_per_node)
    print(f'{args.local_rank}: core affinity: {affinity}')


if __name__ == '__main__':
    main()

# Launch with:
# python -m torch.distributed.launch --nproc_per_node <#GPUs> example.py
Using a specific affinity mode
import gpu_affinity

# Use socket-level affinity (all cores from connected socket)
affinity = gpu_affinity.set_affinity(
    gpu_id=0,
    nproc_per_node=8,
    mode="socket",
    cores="single_logical"
)