Implementation:NVIDIA DALI GPU Affinity

Knowledge Sources	NVIDIA_DALI
Domains	Vision, Training
Last Updated	2026-02-08 16:00 GMT

Overview

Provides GPU-to-CPU affinity management utilities for optimizing multi-GPU deep learning training performance on NVIDIA DGX systems.

Description

This module manages CPU affinity for multi-GPU training workloads. It queries NVML (the NVIDIA Management Library) to discover which CPU socket each GPU is attached to, then pins each training process to cores on that socket so that process placement matches the physical CPU-GPU connectivity. This matters for both peak performance and run-to-run stability on multi-socket systems such as NVIDIA DGX A100 and DGX-1.

The module provides six affinity modes via the AffinityMode enum: none (no affinity setting), socket (all cores from the connected CPU socket), socket_single (the first core from the connected socket; GPUs on the same socket may be assigned the same core), socket_single_unique (a single core per GPU, not shared with any other GPU), socket_unique_interleaved (a unique subset of cores, assigned in an interleaved pattern), and socket_unique_contiguous (a unique subset of cores, assigned as a contiguous block; this is the default). The implementation handles hyperthreading siblings by grouping logical cores through the Linux sysfs topology interface at /sys/devices/system/cpu/.
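The difference between the two socket_unique_* modes is easiest to see on a toy core list. The following is a minimal sketch, not the module's code: split_cores is a hypothetical helper, and it assumes that every GPU attached to a socket receives an equal share of that socket's physical cores.

```python
def split_cores(socket_cores, gpus_on_socket, slot, pattern):
    """Return the share of `socket_cores` for the GPU in position `slot`.

    Hypothetical helper for illustration only; not part of gpu_affinity.
    """
    per_gpu = len(socket_cores) // gpus_on_socket
    # Drop any remainder so that every GPU gets the same number of cores.
    usable = socket_cores[: per_gpu * gpus_on_socket]
    if pattern == "contiguous":
        return usable[slot * per_gpu : (slot + 1) * per_gpu]
    if pattern == "interleaved":
        return usable[slot::gpus_on_socket]
    raise ValueError(f"unknown pattern: {pattern!r}")


cores = list(range(8))  # one socket with physical cores 0..7, shared by 2 GPUs
for pattern in ("contiguous", "interleaved"):
    for slot in range(2):
        print(pattern, slot, split_cores(cores, 2, slot, pattern))
```

Contiguous assignment keeps each process on neighbouring core indices, while interleaved assignment strides across the whole socket.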

The main entry point is the set_affinity function, which accepts a GPU index, process count, affinity mode, core selection mode (all_logical or single_logical), and a balanced flag. Helper classes like Device wrap NVML calls for querying GPU CPU affinity bitmasks. The module is designed for the multi-process single-device training pattern used by torch.nn.parallel.DistributedDataParallel.
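To illustrate what an affinity bitmask looks like once it has been read from NVML, the sketch below decodes a list of integer words into core indices. It is an assumption-laden illustration rather than the Device class itself: decode_affinity_words is a hypothetical name, and the code assumes 64-bit words in which word 0 covers cores 0 to 63 and bit 0 is the lowest-numbered core.

```python
def decode_affinity_words(words, bits_per_word=64):
    """Turn a CPU affinity bitmask, given as a list of integer words, into core indices.

    Assumption: word 0 covers cores 0..63 and bit 0 is the lowest-numbered core.
    """
    cores = []
    for word_idx, word in enumerate(words):
        for bit in range(bits_per_word):
            if (word >> bit) & 1:
                cores.append(word_idx * bits_per_word + bit)
    return cores


# Example mask: cores 0, 1 and 3 in the first word, core 64 in the second.
print(decode_affinity_words([0b1011, 0b1]))  # [0, 1, 3, 64]
```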

Usage

Use this module at the start of each training process in a multi-GPU distributed training setup to pin the process to the CPU cores physically connected to its assigned GPU. This is particularly important on DGX A100, where only half of the CPU cores have direct access to GPUs. Call set_affinity with the local GPU rank before initializing the training workload.
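For readers unfamiliar with process pinning, the sketch below shows the underlying operating-system operation using only the Python standard library. It assumes a Linux host, where os.sched_setaffinity and os.sched_getaffinity are available; pin_current_process is a hypothetical helper and is not taken from the module.

```python
import os


def pin_current_process(cores):
    """Restrict the calling process to `cores` and return the resulting affinity set."""
    os.sched_setaffinity(0, cores)  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):  # these calls exist only on some platforms (e.g. Linux)
    original = os.sched_getaffinity(0)
    first_core = min(original)
    print(pin_current_process({first_core}))
    os.sched_setaffinity(0, original)  # restore the previous affinity
```

Pinning before the training workload starts means that threads and worker processes created afterwards inherit the same affinity mask.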

Code Reference

Source Location

Repository: NVIDIA_DALI
File: docs/examples/use_cases/pytorch/efficientnet/image_classification/gpu_affinity.py
Lines: 1-417

Signature

class AffinityMode(Enum):
    none = auto()
    socket = auto()
    socket_single = auto()
    socket_single_unique = auto()
    socket_unique_interleaved = auto()
    socket_unique_contiguous = auto()

class Device:
    def __init__(self, device_idx): ...
    def get_name(self): ...
    def get_uuid(self): ...
    def get_cpu_affinity(self): ...

def set_affinity(gpu_id, nproc_per_node=None, *,
                 mode=AffinityMode.socket_unique_contiguous,
                 cores="all_logical", balanced=True): ...

def get_socket_affinities(nproc_per_node, exclude_unavailable_cores=True): ...
def set_socket_affinity(gpu_id, nproc_per_node, cores): ...
def set_socket_unique_affinity(gpu_id, nproc_per_node, cores, mode, balanced=True): ...

Import

import gpu_affinity

I/O Contract

Inputs (set_affinity)

Name	Type	Required	Description
gpu_id	int	Yes	Integer index of the GPU (0 to nproc_per_node - 1).
nproc_per_node	int	No	Number of training processes per node. Default: auto-detected via NVML.
mode	str or AffinityMode	No	Affinity mode to use. Default: socket_unique_contiguous.
cores	str	No	Core selection: "all_logical" (includes hyperthreading) or "single_logical". Default: "all_logical".
balanced	bool	No	Whether to assign an equal number of physical cores to each process. Default: True.
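The cores option can be pictured as a choice between keeping one logical core per physical core and adding each core's hyperthreading siblings. The sketch below is illustrative only: parse_cpu_list and select_logical_cores are hypothetical helpers, the sysfs contents are made up, and the code assumes sibling information arrives in the kernel's cpulist text format (for example "0,64" or "0-1").

```python
def parse_cpu_list(text):
    """Parse a kernel cpulist string such as "0,64" or "0-1" into sorted core indices."""
    cores = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return sorted(cores)


def select_logical_cores(physical_cores, siblings, cores="all_logical"):
    """Hypothetical helper: optionally expand cores with their hyperthreading siblings."""
    if cores == "single_logical":
        return sorted(physical_cores)
    if cores == "all_logical":
        expanded = set(physical_cores)
        for core in physical_cores:
            expanded.update(siblings.get(core, []))
        return sorted(expanded)
    raise ValueError(f"unknown cores option: {cores!r}")


# Made-up sibling data for a 4-core CPU whose hyperthreads are offset by 4.
sysfs = {0: "0,4", 1: "1,5", 2: "2,6", 3: "3,7"}
siblings = {core: parse_cpu_list(text) for core, text in sysfs.items()}
print(select_logical_cores([0, 1], siblings, "all_logical"))     # [0, 1, 4, 5]
print(select_logical_cores([0, 1], siblings, "single_logical"))  # [0, 1]
```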

Outputs (set_affinity)

Name	Type	Description
affinity	set[int]	Set of logical CPU core indices on which the process is eligible to run after affinity is applied.

Usage Examples

Setting affinity in distributed training

import argparse
import os
import gpu_affinity
import torch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--local_rank', type=int,
        default=os.getenv('LOCAL_RANK', 0),
    )
    args = parser.parse_args()

    nproc_per_node = torch.cuda.device_count()
    affinity = gpu_affinity.set_affinity(args.local_rank, nproc_per_node)
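    print(f'{args.local_rank}: core affinity: {affinity}')

if __name__ == "__main__":
    main()

# Launch with:
# python -m torch.distributed.launch --nproc_per_node <#GPUs> example.py
# Newer PyTorch releases provide torchrun as the replacement launcher; it sets
# the LOCAL_RANK environment variable, which the default above picks up:
# torchrun --nproc_per_node <#GPUs> example.py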

Using a specific affinity mode

import gpu_affinity

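# Use socket-level affinity: all cores of the socket connected to GPU 0.
# cores="single_logical" restricts the selection to one logical core per
# physical core instead of including hyperthreading siblings.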
affinity = gpu_affinity.set_affinity(
    gpu_id=0,
    nproc_per_node=8,
    mode="socket",
    cores="single_logical"
)

Related Pages

Environment:NVIDIA_DALI_CUDA_GPU_Environment
