Implementation:NVIDIA DALI GPU Affinity

From Leeroopedia


Knowledge Sources
Domains: Vision, Training
Last Updated: 2026-02-08 16:00 GMT

Overview

Provides GPU-to-CPU affinity management utilities for optimizing multi-GPU deep learning training performance on NVIDIA DGX systems.

Description

This module implements CPU affinity management for multi-GPU training workloads. It queries NVML (the NVIDIA Management Library) to discover the hardware topology between GPUs and CPU sockets, then pins each training process to CPU cores that match the physical CPU-GPU connectivity. This is important for achieving optimal and stable performance on multi-socket systems such as NVIDIA DGX A100 and DGX-1.

The module provides six affinity modes via the AffinityMode enum: none (no affinity is set), socket (all cores from the CPU socket connected to the GPU), socket_single (the first core from the connected socket; several GPUs on the same socket may be assigned the same core), socket_single_unique (a single core per GPU, not shared with any other GPU), socket_unique_interleaved (a unique subset of the socket's cores per GPU, assigned in interleaved order), and socket_unique_contiguous (a unique subset of the socket's cores per GPU, assigned as a contiguous range; this is the recommended default). The implementation handles hyperthreading siblings by grouping logical cores through the Linux sysfs topology interface at /sys/devices/system/cpu/.
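The difference between the interleaved and contiguous unique-subset modes can be illustrated with a small sketch. This is not the module's code: the helper names split_contiguous and split_interleaved, the 8-core socket, and the 2 GPUs per socket are all hypothetical, and the sketch assumes the cores divide evenly among the GPUs.

```python
def split_contiguous(cores, n_parts, part):
    """Return chunk number `part` of `n_parts` equal, contiguous chunks."""
    chunk = len(cores) // n_parts
    return cores[part * chunk:(part + 1) * chunk]


def split_interleaved(cores, n_parts, part):
    """Return every `n_parts`-th core, starting at offset `part`."""
    chunk = len(cores) // n_parts
    return cores[part::n_parts][:chunk]


# Hypothetical topology: one socket with 8 cores shared by 2 GPUs.
socket_cores = list(range(8))
gpus_on_socket = 2

for gpu in range(gpus_on_socket):
    contiguous = split_contiguous(socket_cores, gpus_on_socket, gpu)
    interleaved = split_interleaved(socket_cores, gpus_on_socket, gpu)
    print(f"GPU {gpu}: contiguous={contiguous} interleaved={interleaved}")
# GPU 0: contiguous=[0, 1, 2, 3] interleaved=[0, 2, 4, 6]
# GPU 1: contiguous=[4, 5, 6, 7] interleaved=[1, 3, 5, 7]
```

In both cases each GPU receives a disjoint set of cores from its own socket; only the order in which cores are handed out differs.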

The main entry point is the set_affinity function, which accepts a GPU index, a process count, an affinity mode, a core selection mode (all_logical or single_logical), and a balanced flag. The Device helper class wraps NVML calls for querying a GPU's name, UUID, and CPU affinity bitmask. The module is designed for the multi-process, single-device training pattern used by torch.nn.parallel.DistributedDataParallel.
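NVML reports a GPU's CPU affinity (via nvmlDeviceGetCpuAffinity) as an array of machine words in which each set bit marks one eligible CPU. How Device.get_cpu_affinity decodes that array is not shown on this page; the following is a minimal sketch of one way to turn such a bitmask into core indices. The function name decode_affinity_words is hypothetical, and the sketch assumes 64-bit words with the least significant bit of the first word standing for CPU 0.

```python
def decode_affinity_words(words, bits_per_word=64):
    """Convert a list of bitmask words into a sorted list of CPU indices.

    Assumption: bit `b` of word `w` corresponds to CPU `w * bits_per_word + b`.
    """
    cores = []
    for word_idx, word in enumerate(words):
        for bit in range(bits_per_word):
            if (word >> bit) & 1:
                cores.append(word_idx * bits_per_word + bit)
    return cores


# Example with made-up mask values: CPUs 0 and 2 in the first word,
# CPU 64 (bit 0 of the second word).
print(decode_affinity_words([0b101, 0b1]))  # [0, 2, 64]
```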

Usage

Use this module at the start of each training process in a multi-GPU distributed training setup to pin the process to the CPU cores physically connected to its assigned GPU. This is particularly important on DGX A100 where only half the CPU cores have direct GPU access. Call set_affinity with the local GPU rank before initializing the training workload.
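To show what pinning a process means at the operating-system level, here is a minimal sketch using Python's standard os.sched_setaffinity and os.sched_getaffinity, which are available on Linux. It is an illustration only, not the body of set_affinity: the helper name pin_current_process is hypothetical, and the choice of core is arbitrary rather than derived from GPU topology.

```python
import os


def pin_current_process(cores):
    """Restrict the calling process to `cores` and return the resulting set."""
    os.sched_setaffinity(0, cores)  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)


# These calls exist only on platforms that support them (e.g. Linux).
if hasattr(os, "sched_setaffinity"):
    original = os.sched_getaffinity(0)
    # Arbitrary demo choice: the lowest-numbered core currently allowed.
    pinned = pin_current_process({min(original)})
    print(f"pinned to: {pinned}")
    # Restore the original affinity so the demo has no lasting effect.
    os.sched_setaffinity(0, original)
```

Threads and child processes created after the call inherit the restricted affinity, which is presumably why the page recommends setting affinity before the training workload is initialized.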

Code Reference

Source Location

Signature

class AffinityMode(Enum):
    none = auto()
    socket = auto()
    socket_single = auto()
    socket_single_unique = auto()
    socket_unique_interleaved = auto()
    socket_unique_contiguous = auto()

class Device:
    def __init__(self, device_idx): ...
    def get_name(self): ...
    def get_uuid(self): ...
    def get_cpu_affinity(self): ...

def set_affinity(gpu_id, nproc_per_node=None, *,
                 mode=AffinityMode.socket_unique_contiguous,
                 cores="all_logical", balanced=True): ...

def get_socket_affinities(nproc_per_node, exclude_unavailable_cores=True): ...
def set_socket_affinity(gpu_id, nproc_per_node, cores): ...
def set_socket_unique_affinity(gpu_id, nproc_per_node, cores, mode, balanced=True): ...

Import

import gpu_affinity

I/O Contract

Inputs (set_affinity)

Name | Type | Required | Description
gpu_id | int | Yes | Integer index of the GPU (0 to nproc_per_node - 1).
nproc_per_node | int | No | Number of training processes per node. Default: auto-detected via NVML.
mode | str or AffinityMode | No | Affinity mode to use. Default: socket_unique_contiguous.
cores | str | No | Core selection: "all_logical" (includes hyperthreading) or "single_logical". Default: "all_logical".
balanced | bool | No | Whether to assign an equal number of physical cores to each process. Default: True.
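The effect of the cores option can be sketched in terms of hyperthreading sibling groups. On Linux, each CPU's siblings are listed in /sys/devices/system/cpu/cpuN/topology/thread_siblings_list in a format such as "0,64" or "0-1". The helpers below, parse_cpu_list and select_cores, are hypothetical illustrations and not the module's functions; the sketch assumes that "single_logical" keeps one logical core per physical core and "all_logical" keeps every sibling.

```python
def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,64' or '0-3' into sorted ints."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)


def select_cores(sibling_groups, cores="all_logical"):
    """Flatten sibling groups according to the core selection mode."""
    if cores == "all_logical":
        return [cpu for group in sibling_groups for cpu in group]
    if cores == "single_logical":
        # Assumption: keep the first logical core of each physical core.
        return [group[0] for group in sibling_groups]
    raise ValueError(f"unknown core selection mode: {cores!r}")


# Made-up sibling lists for two physical cores with two threads each.
groups = [parse_cpu_list("0,64"), parse_cpu_list("1,65")]
print(select_cores(groups, "all_logical"))     # [0, 64, 1, 65]
print(select_cores(groups, "single_logical"))  # [0, 1]
```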

Outputs (set_affinity)

Name | Type | Description
affinity | set[int] | Set of logical CPU core indices on which the process is eligible to run after affinity is applied.

Usage Examples

Setting affinity in distributed training

import argparse
import os
import gpu_affinity
import torch

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--local_rank', type=int,
        default=os.getenv('LOCAL_RANK', 0),
    )
    args = parser.parse_args()

    nproc_per_node = torch.cuda.device_count()
    affinity = gpu_affinity.set_affinity(args.local_rank, nproc_per_node)
    print(f'{args.local_rank}: core affinity: {affinity}')

# Run main() only when the file is executed as a script.
if __name__ == '__main__':
    main()

# Launch with:
# python -m torch.distributed.launch --nproc_per_node <#GPUs> example.py

Using a specific affinity mode

import gpu_affinity

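# Use socket-level affinity: pin to the cores of the socket connected to GPU 0.
# cores="single_logical" selects the non-default core mode
# (the default, "all_logical", includes hyperthreading siblings).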
affinity = gpu_affinity.set_affinity(
    gpu_id=0,
    nproc_per_node=8,
    mode="socket",
    cores="single_logical"
)
