Heuristic: NVIDIA DALI Thread Affinity Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
CPU thread count and affinity configuration for DALI pipelines to maximize CPU-GPU overlap and minimize context-switching overhead.
Description
DALI pipelines use a configurable number of CPU worker threads (`num_threads`) for CPU-bound operations like file reading and CPU image decoding. The `DALI_AFFINITY_MASK` environment variable pins these threads to specific CPU cores, reducing context-switching and improving cache locality. The nvJPEG hybrid decoder creates its own threads that are automatically affinity-bound regardless of this setting.
Usage
Use this heuristic when tuning CPU-GPU pipeline overlap or when profiling reveals CPU thread contention as a bottleneck. Particularly important on multi-socket NUMA systems where threads may migrate between sockets.
The Insight (Rule of Thumb)
- Action: Set `num_threads` in `@pipeline_def` to 4-8 for typical image training. Use `DALI_AFFINITY_MASK` to pin threads to specific CPU cores near the GPU's NUMA node.
- Value: The ResNet50 example defaults to `num_threads=4`; the video example uses `num_threads=2`. Setting `DALI_AFFINITY_MASK="3,5,6,10"` pins thread 0 to CPU 3, thread 1 to CPU 5, and so on.
- Trade-off: More threads = higher CPU parallelism for decoding but increased context-switching overhead. Too few threads = CPU becomes bottleneck for GPU pipeline.
- NUMA Awareness: If `DALI_AFFINITY_MASK` lists fewer CPUs than `num_threads`, the remaining threads fall back to `nvmlDeviceGetCpuAffinity` to auto-select cores near the GPU.
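The mask-to-thread mapping described above can be sketched in plain Python. This is a minimal illustration of the documented semantics, not DALI's implementation: `parse_affinity_mask` and `assign_threads` are hypothetical helper names, and `fallback_cpus` stands in for the cores the real library would obtain via `nvmlDeviceGetCpuAffinity`.

```python
def parse_affinity_mask(mask: str) -> list[int]:
    """Parse a DALI_AFFINITY_MASK-style string such as "3,5,6,10"."""
    return [int(cpu) for cpu in mask.split(",") if cpu.strip()]

def assign_threads(num_threads: int, mask: str,
                   fallback_cpus: list[int]) -> dict[int, int]:
    """Map worker-thread index -> CPU id.

    Threads beyond the mask length fall back to `fallback_cpus`,
    standing in for cores chosen via nvmlDeviceGetCpuAffinity.
    """
    cpus = parse_affinity_mask(mask)
    mapping = {}
    for tid in range(num_threads):
        if tid < len(cpus):
            mapping[tid] = cpus[tid]
        else:
            mapping[tid] = fallback_cpus[(tid - len(cpus)) % len(fallback_cpus)]
    return mapping

# Thread 0 -> CPU 3, thread 1 -> CPU 5, etc.; threads 4-5 use the fallback set.
print(assign_threads(6, "3,5,6,10", fallback_cpus=[0, 1]))
```

With `num_threads=6` and a four-entry mask, the first four threads follow the mask and the last two take the NUMA-aware fallback cores.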
Reasoning
DALI's pipelined execution model (`exec_pipelined=True, exec_async=True`) overlaps CPU and GPU work. CPU threads read and decode data while the GPU processes the previous batch. If CPU threads are slow (too few) or migrate between NUMA nodes (no affinity), the GPU starves for data. Pinning threads to cores near the GPU's PCIe connection minimizes memory access latency.
The default of 4 threads balances parallelism and overhead for typical ImageNet training. Video pipelines use fewer threads (2) because video decoding is primarily GPU-bound (NVDEC).
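Whether pinning took effect can be spot-checked from inside the training process. This is a Linux-only sketch using only the standard library; it reports the process-level CPU set, not per-thread DALI affinity, so it is a coarse sanity check rather than a precise probe.

```python
import os

# Linux-only: which CPUs is the current process allowed to run on?
# After DALI_AFFINITY_MASK is set and the pipeline is built, DALI's
# worker threads should be restricted to the listed cores.
allowed = os.sched_getaffinity(0)  # 0 = current process
print(sorted(allowed))
```

On a system with no affinity restrictions this prints all online CPUs; a restricted set suggests pinning (by DALI, `taskset`, or the job scheduler) is in force.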
Code Evidence
Thread configuration from `docs/examples/use_cases/pytorch/resnet50/main.py:65,314`:
```python
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')

pipe = create_dali_pipeline(
    num_threads=args.workers,  # CPU worker threads
    device_id=args.local_rank,
)
```
Video pipeline threads from `docs/examples/sequence_processing/video/video_label_example.py:74`:
```python
pipe = video_pipeline(
    num_threads=2,  # Fewer threads for GPU-bound video decoding
    device_id=0,
)
```
Affinity documentation from `docs/advanced_topics_performance_tuning.rst:8-37`:
```text
DALI_AFFINITY_MASK environment variable:
  "3,5,6,10" maps thread 0 -> CPU 3, thread 1 -> CPU 5, etc.
  If more threads than CPUs are specified, the remaining threads use
  nvmlDeviceGetCpuAffinity.
  nvJPEG hybrid decoder threads are always auto-affined.
```
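Putting the pieces together, the mask is set in the environment before launch. A hedged sketch: the CPU ids below are illustrative and should be replaced with cores on the GPU's NUMA node, and the commented launch line assumes the ResNet50 example's `main.py` and its `-j/--workers` flag shown above.

```shell
# Pick cores on the GPU's NUMA node (check topology with `nvidia-smi topo -m`);
# the ids below are illustrative.
export DALI_AFFINITY_MASK="3,5,6,10"

# Launch with a matching worker-thread count, e.g.:
#   python main.py -j 4 <imagenet-dir>
echo "$DALI_AFFINITY_MASK"
```

Keeping the number of mask entries equal to `num_threads` avoids relying on the NVML fallback for the overflow threads.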