Heuristic: NVIDIA DALI Thread Affinity Optimization
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
CPU thread count and affinity configuration for DALI pipelines to maximize CPU-GPU overlap and minimize context-switching overhead.
Description
DALI pipelines use a configurable number of CPU worker threads (`num_threads`) for CPU-bound operations like file reading and CPU image decoding. The `DALI_AFFINITY_MASK` environment variable pins these threads to specific CPU cores, reducing context-switching and improving cache locality. The nvJPEG hybrid decoder creates its own threads that are automatically affinity-bound regardless of this setting.
Usage
Use this heuristic when tuning CPU-GPU pipeline overlap or when profiling reveals CPU thread contention as a bottleneck. Particularly important on multi-socket NUMA systems where threads may migrate between sockets.
The Insight (Rule of Thumb)
- Action: Set `num_threads` in `@pipeline_def` to 4-8 for typical image training. Use `DALI_AFFINITY_MASK` to pin threads to specific CPU cores near the GPU's NUMA node.
- Value: The ResNet50 example defaults to `num_threads=4`; the video example uses `num_threads=2`. Setting `DALI_AFFINITY_MASK="3,5,6,10"` pins thread 0 to CPU 3, thread 1 to CPU 5, and so on.
- Trade-off: More threads = higher CPU parallelism for decoding but increased context-switching overhead. Too few threads = CPU becomes bottleneck for GPU pipeline.
- NUMA Awareness: If `DALI_AFFINITY_MASK` lists fewer CPUs than `num_threads`, the remaining threads fall back to `nvmlDeviceGetCpuAffinity` to auto-select cores near the GPU.
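The mask-to-thread mapping described above can be sketched in plain Python. This is a minimal illustration of the documented semantics, not DALI's implementation: `parse_affinity_mask` and `assign_threads` are hypothetical helper names, and `fallback_cpus` stands in for the cores the real library would obtain via `nvmlDeviceGetCpuAffinity`.

```python
def parse_affinity_mask(mask: str) -> list[int]:
    """Parse a DALI_AFFINITY_MASK-style string such as "3,5,6,10"."""
    return [int(cpu) for cpu in mask.split(",") if cpu.strip()]

def assign_threads(num_threads: int, mask: str,
                   fallback_cpus: list[int]) -> dict[int, int]:
    """Map worker-thread index -> CPU id.

    Threads beyond the mask length fall back to `fallback_cpus`,
    standing in for cores chosen via nvmlDeviceGetCpuAffinity.
    """
    cpus = parse_affinity_mask(mask)
    mapping = {}
    for tid in range(num_threads):
        if tid < len(cpus):
            mapping[tid] = cpus[tid]
        else:
            mapping[tid] = fallback_cpus[(tid - len(cpus)) % len(fallback_cpus)]
    return mapping

# Thread 0 -> CPU 3, thread 1 -> CPU 5, etc.; threads 4-5 use the fallback set.
print(assign_threads(6, "3,5,6,10", fallback_cpus=[0, 1]))
```

With `num_threads=6` and a four-entry mask, the first four threads follow the mask and the last two take the NUMA-aware fallback cores.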
Reasoning
DALI's pipelined execution model (`exec_pipelined=True, exec_async=True`) overlaps CPU and GPU work. CPU threads read and decode data while the GPU processes the previous batch. If CPU threads are slow (too few) or migrate between NUMA nodes (no affinity), the GPU starves for data. Pinning threads to cores near the GPU's PCIe connection minimizes memory access latency.
The default of 4 threads balances parallelism and overhead for typical ImageNet training. Video pipelines use fewer threads (2) because video decoding is primarily GPU-bound (NVDEC).
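Whether pinning took effect can be spot-checked from inside the training process. This is a Linux-only sketch using only the standard library; it reports the process-level CPU set, not per-thread DALI affinity, so it is a coarse sanity check rather than a precise probe.

```python
import os

# Linux-only: which CPUs is the current process allowed to run on?
# After DALI_AFFINITY_MASK is set and the pipeline is built, DALI's
# worker threads should be restricted to the listed cores.
allowed = os.sched_getaffinity(0)  # 0 = current process
print(sorted(allowed))
```

On a system with no affinity restrictions this prints all online CPUs; a restricted set suggests pinning (by DALI, `taskset`, or the job scheduler) is in force.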
Code Evidence
Thread configuration from `docs/examples/use_cases/pytorch/resnet50/main.py:65,314`:
```python
parser.add_argument('-j', '--workers', default=4, type=int, metavar='N',
                    help='number of data loading workers (default: 4)')

pipe = create_dali_pipeline(
    num_threads=args.workers,  # CPU worker threads
    device_id=args.local_rank,
)
```
Video pipeline threads from `docs/examples/sequence_processing/video/video_label_example.py:74`:
```python
pipe = video_pipeline(
    num_threads=2,  # Fewer threads for GPU-bound video decoding
    device_id=0,
)
```
Affinity documentation from `docs/advanced_topics_performance_tuning.rst:8-37`:
```text
DALI_AFFINITY_MASK environment variable:
  "3,5,6,10" maps thread 0 -> CPU 3, thread 1 -> CPU 5, etc.
  If more threads than CPUs are specified, the remaining threads use
  nvmlDeviceGetCpuAffinity.
  nvJPEG hybrid decoder threads are always auto-affined.
```
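Putting the pieces together, the mask is set in the environment before launch. A hedged sketch: the CPU ids below are illustrative and should be replaced with cores on the GPU's NUMA node, and the commented launch line assumes the ResNet50 example's `main.py` and its `-j/--workers` flag shown above.

```shell
# Pick cores on the GPU's NUMA node (check topology with `nvidia-smi topo -m`);
# the ids below are illustrative.
export DALI_AFFINITY_MASK="3,5,6,10"

# Launch with a matching worker-thread count, e.g.:
#   python main.py -j 4 <imagenet-dir>
echo "$DALI_AFFINITY_MASK"
```

Keeping the number of mask entries equal to `num_threads` avoids relying on the NVML fallback for the overflow threads.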