Implementation:NVIDIA DALI Paddle ResNet Training
| Knowledge Sources | |
|---|---|
| Domains | Vision, Training |
| Last Updated | 2026-02-08 16:00 GMT |
Overview
Orchestrates the full PaddlePaddle static-graph training and evaluation program for ResNet-50 with NVIDIA DALI data loading integration.
Description
This module provides the complete training program logic for the PaddlePaddle ResNet-50 example. It operates in PaddlePaddle's static graph mode (paddle.static) and contains functions for building the computational graph, compiling programs, running training/evaluation loops, and managing distributed training.
The core workflow is: (1) create_feeds creates input placeholders for image data and labels, (2) build assembles the full program by instantiating the model, creating loss/metric fetch operations, and configuring the optimizer with optional AMP/ASP support, (3) compile_prog compiles the program with operator fusion optimizations, and (4) run executes the training or evaluation loop over a DALI data iterator, collecting metrics like loss, top-1/top-5 accuracy, throughput (images/sec), and latency.
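The top-1/top-5 accuracy collected in step (4) can be illustrated with a minimal pure-Python sketch. The helper name topk_accuracy and the toy scores are assumptions made for illustration; the actual program computes these metrics inside the static graph rather than in Python.

```python
def topk_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score, truncated to k.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in topk
    return hits / len(labels)


# Toy batch: 3 samples, 4 classes (illustrative values only).
scores = [
    [0.1, 0.7, 0.2, 0.0],  # highest score: class 1
    [0.5, 0.1, 0.3, 0.1],  # highest score: class 0, runner-up: class 2
    [0.3, 0.2, 0.0, 0.5],  # highest score: class 3, runner-up: class 0
]
labels = [1, 2, 0]

top1 = topk_accuracy(scores, labels, k=1)
top2 = topk_accuracy(scores, labels, k=2)
print(f"top1={top1:.3f} top2={top2:.3f}")
```

With k=5 and 1000 classes, the same logic gives the top-5 accuracy reported for ImageNet-style classification.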
The module supports distributed training via PaddlePaddle Fleet with NCCL collective communication, automatic mixed precision (AMP) with dynamic loss scaling, automatic sparsity (ASP) with configurable mask algorithms, and a benchmark mode with warmup steps. The run function consumes batches directly from nvidia.dali.plugin.paddle.DALIGenericIterator, so DALI takes the place of the framework's native data loading pipeline.
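To make the idea of dynamic loss scaling concrete, the following is a generic sketch of the technique, not PaddlePaddle's implementation. The class name DynamicLossScaler and its parameters (growth_factor, backoff_factor, growth_interval) are hypothetical names chosen for this illustration.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler; names and defaults are illustrative assumptions."""

    def __init__(self, init_scale=2.0 ** 15, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=1000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, found_inf):
        """Adjust the scale after a step; return True if the update should be applied."""
        if found_inf:
            # Overflow in the scaled gradients: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            # Enough consecutive clean steps: try a larger scale.
            self.scale *= self.growth_factor
            self._good_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
overflow_history = [False, False, False, True, False]
applied = [scaler.update(found_inf) for found_inf in overflow_history]
print(applied, scaler.scale)
```

The scale grows after a run of overflow-free steps and shrinks as soon as an overflow is detected, which keeps small FP16 gradients representable without letting large ones overflow for long.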
Usage
Use this module as the main training program when running the PaddlePaddle ResNet-50 DALI example. It is called from the top-level training script after configuring arguments via the config module and setting up the DALI pipeline.
Code Reference
Source Location
- Repository: NVIDIA_DALI
- File: docs/examples/use_cases/paddle/resnet50/program.py
- Lines: 1-447
Signature
def create_feeds(image_shape): ...
def create_fetchs(out, feeds, class_num, label_smoothing=0, mode=Mode.TRAIN): ...
def create_strategy(args, is_train=True): ...
def dist_optimizer(args, optimizer): ...
def build(args, main_prog, startup_prog, step_each_epoch, is_train=True): ...
def compile_prog(args, program, loss_name=None, is_train=True): ...
def run(args, dataloader, exe, program, fetchs, epoch,
        mode=Mode.TRAIN, lr_scheduler=None): ...
def log_info(step, metrics, mode): ...
Import
from program import create_feeds, create_fetchs, build, compile_prog, run
I/O Contract
Inputs (build function)
| Name | Type | Required | Description |
|---|---|---|---|
| args | Namespace | Yes | Parsed command-line arguments containing model, optimizer, and training configuration. |
| main_prog | paddle.static.Program | Yes | The main program to build the computation graph in. |
| startup_prog | paddle.static.Program | Yes | The startup program for parameter initialization. |
| step_each_epoch | int | Yes | Number of training steps per epoch, used for learning rate scheduling. |
| is_train | bool | No | Whether to build for training (True) or evaluation (False). Default: True. |
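The step_each_epoch argument is typically derived from the dataset size and the global batch size. The sketch below assumes that a trailing partial batch counts as one step and uses the ImageNet-1k training set size (1,281,167 images); the helper name steps_per_epoch is hypothetical and not part of program.py.

```python
import math


def steps_per_epoch(num_samples, batch_size_per_gpu, num_gpus=1):
    """Number of optimizer steps needed to see every sample once."""
    global_batch = batch_size_per_gpu * num_gpus
    # Assumption: a final partial batch still counts as one step.
    return math.ceil(num_samples / global_batch)


# ImageNet-1k training set with a global batch size of 256.
steps = steps_per_epoch(1_281_167, batch_size_per_gpu=256, num_gpus=1)
print(steps)  # 5005
```

The same global batch size spread over 8 GPUs (32 images each) gives the same step count, which is why the learning rate schedule depends on the global rather than the per-GPU batch size.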
Outputs (build function)
| Name | Type | Description |
|---|---|---|
| fetchs | dict | Dictionary mapping metric names (loss, top1, top5) to (variable, AverageMeter) tuples. |
| lr_scheduler | paddle.optimizer.lr.LRScheduler | Learning rate scheduler instance (None if is_train=False). |
| feeds | dict | Dictionary mapping feed names ('data', 'label') to static data placeholders. |
| optimizer | Optimizer | Distributed optimizer with AMP/ASP configuration (None if is_train=False). |
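The shape of the fetchs dictionary can be sketched without PaddlePaddle. In the example below, plain strings stand in for static-graph variables, and the AverageMeter class is a simplified, hypothetical stand-in for the example's meter utility, whose real interface may differ.

```python
class AverageMeter:
    """Simplified stand-in: tracks the latest value and a sample-weighted average."""

    def __init__(self, name):
        self.name = name
        self.reset()

    def reset(self):
        self.val = 0.0
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


# Strings stand in for the variables that build would place in each tuple.
fetchs = {
    "loss": ("loss_var", AverageMeter("loss")),
    "top1": ("top1_var", AverageMeter("top1")),
    "top5": ("top5_var", AverageMeter("top5")),
}

# The variables go to the executor; the meters accumulate what comes back.
fetch_list = [var for var, _ in fetchs.values()]

# Pretend the executor returned these values for batches of 4 and 2 samples.
fake_results = [((2.0, 0.50, 0.75), 4), ((1.0, 0.00, 0.50), 2)]
for results, batch_size in fake_results:
    for (name, (_, meter)), value in zip(fetchs.items(), results):
        meter.update(value, batch_size)

averages = {name: meter.avg for name, (_, meter) in fetchs.items()}
print(fetch_list, averages)
```

Keeping the variable and its meter in one tuple lets the loop build the executor's fetch list and update the matching meters from a single dictionary.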
Inputs (run function)
| Name | Type | Required | Description |
|---|---|---|---|
| args | Namespace | Yes | Parsed command-line arguments. |
| dataloader | DALIGenericIterator | Yes | NVIDIA DALI data loader iterator producing batches. |
| exe | paddle.static.Executor | Yes | PaddlePaddle static executor to run the program. |
| program | paddle.static.Program | Yes | Compiled program to execute. |
| fetchs | dict | Yes | Fetch variables and meters from the build step. |
| epoch | int | Yes | Current epoch number. |
| mode | Mode | No | Training or evaluation mode. Default: Mode.TRAIN. |
| lr_scheduler | LRScheduler | No | Learning rate scheduler to step per iteration. Default: None. |
Outputs (run function)
| Name | Type | Description |
|---|---|---|
| epoch_data | dict | Dictionary of epoch-level metrics including loss, epoch_time, ips, top1, top5 (eval only). |
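Throughput and latency figures of this kind can be derived from per-batch wall-clock times. The following sketch assumes that warmup steps are excluded from the statistics; the function name summarize_epoch and the latency_* keys are hypothetical and only illustrate the calculation.

```python
import statistics


def summarize_epoch(batch_times, batch_size, warmup_steps=0):
    """Compute throughput (images/sec) and latency stats, ignoring warmup steps."""
    measured = batch_times[warmup_steps:]
    total_time = sum(measured)
    return {
        "measured_steps": len(measured),
        "ips": batch_size * len(measured) / total_time,
        "latency_avg": statistics.mean(measured),
        "latency_median": statistics.median(measured),
        "latency_max": max(measured),
    }


# Per-batch times in seconds; the first two steps are slow warmup iterations.
batch_times = [0.9, 0.5, 0.1, 0.1, 0.1, 0.1]
summary = summarize_epoch(batch_times, batch_size=256, warmup_steps=2)
print(summary)
```

Dropping the warmup iterations keeps one-time costs such as memory allocation and kernel autotuning from distorting the reported steady-state throughput.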
Usage Examples
Building and running a training program
import paddle
from program import build, compile_prog, run
from utils.mode import Mode
# `args` (parsed configuration) and `dali_dataloader` (a DALIGenericIterator)
# are assumed to have been created earlier by the top-level training script.
paddle.enable_static()
main_prog = paddle.static.Program()
startup_prog = paddle.static.Program()
# Build training program
fetchs, lr_scheduler, feeds, optimizer = build(
    args, main_prog, startup_prog, step_each_epoch=5005, is_train=True
)
# Compile with operator fusion
compiled_prog = compile_prog(args, main_prog, loss_name='loss', is_train=True)
# Initialize parameters, then execute one training epoch
exe = paddle.static.Executor(paddle.CUDAPlace(0))
exe.run(startup_prog)
metrics = run(args, dali_dataloader, exe, compiled_prog, fetchs,
              epoch=0, mode=Mode.TRAIN, lr_scheduler=lr_scheduler)