
Workflow:Microsoft Onnxruntime Distributed Model Training

From Leeroopedia


Knowledge Sources
Domains Distributed_Training, GPU_Computing, Large_Scale_ML
Last Updated 2026-02-10 04:30 GMT

Overview

End-to-end process for training deep learning models at scale using ONNX Runtime's distributed training infrastructure with multi-GPU and multi-node support.

Description

This workflow covers large-scale model training using ONNX Runtime's C++ training subsystem. It supports data parallelism with NCCL AllReduce for gradient synchronization, pipeline parallelism for model sharding across stages, mixed precision (FP16/BF16) training for memory efficiency and throughput, and advanced optimizers (Adam, AdamW, LAMB, SGD). The training runner orchestrates the complete lifecycle including data loading, iterative training with periodic evaluation, checkpoint management, TensorBoard logging, and memory optimization through activation recomputation. Reference implementations for GPT-2 and SqueezeNet demonstrate the pattern.

Usage

Execute this workflow when training large deep learning models (transformers, CNNs) that require multiple GPUs or nodes for acceptable training time. This is suitable for pre-training language models, fine-tuning large models on custom datasets, or any scenario where single-GPU training is too slow or the model exceeds single-GPU memory. The C++ training infrastructure is appropriate when maximum performance is critical and the model can be expressed as an ONNX graph.

Execution Steps

Step 1: Prepare Training Configuration

Define the training parameters including model paths (training graph, evaluation graph, optimizer graph), data directories, output directory, optimizer selection and hyperparameters (learning rate, weight decay), batch size, number of epochs, and distributed training settings (data parallelism degree, pipeline stages).

Key considerations:

  • Separate model graphs for training (with loss and gradients) and evaluation
  • Learning rate schedule selection (linear, polynomial, cosine)
  • Gradient accumulation steps for effective batch size scaling
  • Mixed precision configuration (FP16 or BF16 compute type)
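The relationship between per-GPU batch size, gradient accumulation, and data-parallel degree determines the effective batch size the optimizer sees, and the learning rate schedule is usually defined over total update steps. A minimal pure-Python sketch of both calculations (function names are illustrative, not the actual ONNX Runtime configuration API):

```python
def effective_batch_size(per_gpu_batch: int,
                         accumulation_steps: int,
                         data_parallel_degree: int) -> int:
    """Global batch size seen by the optimizer per weight update."""
    return per_gpu_batch * accumulation_steps * data_parallel_degree


def linear_lr(step: int, total_steps: int, base_lr: float,
              warmup_steps: int = 0) -> float:
    """Linear warmup followed by linear decay to zero."""
    if warmup_steps and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    decay_span = max(total_steps - warmup_steps, 1)
    return base_lr * remaining / decay_span
```

For example, 8 samples per GPU with 4 accumulation steps across 16 workers gives an effective batch size of 512, which is the number the learning rate should be tuned against.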

Step 2: Initialize Runtime Environment

Set up the ONNX Runtime environment with logging configuration and execution providers. For distributed training, initialize MPI for inter-process communication, create NCCL communicators for GPU-to-GPU gradient synchronization, and assign each process to its GPU device.

Key considerations:

  • MPI rank determines the local GPU assignment
  • NCCL communicators must be created before training begins
  • Environment logging level controls diagnostic output verbosity
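The rank-to-GPU mapping noted above is typically a simple modulo scheme when ranks are packed node by node. A sketch of that convention (an assumption about launcher layout, not a fixed ONNX Runtime rule):

```python
def device_assignment(world_rank: int, gpus_per_node: int) -> tuple:
    """Map a global MPI rank to (node_index, local_gpu_index),
    assuming ranks are assigned consecutively within each node."""
    return world_rank // gpus_per_node, world_rank % gpus_per_node
```

Each process then binds to its local GPU index before creating NCCL communicators, so no two ranks on a node contend for the same device.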

Step 3: Load and Partition Data

Load training and evaluation datasets using the DataLoader infrastructure. For distributed training, partition the data across workers so each process trains on a unique shard. The DataLoader supports reading from binary files with configurable batch sizes and data schemas.

Key considerations:

  • Data must be pre-processed into the expected binary format
  • Each worker loads only its partition for memory efficiency
  • Evaluation data can be replicated across workers or partitioned
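One common way to give each worker a unique shard is a strided partition over sample indices. A pure-Python sketch (the actual DataLoader reads pre-processed binary files; this only illustrates the partitioning logic):

```python
def shard_indices(num_samples: int, world_size: int, rank: int) -> list:
    """Strided partition: worker `rank` takes every world_size-th sample.
    Every sample belongs to exactly one worker, with no overlap."""
    return list(range(rank, num_samples, world_size))
```

With 10 samples and 3 workers, rank 0 gets indices 0, 3, 6, 9; the three shards together cover the dataset exactly once.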

Step 4: Execute Training Loop

Run the iterative training loop managed by the TrainingRunner. Each iteration performs a forward pass through the training graph, computes loss, executes the backward pass to accumulate gradients, synchronizes gradients across workers (AllReduce), updates weights with the configured optimizer, and adjusts the learning rate. Periodic evaluation runs the eval graph on validation data.

Key considerations:

  • Gradient accumulation allows effective batch sizes larger than GPU memory permits
  • AllReduce synchronizes gradients across all data-parallel workers
  • Pipeline parallelism overlaps computation across model stages
  • Loss scaling prevents gradient underflow in mixed precision
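The core of the iteration can be sketched as: unscale the (possibly loss-scaled) gradients, average them across workers, and apply an optimizer step. This toy stand-in simulates NCCL AllReduce with plain lists and uses SGD in place of the configurable optimizer; it is a conceptual sketch, not the TrainingRunner's C++ implementation:

```python
def allreduce_mean(worker_grads):
    """Simulated NCCL AllReduce: after the call, every worker holds
    the element-wise mean of all workers' gradients."""
    n = len(worker_grads)
    dim = len(worker_grads[0])
    mean = [sum(g[i] for g in worker_grads) / n for i in range(dim)]
    return [mean[:] for _ in range(n)]


def train_step(weights, worker_grads, lr, loss_scale=1.0):
    """One data-parallel update: unscale gradients (mixed precision
    backprop computes gradients of loss * loss_scale), AllReduce-average,
    then apply a plain SGD step."""
    unscaled = [[g / loss_scale for g in grads] for grads in worker_grads]
    synced = allreduce_mean(unscaled)
    # Every replica applies the identical averaged gradient, so the
    # model copies stay bitwise in sync without broadcasting weights.
    return [w - lr * g for w, g in zip(weights, synced[0])]
```

Averaging (rather than summing) the gradients keeps the learning rate independent of the number of workers; loss scaling is undone before the optimizer step so the update magnitude is unchanged.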

Step 5: Manage Checkpoints

Save training checkpoints at configurable intervals. Checkpoints capture all model parameters, optimizer states (momentum buffers), and training progress metadata. Training can resume from any saved checkpoint, which provides fault tolerance and supports experiment continuation.

Key considerations:

  • Checkpoint frequency balances fault tolerance against I/O overhead
  • Only retain the N most recent checkpoints to manage disk space
  • Distributed checkpoints save one file per worker for efficiency
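Retaining only the N most recent checkpoints is a simple rotation over the checkpoint directory. A sketch of that pruning step (the `ckpt_` filename prefix is a hypothetical convention, not the runner's actual naming scheme):

```python
import os


def prune_checkpoints(ckpt_dir: str, keep: int) -> list:
    """Delete all but the `keep` most recent checkpoint files,
    ordered by modification time; returns the retained filenames."""
    ckpts = sorted(
        (f for f in os.listdir(ckpt_dir) if f.startswith("ckpt_")),
        key=lambda f: os.path.getmtime(os.path.join(ckpt_dir, f)),
    )
    for stale in ckpts[:-keep]:
        os.remove(os.path.join(ckpt_dir, stale))
    return ckpts[-keep:]
```

In a distributed run, each worker would prune only its own per-worker checkpoint files to avoid races on shared storage.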

Step 6: Monitor Training Progress

Track training metrics using TensorBoard integration. The training runner logs loss values, learning rate progression, gradient norms, and evaluation metrics. Profiling support captures per-operator timing for performance analysis and bottleneck identification.

Key considerations:

  • TensorBoard summary writers log scalar metrics per iteration
  • Profiling can be enabled for specific iteration ranges
  • Evaluation metrics provide validation accuracy and loss tracking
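Two of the logged quantities are easy to make concrete: the global gradient norm is the L2 norm over all gradient elements, and a summary writer is essentially a tag-to-series map. A minimal stand-in (not the TensorBoard API the runner actually uses):

```python
import math


def global_grad_norm(grads) -> float:
    """L2 norm over all gradient elements; a sudden spike here is a
    common early symptom of training divergence."""
    return math.sqrt(sum(g * g for g in grads))


class ScalarLog:
    """Toy stand-in for a TensorBoard summary writer: records
    (step, value) pairs per tag for later plotting."""

    def __init__(self):
        self.history = {}

    def add_scalar(self, tag: str, value: float, step: int) -> None:
        self.history.setdefault(tag, []).append((step, value))
```

The runner would call something like `add_scalar("loss", loss, step)` once per iteration, alongside learning rate and gradient-norm scalars.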

Step 7: Export Trained Model

After training completes, extract the final trained parameters and export an inference-ready ONNX model. The exported model strips training-specific operators (gradient computation, optimizer updates) and retains only the forward computation graph with trained weights.

Key considerations:

  • The exported model is a standard ONNX model loadable by any ONNX Runtime inference session
  • Final checkpoint can be converted to inference model offline
  • Model can be further optimized with quantization or graph optimization
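Conceptually, the export filters the graph down to forward-only nodes. A pure-Python sketch over dictionary "nodes" (the operator names in the set are illustrative placeholders, not the exact ONNX training op types, and the real export also rewires graph inputs and outputs):

```python
# Illustrative training-only op names; the real op set differs.
TRAINING_OP_TYPES = {"Gradient", "AdamOptimizer", "LambOptimizer",
                     "SGDOptimizer"}


def strip_training_nodes(nodes):
    """Keep only forward-pass nodes, dropping gradient computation
    and optimizer-update operators from the graph."""
    return [n for n in nodes if n["op_type"] not in TRAINING_OP_TYPES]
```

The surviving forward graph, paired with the trained weights from the final checkpoint, is what gets serialized as the inference-ready ONNX model.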

Execution Diagram

GitHub URL

Workflow Repository