Implementation:Microsoft Onnxruntime PipelinePoc

Knowledge Sources	Microsoft_Onnxruntime
Domains	Training, Pipeline, Distributed
Last Updated	2026-02-10 04:00 GMT

Overview

A proof-of-concept application demonstrating pipeline-parallel training across multiple MPI ranks, using CUDA and ONNX Runtime inference sessions.

Description

This `main.cc` file implements a pipeline parallelism proof-of-concept for distributed training. It requires both CUDA and MPI to be enabled (guarded by `#if defined(USE_CUDA) && defined(USE_MPI)`). The application:

1. Parses command-line arguments using `cxxopts` for `num_steps`, `model_stage0_name`, `model_stage1_name`, and `model_stage2_name`.
2. Initializes MPI to determine `world_rank` and `world_size`.
3. Creates an ONNX Runtime Environment and SessionOptions with detailed configuration, including sequential execution mode, memory patterns, CPU memory arena, and Level1 graph optimization.
4. Registers the CUDA execution provider (EP) on each rank, using the rank as the CUDA device ID.
5. Loads a stage-specific model per rank: rank 0 loads stage0, rank 1 loads stage1, rank 2 loads stage2.
6. Executes the pipeline rounds. In each round:

  - Rank 0 creates input tensors (`X1`) and feeds them through its model, receiving output `X8`.
  - Rank 1 runs with no explicit feeds, receiving outputs `X2` and `X4` (data arrives via MPI send/receive nodes in the graph).
  - Rank 2 runs with no explicit feeds, receiving output `X3`.
  - `MPI_Barrier` synchronizes all ranks between rounds.

7. Finalizes MPI after all rounds complete.
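The rank-to-stage mapping in steps 4 and 5 can be sketched with the standard library alone. This is a minimal illustration, not the code in `main.cc`: the helpers `ModelForRank` and `CudaDeviceForRank` are hypothetical names introduced here, and returning an empty string for ranks outside the three-stage pipeline is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Mirrors the Parameters struct listed under Signature.
struct Parameters {
  std::size_t num_steps;
  std::string model_stage0_name;
  std::string model_stage1_name;
  std::string model_stage2_name;
};

// Hypothetical helper: picks the model file a given MPI rank should load.
// Ranks outside 0..2 get an empty string (an assumption of this sketch).
std::string ModelForRank(const Parameters& params, int world_rank) {
  switch (world_rank) {
    case 0: return params.model_stage0_name;
    case 1: return params.model_stage1_name;
    case 2: return params.model_stage2_name;
    default: return std::string();
  }
}

// Hypothetical helper: the description states that the rank doubles as the
// CUDA device ID, so the mapping is the identity.
int CudaDeviceForRank(int world_rank) { return world_rank; }
```

With three ranks launched, rank 1 would resolve to `model_stage1_name` and CUDA device 1. Keeping the mapping in one small function makes the one-process-per-stage, one-GPU-per-process assumption easy to see.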

If CUDA or MPI is not enabled at build time, the program reports an `ORT_NOT_IMPLEMENTED` error instead of running the pipeline.
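The build-time guard can be pictured roughly as follows. This sketch only illustrates the `#if defined(USE_CUDA) && defined(USE_MPI)` pattern named above. The `PocStatus` enum, the `RunPipelinePoc` function, and the message text are assumptions made for the example; the real program relies on ONNX Runtime's own error handling.

```cpp
#include <cassert>
#include <iostream>

// Hypothetical status type for this sketch only.
enum class PocStatus { kOk, kNotImplemented };

// Hypothetical entry point: runs the pipeline only when both features were
// enabled at compile time, otherwise reports that it is not implemented.
PocStatus RunPipelinePoc() {
#if defined(USE_CUDA) && defined(USE_MPI)
  // The MPI initialization, session setup, and pipeline rounds would go here.
  return PocStatus::kOk;
#else
  std::cerr << "ORT_NOT_IMPLEMENTED: pipeline POC requires CUDA and MPI\n";
  return PocStatus::kNotImplemented;
#endif
}
```

Compiled without `-DUSE_CUDA -DUSE_MPI`, the function takes the fallback branch, so a build that lacks either feature still links and fails with a clear message at run time.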

Usage

Use this as a reference implementation for pipeline parallel training with ONNX Runtime. It demonstrates how to partition a model across multiple processes/GPUs using MPI for inter-process communication.

Code Reference

Source Location

Repository: Microsoft_Onnxruntime
File: orttraining/orttraining/models/pipeline_poc/main.cc
Lines: 1-236

Signature

struct Parameters {
  size_t num_steps;
  std::string model_stage0_name;
  std::string model_stage1_name;
  std::string model_stage2_name;
};

Status parse_arguments(int argc, char* argv[], Parameters& params);
int main(int argc, char* argv[]);
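As a rough illustration of what `parse_arguments` has to do, the sketch below fills a `Parameters` struct from `--name value` pairs. It deliberately swaps in a hand-rolled standard-library loop for `cxxopts`, and it returns `bool` rather than ONNX Runtime's `Status`; both choices, along with the name `ParseArgumentsSketch`, are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>

// Mirrors the Parameters struct listed under Signature.
struct Parameters {
  std::size_t num_steps;
  std::string model_stage0_name;
  std::string model_stage1_name;
  std::string model_stage2_name;
};

// Stand-in for the cxxopts-based parser: accepts "--name value" pairs only.
// Returns false on an unknown flag, a flag without a value, or a num_steps
// value that std::stoul rejects.
bool ParseArgumentsSketch(int argc, const char* const argv[], Parameters& params) {
  for (int i = 1; i < argc; i += 2) {
    const std::string flag = argv[i];
    if (i + 1 >= argc) return false;  // flag without a value
    const std::string value = argv[i + 1];
    if (flag == "--num_steps") {
      try {
        params.num_steps = std::stoul(value);
      } catch (const std::exception&) {
        return false;
      }
    } else if (flag == "--model_stage0_name") {
      params.model_stage0_name = value;
    } else if (flag == "--model_stage1_name") {
      params.model_stage1_name = value;
    } else if (flag == "--model_stage2_name") {
      params.model_stage2_name = value;
    } else {
      return false;  // unknown flag
    }
  }
  return true;
}
```

Each rank parses the same command line, so every process ends up with all three model names and selects its own stage afterwards.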

Import

#include "core/session/environment.h"
#include "orttraining/core/session/training_session.h"
#include "orttraining/models/runner/training_util.h"
#include "orttraining/models/runner/data_loader.h"

I/O Contract

Component	Inputs	Outputs	Description
Command-line args	--num_steps, --model_stage0_name, --model_stage1_name, --model_stage2_name	Parameters struct	Pipeline configuration
Rank 0 session	X1 tensor (2x2 float)	X8 tensor	First pipeline stage with explicit input
Rank 1 session	(implicit MPI receive)	X2, X4 tensors	Middle pipeline stage
Rank 2 session	(implicit MPI receive)	X3 tensor	Final pipeline stage
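The per-rank rows of the table above can be encoded as plain data. In this sketch the `StageIo` struct and `IoForRank` function are hypothetical names, not part of `main.cc`; only the tensor names `X1`, `X8`, `X2`, `X4`, and `X3` come from the contract itself.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical container for the names a stage feeds explicitly and fetches.
struct StageIo {
  std::vector<std::string> feed_names;
  std::vector<std::string> fetch_names;
};

// Hypothetical lookup of the I/O contract by MPI rank. Ranks 1 and 2 have no
// explicit feeds because their inputs arrive through MPI receive nodes that
// are part of the stage graph.
StageIo IoForRank(int world_rank) {
  StageIo io;
  switch (world_rank) {
    case 0:
      io.feed_names = {"X1"};
      io.fetch_names = {"X8"};
      break;
    case 1:
      io.fetch_names = {"X2", "X4"};
      break;
    case 2:
      io.fetch_names = {"X3"};
      break;
    default:
      break;  // ranks outside the pipeline: nothing to feed or fetch
  }
  return io;
}
```

An empty `feed_names` for ranks 1 and 2 reflects that the host program passes no tensors to those sessions; the data dependency lives inside the graphs.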

Usage Examples

// Run with MPI:
// mpirun -np 3 ./pipeline_poc \
//   --num_steps 10 \
//   --model_stage0_name stage0.onnx \
//   --model_stage1_name stage1.onnx \
//   --model_stage2_name stage2.onnx
