Implementation:Microsoft Onnxruntime PipelinePoc
| Knowledge Sources | |
|---|---|
| Domains | Training, Pipeline, Distributed |
| Last Updated | 2026-02-10 04:00 GMT |
Overview
A proof-of-concept application demonstrating pipeline-parallel training across multiple MPI ranks, with point-to-point (P2P) communication between stages, using CUDA and ONNX Runtime inference sessions.
Description
This `main.cc` file implements a pipeline parallelism proof-of-concept for distributed training. It requires both CUDA and MPI to be enabled (guarded by `#if defined(USE_CUDA) && defined(USE_MPI)`). The application:
1. Parses command-line arguments using `cxxopts` for `num_steps`, `model_stage0_name`, `model_stage1_name`, and `model_stage2_name`.
2. Initializes MPI to determine `world_rank` and `world_size`.
3. Creates an ONNX Runtime Environment and SessionOptions with detailed configuration, including sequential execution mode, memory patterns, CPU memory arena, and Level1 graph optimization.
4. Registers the CUDA execution provider (EP) on each rank, using the rank as the CUDA device ID.
5. Loads a stage-specific model per rank: rank 0 loads stage0, rank 1 loads stage1, rank 2 loads stage2.
6. Executes pipeline rounds. In each round:
   - Rank 0 creates input tensors (`X1`) and feeds them through its model, receiving output `X8`.
   - Rank 1 runs with no explicit feeds, receiving outputs `X2` and `X4` (data arrives via MPI send/receive nodes in the graph).
   - Rank 2 runs with no explicit feeds, receiving output `X3`.
   - `MPI_Barrier` synchronizes all ranks between rounds.
7. Finalizes MPI after all rounds complete.
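The per-rank model selection in step 5 can be sketched as a small lookup. This is a minimal illustration, not the code in `main.cc`: the helper name `stage_model_for_rank` and the choice to throw on ranks outside 0-2 are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Mirrors the Parameters struct listed under Signature.
struct Parameters {
  std::size_t num_steps;
  std::string model_stage0_name;
  std::string model_stage1_name;
  std::string model_stage2_name;
};

// Hypothetical helper (not part of main.cc): maps an MPI rank to the
// model file for its pipeline stage. Ranks 0, 1 and 2 correspond to
// stage0, stage1 and stage2; any other rank is rejected here.
inline const std::string& stage_model_for_rank(const Parameters& params,
                                               int world_rank) {
  switch (world_rank) {
    case 0:
      return params.model_stage0_name;
    case 1:
      return params.model_stage1_name;
    case 2:
      return params.model_stage2_name;
    default:
      throw std::out_of_range("this sketch only handles ranks 0, 1 and 2");
  }
}
```

With three stages hard-wired to three ranks, launching with a different process count has no stage to assign, which is why the sketch rejects other ranks explicitly.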
If CUDA or MPI is not available, the program outputs an "ORT_NOT_IMPLEMENTED" error.
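The compile-time guard and its fallback follow a common preprocessor pattern, sketched below. The function `pipeline_poc_build_mode` and the `"pipeline"` return value are hypothetical and exist only to make the two branches observable; the real file places the whole application body inside the guarded branch.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: reports which branch of the guard was compiled.
// The macro names USE_CUDA and USE_MPI are the ones quoted in the
// description; everything else here is illustrative.
inline std::string pipeline_poc_build_mode() {
#if defined(USE_CUDA) && defined(USE_MPI)
  // Both features enabled: the pipeline code path would be built.
  return "pipeline";
#else
  // Either feature missing: only the "not implemented" fallback exists.
  return "ORT_NOT_IMPLEMENTED";
#endif
}
```

Because the check happens at compile time, a binary built without both flags carries no pipeline code at all and can only report the fallback.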
Usage
Use this as a reference implementation for pipeline parallel training with ONNX Runtime. It demonstrates how to partition a model across multiple processes/GPUs using MPI for inter-process communication.
Code Reference
Source Location
- Repository: Microsoft_Onnxruntime
- File: orttraining/orttraining/models/pipeline_poc/main.cc
- Lines: 1-236
Signature
struct Parameters {
  size_t num_steps;
  std::string model_stage0_name;
  std::string model_stage1_name;
  std::string model_stage2_name;
};
Status parse_arguments(int argc, char* argv[], Parameters& params);
int main(int argc, char* argv[]);
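To illustrate how the four options populate `Parameters`, here is a minimal sketch that uses only the standard library. It deliberately swaps `cxxopts` for a hand-written `--flag value` loop, returns `bool` instead of `Status`, and takes a vector of strings instead of `argc`/`argv`; the name `parse_arguments_sketch` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>
#include <vector>

struct Parameters {
  std::size_t num_steps = 0;
  std::string model_stage0_name;
  std::string model_stage1_name;
  std::string model_stage2_name;
};

// Stand-in for parse_arguments: expects "--flag value" pairs.
// Returns false on an unknown flag, a flag without a value, or a
// num_steps value that is not a number.
inline bool parse_arguments_sketch(const std::vector<std::string>& args,
                                   Parameters& params) {
  for (std::size_t i = 0; i < args.size(); i += 2) {
    if (i + 1 >= args.size()) return false;  // flag without a value
    const std::string& flag = args[i];
    const std::string& value = args[i + 1];
    if (flag == "--num_steps") {
      try {
        params.num_steps = static_cast<std::size_t>(std::stoull(value));
      } catch (const std::exception&) {
        return false;  // not a valid unsigned integer
      }
    } else if (flag == "--model_stage0_name") {
      params.model_stage0_name = value;
    } else if (flag == "--model_stage1_name") {
      params.model_stage1_name = value;
    } else if (flag == "--model_stage2_name") {
      params.model_stage2_name = value;
    } else {
      return false;  // unknown flag
    }
  }
  return true;
}
```

A real option parser such as `cxxopts` also handles defaults, help text, and `--flag=value` syntax, none of which this sketch attempts.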
Import
#include "core/session/environment.h"
#include "orttraining/core/session/training_session.h"
#include "orttraining/models/runner/training_util.h"
#include "orttraining/models/runner/data_loader.h"
I/O Contract
| Component | Inputs | Outputs | Description |
|---|---|---|---|
| Command-line args | --num_steps, --model_stage0_name, --model_stage1_name, --model_stage2_name | Parameters struct | Pipeline configuration |
| Rank 0 session | X1 tensor (2x2 float) | X8 tensor | First pipeline stage with explicit input |
| Rank 1 session | (implicit MPI receive) | X2, X4 tensors | Middle pipeline stage |
| Rank 2 session | (implicit MPI receive) | X3 tensor | Final pipeline stage |
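The contract in the table can be written down as data, which makes the asymmetry between rank 0 and the later stages explicit. The sketch below is an assumption-laden illustration: `StageIo` and `stage_io_for_rank` are hypothetical names, and only the tensor names come from the table.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Names of the tensors a stage is fed explicitly and the outputs it fetches.
struct StageIo {
  std::vector<std::string> feed_names;
  std::vector<std::string> fetch_names;
};

// Hypothetical helper: returns the I/O contract for a rank.
// Only rank 0 has an explicit feed; ranks 1 and 2 get their inputs
// through the MPI receive nodes embedded in their graphs.
inline StageIo stage_io_for_rank(int world_rank) {
  switch (world_rank) {
    case 0:
      return StageIo{{"X1"}, {"X8"}};
    case 1:
      return StageIo{{}, {"X2", "X4"}};
    case 2:
      return StageIo{{}, {"X3"}};
    default:
      throw std::out_of_range("this sketch only handles ranks 0, 1 and 2");
  }
}
```

An empty `feed_names` list for ranks 1 and 2 reflects that the host code passes nothing in; the data dependency lives inside the partitioned graphs.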
Usage Examples
// Run with MPI:
// mpirun -np 3 ./pipeline_poc \
// --num_steps 10 \
// --model_stage0_name stage0.onnx \
// --model_stage1_name stage1.onnx \
// --model_stage2_name stage2.onnx