Implementation:Microsoft Onnxruntime PipelinePoc

From Leeroopedia


Knowledge Sources
Domains Training, Pipeline, Distributed
Last Updated 2026-02-10 04:00 GMT

Overview

A proof-of-concept application that demonstrates pipeline-parallel (P2P) training across multiple MPI ranks, using CUDA and ONNX Runtime inference sessions.

Description

This `main.cc` file implements a pipeline parallelism proof-of-concept for distributed training. It requires both CUDA and MPI to be enabled (guarded by `#if defined(USE_CUDA) && defined(USE_MPI)`). The application:

1. Parses command-line arguments using `cxxopts` for `num_steps`, `model_stage0_name`, `model_stage1_name`, and `model_stage2_name`.
2. Initializes MPI to determine `world_rank` and `world_size`.
3. Creates an ONNX Runtime Environment and SessionOptions with detailed configuration, including sequential execution mode, memory patterns, CPU memory arena, and Level1 graph optimization.
4. Registers the CUDA EP on each rank, using the rank as the CUDA device ID.
5. Loads a stage-specific model per rank: rank 0 loads stage0, rank 1 loads stage1, rank 2 loads stage2.
6. Executes pipeline rounds. For each round:

  - Rank 0 creates input tensors (`X1`) and feeds them through its model, receiving output `X8`.
  - Rank 1 runs with no explicit feeds, receiving outputs `X2` and `X4` (data arrives via MPI send/receive nodes in the graph).
  - Rank 2 runs with no explicit feeds, receiving output `X3`.
  - `MPI_Barrier` synchronizes all ranks between rounds.

7. Finalizes MPI after all rounds complete.
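The rank-to-stage mapping in step 5 can be sketched with the standard library alone. This is a minimal illustration, not the code in `main.cc`: the helper name `SelectStageModel` is hypothetical, and rejecting ranks outside 0 to 2 with an exception is an assumption of the sketch rather than documented behavior.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Mirrors the Parameters struct listed under "Signature" on this page.
struct Parameters {
  std::size_t num_steps = 0;
  std::string model_stage0_name;
  std::string model_stage1_name;
  std::string model_stage2_name;
};

// Hypothetical helper (not part of the original source): returns the model
// file that a given MPI rank should load, following the mapping
// rank 0 -> stage0, rank 1 -> stage1, rank 2 -> stage2.
inline const std::string& SelectStageModel(const Parameters& params,
                                           int world_rank) {
  switch (world_rank) {
    case 0:
      return params.model_stage0_name;
    case 1:
      return params.model_stage1_name;
    case 2:
      return params.model_stage2_name;
    default:
      // Assumption of this sketch: any other rank is treated as an error.
      throw std::out_of_range("sketch supports only ranks 0, 1, and 2");
  }
}
```

In a real run, `world_rank` would come from `MPI_Comm_rank`, and the same value would also serve as the CUDA device ID described in step 4.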

If the program is built without CUDA or MPI support, the pipeline code is compiled out and the program instead reports an `ORT_NOT_IMPLEMENTED` error.

Usage

Use this as a reference implementation for pipeline parallel training with ONNX Runtime. It demonstrates how to partition a model across multiple processes/GPUs using MPI for inter-process communication.

Code Reference

Source Location

Signature

struct Parameters {
  size_t num_steps;
  std::string model_stage0_name;
  std::string model_stage1_name;
  std::string model_stage2_name;
};

Status parse_arguments(int argc, char* argv[], Parameters& params);
int main(int argc, char* argv[]);
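For readers who want to see how the four options map onto `Parameters`, the following is a rough stand-in for `parse_arguments`. It deliberately does not use `cxxopts` or the ONNX Runtime `Status` type, so it compiles with the standard library only. The function name `ParseArgumentsSketch`, the `bool` return value, and the rule that all three model names must be supplied are assumptions of this sketch, not behavior confirmed by the source.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// Mirrors the Parameters struct from the signature above.
struct Parameters {
  std::size_t num_steps = 0;
  std::string model_stage0_name;
  std::string model_stage1_name;
  std::string model_stage2_name;
};

// Hypothetical stand-in for parse_arguments: expects "--flag value" pairs and
// returns true on success, false on any malformed or unknown argument.
inline bool ParseArgumentsSketch(int argc, const char* const argv[],
                                 Parameters& params) {
  for (int i = 1; i < argc; ++i) {
    const std::string flag = argv[i];
    if (i + 1 >= argc) return false;  // every flag needs a value
    const std::string value = argv[++i];
    if (flag == "--num_steps") {
      try {
        params.num_steps = static_cast<std::size_t>(std::stoull(value));
      } catch (const std::exception&) {
        return false;  // value was not a valid unsigned integer
      }
    } else if (flag == "--model_stage0_name") {
      params.model_stage0_name = value;
    } else if (flag == "--model_stage1_name") {
      params.model_stage1_name = value;
    } else if (flag == "--model_stage2_name") {
      params.model_stage2_name = value;
    } else {
      return false;  // unknown flag
    }
  }
  // Assumption: all three stage models are required.
  return !params.model_stage0_name.empty() &&
         !params.model_stage1_name.empty() &&
         !params.model_stage2_name.empty();
}
```

The original returns a `Status` rather than a `bool`, so a caller in `main` would check that status before continuing to MPI initialization.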

Import

#include "core/session/environment.h"
#include "orttraining/core/session/training_session.h"
#include "orttraining/models/runner/training_util.h"
#include "orttraining/models/runner/data_loader.h"

I/O Contract

| Component | Inputs | Outputs | Description |
|---|---|---|---|
| Command-line args | `--num_steps`, `--model_stage0_name`, `--model_stage1_name`, `--model_stage2_name` | `Parameters` struct | Pipeline configuration |
| Rank 0 session | `X1` tensor (2x2 float) | `X8` tensor | First pipeline stage with explicit input |
| Rank 1 session | (implicit MPI receive) | `X2`, `X4` tensors | Middle pipeline stage |
| Rank 2 session | (implicit MPI receive) | `X3` tensor | Final pipeline stage |
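The per-rank contract above can be captured as plain data. The sketch below is illustrative only: `StageIo` and `IoForRank` are hypothetical names, and the tensor names are taken from the table rather than from the source file. Ranks 1 and 2 have empty feed lists because their inputs arrive through the MPI receive nodes inside the graph.

```cpp
#include <string>
#include <vector>

// Hypothetical container for the names a rank passes to a session run:
// explicit feeds (inputs) and requested fetches (outputs).
struct StageIo {
  std::vector<std::string> feed_names;
  std::vector<std::string> fetch_names;
};

// Hypothetical helper: returns the feed and fetch names for a rank, as listed
// in the I/O contract. Unknown ranks get empty lists (a sketch assumption).
inline StageIo IoForRank(int world_rank) {
  StageIo io;
  switch (world_rank) {
    case 0:
      io.feed_names = std::vector<std::string>{"X1"};  // 2x2 float input
      io.fetch_names = std::vector<std::string>{"X8"};
      break;
    case 1:
      // No explicit feeds: data arrives via MPI receive nodes in the graph.
      io.fetch_names = std::vector<std::string>{"X2", "X4"};
      break;
    case 2:
      io.fetch_names = std::vector<std::string>{"X3"};
      break;
    default:
      break;
  }
  return io;
}
```

Keeping the names in one place like this makes it easier to check that each stage's model actually exposes the outputs the driver asks for.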

Usage Examples

// Run with MPI:
// mpirun -np 3 ./pipeline_poc \
//   --num_steps 10 \
//   --model_stage0_name stage0.onnx \
//   --model_stage1_name stage1.onnx \
//   --model_stage2_name stage2.onnx
