Environment:Microsoft Onnxruntime CPU Training Environment
| Field | Value |
|---|---|
| sources | setup.py, requirements-training.txt, orttraining/orttraining/training_ops/cpu/ |
| domains | training, cpu-kernels, gradients, optimizers |
| last_updated | 2026-02-10 |
Overview
The CPU-based training environment for executing ONNX Runtime training operator kernels (gradients, optimizers, loss functions) on standard CPUs without GPU acceleration.
Description
The CPU Training Environment provides the runtime context for executing training-specific operator kernels implemented under orttraining/orttraining/training_ops/cpu/. These kernels cover gradient computations for activations (GELU, FastGELU), convolutions, pooling, batch normalization, layer normalization, and recurrent networks (LSTM, GRU); gradients for tensor operations (Gather, Slice, Split, Concat); loss functions (CrossEntropy, SoftmaxCrossEntropyLoss); optimizers (AdamW, SGDv2, and the legacy SGD/Adam kernels); gradient control operations (accumulation, clipping, scaling); collective communication (MPI Send/Recv); quantization (FakeQuant); and TensorBoard summary operations. The environment requires the onnxruntime-training package variant, which includes these additional CPU kernels beyond the standard inference-only package. MPI support is optional and is required only for the MpiSend/MpiRecv communication kernels used in distributed training scenarios.
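To make one of these kernels concrete: the GeluGrad kernel backpropagates through the exact (erf-based) GELU, whose derivative is Φ(x) + x·φ(x), with Φ the standard normal CDF and φ its density. The following is a minimal NumPy sketch of that formula, not the kernel's actual code; the function name `gelu_grad` is illustrative.

```python
import numpy as np
from math import erf, sqrt, pi

def gelu_grad(dY, X):
    """Sketch of the GELU gradient: dX = dY * (Phi(x) + x * phi(x))."""
    # Standard normal CDF, elementwise (math.erf is scalar, so loop).
    Phi = np.array([0.5 * (1.0 + erf(x / sqrt(2.0))) for x in X.ravel()]).reshape(X.shape)
    # Standard normal PDF.
    phi = np.exp(-0.5 * X * X) / sqrt(2.0 * pi)
    return dY * (Phi + X * phi)
```

A quick sanity check is to compare against a finite difference of gelu(x) = x·Φ(x); the two agree to several decimal places for moderate inputs.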
Usage
Use this environment whenever you need to:
- Execute training operator gradients on CPU (e.g., testing, debugging, or CPU-only training).
- Run optimizer kernels (AdamW, SGDv2) without GPU acceleration.
- Compute loss functions (CrossEntropy, SoftmaxCrossEntropyLoss) on CPU.
- Perform gradient clipping, scaling, or accumulation on CPU tensors.
- Use MPI-based tensor communication for distributed CPU training.
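As an illustration of what the CPU optimizer kernels compute, here is a NumPy sketch of one textbook AdamW update step (decoupled weight decay). This is the reference formula only; the actual AdamW kernel accepts grouped tensors and additional attributes, and the hyperparameter defaults below are the common textbook values, not necessarily the kernel's.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for parameter w with gradient g at step t (1-based)."""
    # Exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay, then the Adam step.
    w = w - lr * weight_decay * w
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Running a single step from zero-initialized moments moves each weight by roughly lr (the bias-corrected first moment equals the gradient at t = 1).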
System Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.10 | 3.12 |
| Operating System | Linux (manylinux2014), Windows, macOS | Linux x86_64 |
| Architecture | x86_64, aarch64 | x86_64 |
| RAM | 4 GB | 16 GB+ (model dependent) |
| Disk | 500 MB (package) | 1 GB+ (with training data) |
Dependencies
Python Packages
| Package | Version | Purpose |
|---|---|---|
| onnxruntime-training | >= 1.25.0 | Training package variant with CPU training kernels |
| numpy | >= 1.21.6 | Tensor I/O and data manipulation |
| onnx | >= 1.12 | ONNX model format support |
| flatbuffers | | Checkpoint serialization |
| protobuf | | Model format support |
Optional Dependencies
| Package | Purpose |
|---|---|
| mpi4py | Required only for MpiSend/MpiRecv distributed communication kernels |
| h5py | Checkpoint I/O in HDF5 format |
| cerberus | Configuration validation for training parameters |
Code Evidence
- Source: orttraining/orttraining/training_ops/cpu/activation/activations_grad.cc - GeluGrad, FastGeluGrad, BiasGeluGrad_dX, BiasFastGeluGrad_dX kernels
- Source: orttraining/orttraining/training_ops/cpu/optimizer/adamw/adamw.cc - AdamW optimizer kernel on CPU
- Source: orttraining/orttraining/training_ops/cpu/optimizer/sgd/sgd.cc - SGDOptimizerV2 kernel on CPU
- Source: orttraining/orttraining/training_ops/cpu/loss/cross_entropy.cc - CrossEntropy loss and gradient on CPU
- Source: orttraining/orttraining/training_ops/cpu/loss/softmax_cross_entropy_loss.cc - SoftmaxCrossEntropyLoss and gradient on CPU
- Source: orttraining/orttraining/training_ops/cpu/communication/send.cc - MPI tensor send for distributed training
- Source: orttraining/orttraining/training_ops/cpu/communication/recv.cc - MPI tensor receive for distributed training
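The mathematical core of the loss kernels listed above is standard softmax cross-entropy: the loss is the negative log-probability of the true class, and its gradient with respect to the logits is softmax(z) minus the one-hot label, averaged over the batch. The NumPy sketch below shows that core only; the real kernel additionally supports class weights, ignore-index, and other reduction modes.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy loss and its gradient w.r.t. logits.

    logits: float array of shape (N, C); labels: int array of shape (N,).
    """
    # Shift by the row max for numerical stability before exponentiating.
    z = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(z)
    probs = exp / exp.sum(axis=1, keepdims=True)
    n = logits.shape[0]
    # Loss: negative log-probability of each sample's true class, averaged.
    loss = -np.log(probs[np.arange(n), labels]).mean()
    # Gradient: softmax(z) - one_hot(labels), scaled by 1/N for the mean.
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0
    grad /= n
    return loss, grad
```

For uniform logits over two classes with label 0, this yields loss = ln 2 and gradient (-0.5, 0.5), as expected.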
Related Pages
Implementations Using This Environment
- Implementation:Microsoft_Onnxruntime_CPU_ActivationsGrad
- Implementation:Microsoft_Onnxruntime_CPU_AdamW
- Implementation:Microsoft_Onnxruntime_CPU_Adasum
- Implementation:Microsoft_Onnxruntime_CPU_BatchNormGrad
- Implementation:Microsoft_Onnxruntime_CPU_BroadcastGradArgs
- Implementation:Microsoft_Onnxruntime_CPU_ClipGradNorm
- Implementation:Microsoft_Onnxruntime_CPU_ConvGrad
- Implementation:Microsoft_Onnxruntime_CPU_CrossEntropy
- Implementation:Microsoft_Onnxruntime_CPU_Dropout7
- Implementation:Microsoft_Onnxruntime_CPU_DropoutGrad
- Implementation:Microsoft_Onnxruntime_CPU_FakeQuant
- Implementation:Microsoft_Onnxruntime_CPU_GRU_Forward
- Implementation:Microsoft_Onnxruntime_CPU_GRU_Grad
- Implementation:Microsoft_Onnxruntime_CPU_GRU_GradCompute
- Implementation:Microsoft_Onnxruntime_CPU_GRU_IOUtils
- Implementation:Microsoft_Onnxruntime_CPU_GatherElementsGrad
- Implementation:Microsoft_Onnxruntime_CPU_GatherGrad
- Implementation:Microsoft_Onnxruntime_CPU_GatherNDGrad
- Implementation:Microsoft_Onnxruntime_CPU_GradientControl
- Implementation:Microsoft_Onnxruntime_CPU_LSTM_Forward
- Implementation:Microsoft_Onnxruntime_CPU_LSTM_Grad
- Implementation:Microsoft_Onnxruntime_CPU_LSTM_GradCompute
- Implementation:Microsoft_Onnxruntime_CPU_LSTM_IOUtils
- Implementation:Microsoft_Onnxruntime_CPU_LayerNormGrad
- Implementation:Microsoft_Onnxruntime_CPU_MpiRecv
- Implementation:Microsoft_Onnxruntime_CPU_MpiSend
- Implementation:Microsoft_Onnxruntime_CPU_OpGradients
- Implementation:Microsoft_Onnxruntime_CPU_PoolGrad
- Implementation:Microsoft_Onnxruntime_CPU_ReductionAll
- Implementation:Microsoft_Onnxruntime_CPU_ReductionOps
- Implementation:Microsoft_Onnxruntime_CPU_SGD_Adam
- Implementation:Microsoft_Onnxruntime_CPU_SGDv2
- Implementation:Microsoft_Onnxruntime_CPU_Scale
- Implementation:Microsoft_Onnxruntime_CPU_SliceGrad
- Implementation:Microsoft_Onnxruntime_CPU_SoftmaxCrossEntropyLoss
- Implementation:Microsoft_Onnxruntime_CPU_SummaryOps
- Implementation:Microsoft_Onnxruntime_CPU_TrainingConcat
- Implementation:Microsoft_Onnxruntime_CPU_TrainingSplit