Environment:Microsoft Onnxruntime CPU Training Environment
| Field | Value |
|---|---|
| sources | setup.py, requirements-training.txt, orttraining/orttraining/training_ops/cpu/ |
| domains | training, cpu-kernels, gradients, optimizers |
| last_updated | 2026-02-10 |
Overview
The CPU-based training environment for executing ONNX Runtime training operator kernels (gradients, optimizers, loss functions) on standard CPUs without GPU acceleration.
Description
The CPU Training Environment provides the runtime context for executing training-specific operator kernels implemented under orttraining/orttraining/training_ops/cpu/. These kernels cover gradient computations for activations (GELU, FastGELU), convolutions, pooling, batch normalization, layer normalization, and recurrent networks (LSTM, GRU); gradients for tensor operations (Gather, Slice, Split, Concat); loss functions (CrossEntropy, SoftmaxCrossEntropyLoss); optimizers (AdamW, SGDv2, and the legacy SGD/Adam kernels); gradient control operations (accumulation, clipping, scaling); collective communication (MPI Send/Recv); quantization (FakeQuant); and TensorBoard summary operations. The environment requires the onnxruntime-training package variant, which includes these additional CPU kernels beyond the standard inference-only package. MPI support is optional and is required only for the MpiSend/MpiRecv communication kernels used in distributed training scenarios.
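To make one of these kernels concrete: the GeluGrad kernel backpropagates through the exact (erf-based) GELU, whose derivative is Φ(x) + x·φ(x), with Φ the standard normal CDF and φ its density. The following is a minimal NumPy sketch of that formula, not the kernel's actual code; the function name `gelu_grad` is illustrative.

```python
import numpy as np
from math import erf, sqrt, pi

def gelu_grad(dY, X):
    """Sketch of the GELU gradient: dX = dY * (Phi(x) + x * phi(x))."""
    # Standard normal CDF, elementwise (math.erf is scalar, so loop).
    Phi = np.array([0.5 * (1.0 + erf(x / sqrt(2.0))) for x in X.ravel()]).reshape(X.shape)
    # Standard normal PDF.
    phi = np.exp(-0.5 * X * X) / sqrt(2.0 * pi)
    return dY * (Phi + X * phi)
```

A quick sanity check is to compare against a finite difference of gelu(x) = x·Φ(x); the two agree to several decimal places for moderate inputs.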
Usage
Use this environment whenever you need to:
- Execute training operator gradients on CPU (e.g., testing, debugging, or CPU-only training).
- Run optimizer kernels (AdamW, SGDv2) without GPU acceleration.
- Compute loss functions (CrossEntropy, SoftmaxCrossEntropyLoss) on CPU.
- Perform gradient clipping, scaling, or accumulation on CPU tensors.
- Use MPI-based tensor communication for distributed CPU training.
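As an illustration of what the CPU optimizer kernels compute, here is a NumPy sketch of one textbook AdamW update step (decoupled weight decay). This is the reference formula only; the actual AdamW kernel accepts grouped tensors and additional attributes, and the hyperparameter defaults below are the common textbook values, not necessarily the kernel's.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for parameter w with gradient g at step t (1-based)."""
    # Exponential moving averages of the gradient and squared gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    # Bias correction for the zero-initialized moments.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay, then the Adam step.
    w = w - lr * weight_decay * w
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

Running a single step from zero-initialized moments moves each weight by roughly lr (the bias-corrected first moment equals the gradient at t = 1).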
System Requirements
| Requirement | Minimum | Recommended |
|---|---|---|
| Python | 3.10 | 3.12 |
| Operating System | Linux (manylinux2014), Windows, macOS | Linux x86_64 |
| Architecture | x86_64, aarch64 | x86_64 |
| RAM | 4 GB | 16 GB+ (model dependent) |
| Disk | 500 MB (package) | 1 GB+ (with training data) |
Dependencies
Python Packages
| Package | Version | Purpose |
|---|---|---|
| onnxruntime-training | >= 1.25.0 | Training package variant with CPU training kernels |
| numpy | >= 1.21.6 | Tensor I/O and data manipulation |
| onnx | >= 1.12 | ONNX model format support |
| flatbuffers | | Checkpoint serialization |
| protobuf | | Model format support |
Optional Dependencies
| Package | Purpose |
|---|---|
| mpi4py | Required only for MpiSend/MpiRecv distributed communication kernels |
| h5py | Checkpoint I/O in HDF5 format |
| cerberus | Configuration validation for training parameters |
Code Evidence
- Source: orttraining/orttraining/training_ops/cpu/activation/activations_grad.cc - GeluGrad, FastGeluGrad, BiasGeluGrad_dX, BiasFastGeluGrad_dX kernels
- Source: orttraining/orttraining/training_ops/cpu/optimizer/adamw/adamw.cc - AdamW optimizer kernel on CPU
- Source: orttraining/orttraining/training_ops/cpu/optimizer/sgd/sgd.cc - SGDOptimizerV2 kernel on CPU
- Source: orttraining/orttraining/training_ops/cpu/loss/cross_entropy.cc - CrossEntropy loss and gradient on CPU
- Source: orttraining/orttraining/training_ops/cpu/loss/softmax_cross_entropy_loss.cc - SoftmaxCrossEntropyLoss and gradient on CPU
- Source: orttraining/orttraining/training_ops/cpu/communication/send.cc - MPI tensor send for distributed training
- Source: orttraining/orttraining/training_ops/cpu/communication/recv.cc - MPI tensor receive for distributed training
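The mathematical core of the loss kernels listed above is standard softmax cross-entropy: the loss is the negative log-probability of the true class, and its gradient with respect to the logits is softmax(z) minus the one-hot label, averaged over the batch. The NumPy sketch below shows that core only; the real kernel additionally supports class weights, ignore-index, and other reduction modes.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy loss and its gradient w.r.t. logits.

    logits: float array of shape (N, C); labels: int array of shape (N,).
    """
    # Shift by the row max for numerical stability before exponentiating.
    z = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(z)
    probs = exp / exp.sum(axis=1, keepdims=True)
    n = logits.shape[0]
    # Loss: negative log-probability of each sample's true class, averaged.
    loss = -np.log(probs[np.arange(n), labels]).mean()
    # Gradient: softmax(z) - one_hot(labels), scaled by 1/N for the mean.
    grad = probs.copy()
    grad[np.arange(n), labels] -= 1.0
    grad /= n
    return loss, grad
```

For uniform logits over two classes with label 0, this yields loss = ln 2 and gradient (-0.5, 0.5), as expected.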
Related Pages
Implementations Using This Environment
- Implementation:Microsoft_Onnxruntime_CPU_ActivationsGrad
- Implementation:Microsoft_Onnxruntime_CPU_AdamW
- Implementation:Microsoft_Onnxruntime_CPU_Adasum
- Implementation:Microsoft_Onnxruntime_CPU_BatchNormGrad
- Implementation:Microsoft_Onnxruntime_CPU_BroadcastGradArgs
- Implementation:Microsoft_Onnxruntime_CPU_ClipGradNorm
- Implementation:Microsoft_Onnxruntime_CPU_ConvGrad
- Implementation:Microsoft_Onnxruntime_CPU_CrossEntropy
- Implementation:Microsoft_Onnxruntime_CPU_Dropout7
- Implementation:Microsoft_Onnxruntime_CPU_DropoutGrad
- Implementation:Microsoft_Onnxruntime_CPU_FakeQuant
- Implementation:Microsoft_Onnxruntime_CPU_GRU_Forward
- Implementation:Microsoft_Onnxruntime_CPU_GRU_Grad
- Implementation:Microsoft_Onnxruntime_CPU_GRU_GradCompute
- Implementation:Microsoft_Onnxruntime_CPU_GRU_IOUtils
- Implementation:Microsoft_Onnxruntime_CPU_GatherElementsGrad
- Implementation:Microsoft_Onnxruntime_CPU_GatherGrad
- Implementation:Microsoft_Onnxruntime_CPU_GatherNDGrad
- Implementation:Microsoft_Onnxruntime_CPU_GradientControl
- Implementation:Microsoft_Onnxruntime_CPU_LSTM_Forward
- Implementation:Microsoft_Onnxruntime_CPU_LSTM_Grad
- Implementation:Microsoft_Onnxruntime_CPU_LSTM_GradCompute
- Implementation:Microsoft_Onnxruntime_CPU_LSTM_IOUtils
- Implementation:Microsoft_Onnxruntime_CPU_LayerNormGrad
- Implementation:Microsoft_Onnxruntime_CPU_MpiRecv
- Implementation:Microsoft_Onnxruntime_CPU_MpiSend
- Implementation:Microsoft_Onnxruntime_CPU_OpGradients
- Implementation:Microsoft_Onnxruntime_CPU_PoolGrad
- Implementation:Microsoft_Onnxruntime_CPU_ReductionAll
- Implementation:Microsoft_Onnxruntime_CPU_ReductionOps
- Implementation:Microsoft_Onnxruntime_CPU_SGD_Adam
- Implementation:Microsoft_Onnxruntime_CPU_SGDv2
- Implementation:Microsoft_Onnxruntime_CPU_Scale
- Implementation:Microsoft_Onnxruntime_CPU_SliceGrad
- Implementation:Microsoft_Onnxruntime_CPU_SoftmaxCrossEntropyLoss
- Implementation:Microsoft_Onnxruntime_CPU_SummaryOps
- Implementation:Microsoft_Onnxruntime_CPU_TrainingConcat
- Implementation:Microsoft_Onnxruntime_CPU_TrainingSplit