Implementation:Microsoft DeepSpeedExamples Domino Training

From Leeroopedia


Knowledge Sources
Domains: Distributed Training, Large Language Models
Last Updated: 2026-02-07 12:00 GMT

Overview

The Domino pretraining utilities module, adapted from Megatron-LM, orchestrates the full distributed training pipeline: model setup, data loading, the training loop, evaluation, and checkpointing.

Description

This module provides the complete pretraining orchestration for the DeepSpeed-Domino distributed training system, adapted from Megatron-LM's training.py. The central function, pretrain(), manages the entire training lifecycle: it initializes Megatron, sets up the model and optimizer, builds the data iterators, runs the training loop, and performs evaluation. It builds on Megatron's pipeline-parallel and tensor-parallel infrastructure.
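The lifecycle described above can be sketched as a plain-Python driver. This is only an illustration of the stage ordering: `pretrain_sketch`, the toy builders, and the `stages` list are hypothetical names introduced here, not part of Domino or Megatron-LM, and the real pretrain() takes no `log` argument.

```python
from typing import Callable, List


def pretrain_sketch(model_builder: Callable, dataset_builder: Callable,
                    forward_step_func: Callable, log: List[str]) -> int:
    """Illustrative stage ordering only; not the real pretrain()."""
    log.append("initialize")                      # stand-in for Megatron initialization
    model = model_builder(True, True)             # pre_process, post_process
    log.append("setup_model_and_optimizer")
    train_ds, valid_ds, _test_ds = dataset_builder((4, 2, 0))
    train_iter, valid_iter = iter(train_ds), iter(valid_ds)
    log.append("build_data_iterators")
    iteration = 0
    for _ in range(len(train_ds)):                # training loop
        forward_step_func(train_iter, model)
        iteration += 1
    log.append("train")
    for _ in range(len(valid_ds)):                # evaluation pass
        forward_step_func(valid_iter, model)
    log.append("evaluate")
    return iteration


# Toy stand-ins (assumptions) so the sketch runs without a GPU or Megatron.
def toy_model_builder(pre_process, post_process):
    return lambda batch: float(batch) * 0.5


def toy_dataset_builder(num_samples):
    train_n, valid_n, test_n = num_samples
    return list(range(train_n)), list(range(valid_n)), list(range(test_n))


def toy_forward_step(data_iterator, model):
    loss = model(next(data_iterator))
    return loss, {"lm loss": loss}


stages: List[str] = []
final_iteration = pretrain_sketch(toy_model_builder, toy_dataset_builder,
                                  toy_forward_step, stages)
```

The point of the sketch is the ordering: nothing touches data until the model and optimizer exist, and evaluation runs only after the training loop returns.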

setup_model_and_optimizer() builds the model through the user-provided model_builder function, wraps it in the appropriate distributed data parallel wrapper (LocalDDP or torchDDP), initializes the Megatron optimizer, configures the learning rate scheduler, and optionally loads a checkpoint. get_model() handles model construction, including Float16Module wrapping and the distributed data parallel configuration.
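A minimal sketch of that setup order follows, using toy classes so it runs anywhere. `DDPWrapper`, `ToyOptimizer`, `ToyScheduler`, `setup_sketch`, the linear warmup rule, and the checkpoint dictionary layout are all assumptions made for illustration; they do not mirror the real LocalDDP, torchDDP, Megatron optimizer, or scheduler classes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class DDPWrapper:
    """Hypothetical stand-in for LocalDDP / torchDDP; it only holds the module."""
    module: object


@dataclass
class ToyOptimizer:
    """Hypothetical optimizer holding a learning rate and a step counter."""
    lr: float
    steps: int = 0


@dataclass
class ToyScheduler:
    """Hypothetical linear-warmup scheduler (an assumed schedule shape)."""
    optimizer: ToyOptimizer
    max_lr: float
    warmup_steps: int

    def step(self) -> None:
        self.optimizer.steps += 1
        frac = min(1.0, self.optimizer.steps / self.warmup_steps)
        self.optimizer.lr = self.max_lr * frac


def setup_sketch(model_builder: Callable, checkpoint: Optional[Dict] = None):
    # 1. Build the model and 2. wrap it; the result is a list of shards.
    model = [DDPWrapper(model_builder(True, True))]
    # 3. Create the optimizer, then 4. attach the learning rate scheduler.
    optimizer = ToyOptimizer(lr=0.0)
    scheduler = ToyScheduler(optimizer, max_lr=1e-3, warmup_steps=10)
    # 5. Optionally resume from a checkpoint (hypothetical dict layout).
    if checkpoint is not None:
        optimizer.steps = checkpoint["iteration"]
    return model, optimizer, scheduler


model, optimizer, scheduler = setup_sketch(
    lambda pre, post: {"pre_process": pre, "post_process": post},
    checkpoint={"iteration": 5},
)
scheduler.step()  # resumes at step 6 of a 10-step warmup
```

Restoring the checkpoint after the scheduler is built lets the resumed step count drive the schedule, so warmup continues from where it stopped instead of restarting.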

The train() function implements the main training loop with CUDA event-based timing, loss logging, and iteration tracking. For each iteration it calls train_step(), which runs the forward-backward pass through Megatron's pipeline-parallel forward_backward_func, reduces gradients, and steps the optimizer. Additional utilities include training_log() for TensorBoard logging, evaluate() for running validation, evaluate_and_print_results() for formatted evaluation output, and save_checkpoint_and_time() for timed checkpoint saving.
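The loop structure can be sketched as below. To keep it runnable on a CPU, the sketch swaps CUDA events for `time.perf_counter`, and it replaces the backward pass, gradient reduction, and optimizer step with a simple counter. `train_sketch`, `train_step_sketch`, `log_interval`, and the `"lm loss"` key are illustrative assumptions, not the module's actual internals.

```python
import time


def train_step_sketch(forward_step_func, data_iterator, model, state):
    """One iteration: forward pass plus a stand-in for backward and optimizer step."""
    _loss, metrics = forward_step_func(data_iterator, model)
    # Real code would run backward, reduce gradients across ranks,
    # and call the optimizer; here we only count the update.
    state["optimizer_steps"] += 1
    return metrics


def train_sketch(forward_step_func, model, data_iterator, train_iters, log_interval):
    state = {"optimizer_steps": 0}
    logged = []                              # (iteration, mean loss, ms per iteration)
    iteration = 0
    running_loss = 0.0
    window_start = time.perf_counter()       # CPU timer in place of CUDA events
    while iteration < train_iters:
        metrics = train_step_sketch(forward_step_func, data_iterator, model, state)
        iteration += 1
        running_loss += metrics["lm loss"]
        if iteration % log_interval == 0:
            elapsed_ms = (time.perf_counter() - window_start) * 1000.0
            logged.append((iteration, running_loss / log_interval,
                           elapsed_ms / log_interval))
            running_loss = 0.0
            window_start = time.perf_counter()
    return iteration, logged, state


def doubled_forward_step(data_iterator, model):
    loss = model(next(data_iterator))
    return loss, {"lm loss": loss}


last_iteration, log_rows, train_state = train_sketch(
    doubled_forward_step,
    model=lambda x: 2.0 * x,
    data_iterator=iter([1.0, 2.0, 3.0, 4.0]),
    train_iters=4,
    log_interval=2,
)
```

Averaging the loss and the elapsed time over a logging window, rather than reporting every step, smooths per-iteration noise; on a GPU, event-based timing serves the same purpose without forcing a host synchronization at each step.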

Usage

Use this module as the main entry point for Domino-accelerated pretraining. Call the pretrain() function with a model builder, dataset builder, and forward step function to launch the full distributed training pipeline. It is designed for large-scale language model pretraining with Megatron-LM parallelism strategies enhanced by DeepSpeed Domino's communication overlap optimization.

Code Reference

Source Location

Signature

def pretrain(model_builder, dataset_builder, forward_step_func):

def setup_model_and_optimizer(model_builder, model_type,
                               no_wd_decay_cond=None,
                               scale_lr_cond=None, lr_mult=1.0):

def get_model(model_builder, model_type=ModelType.encoder_or_decoder,
              wrap_with_ddp=True):

def train(forward_step_func, model, optimizer, opt_param_scheduler,
          train_data_iterator, valid_data_iterator, config):

def train_step(forward_step_func, data_iterator, model,
               optimizer, opt_param_scheduler, config):

def evaluate(forward_step_func, data_iterator, model, config,
             verbose=False):

def evaluate_and_print_results(prefix, forward_step_func,
                               data_iterator, model, iteration, config,
                               verbose=False, write_to_tensorboard=False):

Import

from domino.training import pretrain

I/O Contract

Inputs

Name | Type | Required | Description
model_builder | Callable | Yes | Function that takes pre_process and post_process bools and returns a model
dataset_builder | Callable | Yes | Function that takes dataset sizes and returns train/valid/test datasets
forward_step_func | Callable | Yes | Function that takes data_iterator and model, returns loss and metrics dict
model_type | ModelType | No | Model type enum (default: encoder_or_decoder)
config | TransformerConfig | Yes | Model configuration (used in train loop)
train_data_iterator | Iterator | Yes | Iterator over training data batches
valid_data_iterator | Iterator | No | Iterator over validation data batches
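The three callables in the table above can be written down as type aliases and smoke-tested before launching a real job. This is a sketch under assumptions: the alias names, `check_callables`, and the `(2, 1, 1)` sample sizes are hypothetical, and the check only exercises the shapes listed in the table, not Domino's actual validation.

```python
from typing import Any, Callable, Dict, Iterator, Sequence, Tuple

# Type aliases mirroring the table rows (names are illustrative).
ModelBuilder = Callable[[bool, bool], Any]
DatasetBuilder = Callable[[Sequence[int]], Tuple[Any, Any, Any]]
ForwardStep = Callable[[Iterator[Any], Any], Tuple[Any, Dict[str, Any]]]


def check_callables(model_builder: ModelBuilder,
                    dataset_builder: DatasetBuilder,
                    forward_step_func: ForwardStep):
    """Cheap smoke test that the callables follow the documented contract."""
    model = model_builder(True, True)
    datasets = dataset_builder((2, 1, 1))
    if len(datasets) != 3:
        raise ValueError("dataset_builder must return train/valid/test datasets")
    loss, metrics = forward_step_func(iter(datasets[0]), model)
    if not isinstance(metrics, dict):
        raise TypeError("forward_step_func must return (loss, metrics dict)")
    return loss, metrics


def scaled_forward_step(data_iterator, model):
    loss = model(next(data_iterator))
    return loss, {"lm loss": loss}


smoke_loss, smoke_metrics = check_callables(
    lambda pre_process, post_process: (lambda batch: batch * 0.1),
    lambda sizes: tuple([1.0] * n for n in sizes),
    scaled_forward_step,
)
```

Running a check like this on a single process catches signature mismatches early, before the cost of initializing a multi-GPU job.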

Outputs

Name | Type | Description
model | List[Module] | List of model shards (for pipeline parallelism)
optimizer | MegatronOptimizer | The configured optimizer
opt_param_scheduler | OptimizerParamScheduler | Learning rate scheduler
iteration | int | Final iteration count after training completes

Usage Examples

from domino.training import pretrain

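# GPTModel is a placeholder for your model class; import it from your own model code.
def model_builder(pre_process, post_process):
    return GPTModel(pre_process=pre_process, post_process=post_process)

# build_train_valid_test_datasets is likewise a placeholder for your dataset factory.
def dataset_builder(train_val_test_num_samples):
    return build_train_valid_test_datasets(train_val_test_num_samples)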

def forward_step(data_iterator, model):
    batch = next(data_iterator)
    loss = model(batch)
    return loss, {'lm loss': loss}

# Launch full pretraining pipeline
pretrain(
    model_builder=model_builder,
    dataset_builder=dataset_builder,
    forward_step_func=forward_step
)
