Implementation:Microsoft DeepSpeedExamples Domino Pretrain GPT

From Leeroopedia


Knowledge Sources
Domains Deep Learning, Language Modeling, Distributed Training
Last Updated 2026-02-07 12:00 GMT

Overview

Entry point script for GPT pretraining using the DeepSpeed Domino framework, providing model construction, dataset building, forward step logic, and loss computation.

Description

This module serves as the main pretraining entry point for GPT models under the DeepSpeed Domino distributed training system. It defines four key functions that are passed as callbacks to the Domino pretrain function: model_builder, dataset_builder, forward_step, and loss_func.
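
The callback pattern described above can be sketched with a toy driver. This is only an illustration of the wiring, not Domino's pretrain function: toy_pretrain and every toy_* callback are hypothetical stand-ins, and the real driver's signature and responsibilities (distributed setup, optimizer, scheduling) may differ. Here the loss function reaches the driver through the value returned by the forward step.

```python
from functools import partial


def toy_pretrain(model_builder, dataset_builder, forward_step, num_steps):
    """Hypothetical callback-driven driver; NOT the Domino pretrain API."""
    model = model_builder()
    train_ds, _valid_ds, _test_ds = dataset_builder([num_steps, 0, 0])
    data_iterator = iter(train_ds)
    logged = []
    for _ in range(num_steps):
        # The forward step returns the output plus a deferred loss callable.
        output, loss_partial = forward_step(data_iterator, model)
        _loss, metrics = loss_partial(output)
        logged.append(metrics["lm loss"])
    return logged


def toy_model_builder(pre_process=True, post_process=True):
    # Stand-in "model": squares its input.
    return lambda x: x * x


def toy_dataset_builder(train_val_test_num_samples):
    n_train, n_valid, n_test = train_val_test_num_samples
    # Each toy "dataset" is just a list of floats.
    train = [float(i) for i in range(1, n_train + 1)]
    return train, [0.0] * n_valid, [0.0] * n_test


def toy_loss_func(scale, output):
    # Stand-in loss: scales the output and reports it under 'lm loss'.
    loss = scale * output
    return loss, {"lm loss": loss}


def toy_forward_step(data_iterator, model):
    sample = next(data_iterator)
    return model(sample), partial(toy_loss_func, 0.5)


logged_losses = toy_pretrain(
    toy_model_builder, toy_dataset_builder, toy_forward_step, num_steps=3
)
```

The driver never needs to know how the model, data, or loss are implemented, which is the main appeal of passing them in as callbacks.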

The model_builder function constructs a GPTModel instance from the Domino-adapted Megatron-LM architecture using core_transformer_config_from_args. Its pre_process and post_process flags control whether the current pipeline-parallel stage includes the embedding layer and the output head, respectively. The dataset_builder function calls Megatron's build_train_valid_test_datasets with the configured data paths, sequence length, seed, and cache settings to produce the training, validation, and test datasets.
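
As a rough illustration of how the pre_process and post_process flags might be chosen per pipeline stage, the sketch below assumes the common convention that the first stage holds the embedding layer and the last stage holds the output head. The helper stage_flags is hypothetical and is not part of the script; the actual framework may derive these flags differently.

```python
from typing import Tuple


def stage_flags(stage_rank: int, num_stages: int) -> Tuple[bool, bool]:
    """Return (pre_process, post_process) for one pipeline stage.

    Hypothetical helper: assumes the first stage owns the embedding
    and the last stage owns the output head.
    """
    if not 0 <= stage_rank < num_stages:
        raise ValueError("stage_rank must be in [0, num_stages)")
    pre_process = stage_rank == 0
    post_process = stage_rank == num_stages - 1
    return pre_process, post_process


four_stage_flags = [stage_flags(rank, 4) for rank in range(4)]
single_stage_flags = stage_flags(0, 1)
```

With a single stage both flags are True, which matches the defaults of model_builder.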

The forward_step function implements the per-microbatch forward pass. It retrieves a batch via get_batch (which broadcasts data across tensor-parallel ranks using tensor_parallel.broadcast_data), splits the tokens into inputs and labels offset by one position for causal LM training, and generates attention masks and position IDs via get_ltor_masks_and_position_ids. It then returns the model output together with a partial of loss_func that already has loss_mask bound. The loss_func applies loss_mask to the raw per-token losses and averages them to obtain the loss, then reduces that loss across the data-parallel group via average_losses_across_data_parallel_group for logging.
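
The one-position offset and the left-to-right mask can be illustrated without any framework. The sketch below uses plain Python lists and hypothetical helper names; it is not the Megatron implementation. In particular, True here means "may attend", whereas the real helper may use a different mask convention and supports extra options that are omitted here.

```python
def split_inputs_and_labels(tokens):
    """Split seq_length + 1 tokens into inputs and next-token labels."""
    if len(tokens) < 2:
        raise ValueError("need at least two tokens")
    inputs = tokens[:-1]   # positions 0 .. n-1
    labels = tokens[1:]    # positions 1 .. n, i.e. the next token at each step
    return inputs, labels


def causal_mask_and_positions(seq_length):
    """Build a lower-triangular mask and sequential position IDs."""
    # mask[i][j] is True when position i may attend to position j (j <= i).
    mask = [[j <= i for j in range(seq_length)] for i in range(seq_length)]
    position_ids = list(range(seq_length))
    return mask, position_ids


inputs, labels = split_inputs_and_labels([10, 11, 12, 13, 14])
mask, position_ids = causal_mask_and_positions(len(inputs))
```

Each label is the token that follows the corresponding input, so the model is trained to predict position i + 1 from positions 0 through i.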

Usage

Use this script as the main entry point for GPT pretraining with DeepSpeed Domino. Launch it via the DeepSpeed distributed launcher with appropriate Megatron-LM and Domino configuration arguments. It is the top-level orchestrator that wires together model, data, and training loop components.

Code Reference

Source Location

Signature

def is_rank_0() -> bool:
def model_builder(pre_process=True, post_process=True) -> GPTModel:
def dataset_builder(train_val_test_num_samples) -> tuple:
def forward_step(data_iterator, model) -> tuple:
def get_batch(data_iterator) -> tuple:
def loss_func(loss_mask, output_tensor) -> tuple:

Import

from pretrain_gpt import model_builder, dataset_builder, forward_step, loss_func

I/O Contract

Inputs

Name | Type | Required | Description
pre_process | bool | No | Include embedding layer in pipeline stage (default: True)
post_process | bool | No | Include output head in pipeline stage (default: True)
train_val_test_num_samples | list of int | Yes | Number of samples for train, validation, and test splits
data_iterator | iterator | Yes | Iterator over tokenized text batches
model | GPTModel | Yes | The GPT model instance for forward computation
loss_mask | Tensor | Yes | Binary mask indicating which tokens contribute to loss
output_tensor | Tensor | Yes | Raw per-token loss values from the model
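
To make the roles of loss_mask and output_tensor concrete, here is a minimal sketch of masked averaging, assuming the loss is the sum of unmasked per-token losses divided by the number of unmasked tokens. It uses plain lists instead of tensors, and both function names are hypothetical. The second helper only mimics the effect of averaging across data-parallel ranks; the real reduction is a collective operation over the data-parallel group.

```python
def masked_mean_loss(per_token_losses, loss_mask):
    """Average only the per-token losses whose mask entry is 1."""
    if len(per_token_losses) != len(loss_mask):
        raise ValueError("losses and mask must have the same length")
    kept = sum(loss_mask)
    if kept == 0:
        raise ValueError("loss_mask selects no tokens")
    total = sum(loss * m for loss, m in zip(per_token_losses, loss_mask))
    return total / kept


def average_across_ranks(per_rank_losses):
    """Toy stand-in for a data-parallel average (no communication)."""
    return sum(per_rank_losses) / len(per_rank_losses)


rank0_loss = masked_mean_loss([2.0, 4.0, 6.0, 8.0], [1, 1, 0, 1])
rank1_loss = masked_mean_loss([1.0, 3.0], [1, 1])
logged_loss = average_across_ranks([rank0_loss, rank1_loss])
```

Masked positions (for example padding) contribute nothing to the numerator or the denominator, so they do not dilute the average.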

Outputs

Name | Type | Description
model | GPTModel | Constructed GPT model instance from model_builder
train_ds, valid_ds, test_ds | tuple of Dataset | Training, validation, and test datasets from dataset_builder
output_tensor | Tensor | Model output from forward_step
loss_func_partial | partial | Partial function binding loss_mask for deferred loss computation
loss | Tensor | Scalar masked and averaged loss value
metrics | dict | Dictionary with 'lm loss' key containing the data-parallel averaged loss
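
The deferred loss computation relies on functools.partial from the Python standard library. The sketch below shows the binding step and the (loss, metrics) return shape using a hypothetical single-process toy_loss_func on plain lists; because there is only one process, the logged value equals the local loss and no data-parallel averaging takes place.

```python
from functools import partial


def toy_loss_func(loss_mask, output_tensor):
    """Toy loss with the same argument order and return shape as above."""
    kept = sum(loss_mask)
    loss = sum(v * m for v, m in zip(output_tensor, loss_mask)) / kept
    # Single process: the logged metric is just the local loss.
    return loss, {"lm loss": loss}


# Bind the mask now; the model output is supplied later by the caller.
loss_mask = [1, 0, 1]
loss_func_partial = partial(toy_loss_func, loss_mask)

loss, metrics = loss_func_partial([3.0, 100.0, 5.0])
```

Binding loss_mask up front means the training loop only has to pass the model output, without knowing which batch-specific tensors the loss needs.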

Usage Examples

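# Entry point usage (typically via the command line; remaining arguments elided)
# python pretrain_gpt.py --deepspeed --deepspeed_config ds_config.json ...

# Programmatic usage of components.
# Assumes the Megatron/Domino arguments and distributed state are already
# initialized, since the builders read their configuration from the parsed args.
from pretrain_gpt import model_builder, dataset_builder, forward_step

model = model_builder(pre_process=True, post_process=True)
# Sample counts for the train, validation, and test splits
train_ds, valid_ds, test_ds = dataset_builder([1000, 100, 100])

# In the training loop; data_iterator is assumed to yield tokenized batches
output, loss_partial = forward_step(data_iterator, model)
# loss_partial already has loss_mask bound, so it only needs the model output
loss, metrics = loss_partial(output)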
