Implementation:Microsoft DeepSpeedExamples Domino Pretrain GPT

Knowledge Sources	Microsoft_DeepSpeedExamples
Domains	Deep Learning, Language Modeling, Distributed Training
Last Updated	2026-02-07 12:00 GMT

Overview

Entry point script for GPT pretraining using the DeepSpeed Domino framework, providing model construction, dataset building, forward step logic, and loss computation.

Description

This module serves as the main pretraining entry point for GPT models under the DeepSpeed Domino distributed training system. It defines four key functions that the Domino pretrain function relies on: model_builder, dataset_builder, and forward_step are supplied as callbacks, and loss_func reaches the training loop as the partial closure returned by forward_step.
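The calling convention can be sketched with a toy driver. This is a minimal illustration, not the Domino implementation: toy_pretrain, the stand-in builders, and the fake model are all hypothetical, and plain Python lists stand in for tensors. It only shows the order in which the callbacks are used and how the partial returned by forward_step is called later with the model output.

```python
from functools import partial


def model_builder(pre_process=True, post_process=True):
    # Hypothetical stand-in: the "model" just doubles each token id.
    return lambda tokens: [2.0 * t for t in tokens]


def dataset_builder(train_val_test_num_samples):
    # Hypothetical stand-in: each split is a list of tiny (tokens, loss_mask) batches.
    batch = ([1, 2, 3], [1, 1, 0])
    return tuple([batch] * n for n in train_val_test_num_samples)


def loss_func(loss_mask, output_tensor):
    # Masked mean over the positions where loss_mask is 1.
    total = sum(o * m for o, m in zip(output_tensor, loss_mask))
    loss = total / sum(loss_mask)
    return loss, {"lm loss": loss}


def forward_step(data_iterator, model):
    tokens, loss_mask = next(data_iterator)
    output = model(tokens)
    # Bind loss_mask now; the driver supplies the output later.
    return output, partial(loss_func, loss_mask)


def toy_pretrain(model_builder, dataset_builder, forward_step, num_samples):
    # Hypothetical driver: builds the model and data, then runs forward steps.
    model = model_builder()
    train_ds, _valid_ds, _test_ds = dataset_builder(num_samples)
    data_iterator = iter(train_ds)
    losses = []
    for _ in range(len(train_ds)):
        output, loss_partial = forward_step(data_iterator, model)
        loss, _metrics = loss_partial(output)
        losses.append(loss)
    return losses


losses = toy_pretrain(model_builder, dataset_builder, forward_step, [2, 1, 1])
```

Binding loss_mask inside forward_step keeps the driver generic: it never needs to know which batch fields the loss depends on.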

The model_builder function constructs a GPTModel instance from the Domino-adapted Megatron-LM architecture using core_transformer_config_from_args, with configurable pre/post-processing flags for pipeline parallelism. The dataset_builder function calls Megatron's build_train_valid_test_datasets with the appropriate data paths, sequence length, seed, and cache configuration to produce training, validation, and test datasets.
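As background for the pre/post-processing flags: in Megatron-LM-style pipeline parallelism, the embedding layer normally lives only on the first stage and the output head only on the last. The sketch below expresses that rule with a hypothetical helper, stage_flags, working on raw integers. How the real script obtains these values is not shown here, so treat this as an assumption about typical usage.

```python
def stage_flags(pipeline_rank, pipeline_world_size):
    """Return (pre_process, post_process) for one pipeline stage.

    Hypothetical helper: first stage owns the embedding, last stage owns the head.
    """
    pre_process = pipeline_rank == 0
    post_process = pipeline_rank == pipeline_world_size - 1
    return pre_process, post_process


# Flags for a four-stage pipeline, one tuple per stage.
flags = [stage_flags(rank, 4) for rank in range(4)]
```

With a single stage both flags are True, which matches the defaults of model_builder.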

The forward_step function implements the per-microbatch forward pass. It retrieves a batch via get_batch, which broadcasts data across tensor-parallel ranks using tensor_parallel.broadcast_data. The tokens are split into inputs and labels offset by one position for causal LM training, and attention masks and position IDs are generated via get_ltor_masks_and_position_ids. forward_step then returns the model output together with a partial of loss_func that already has loss_mask bound. loss_func applies loss_mask to the raw per-token losses and averages them, then averages the result across the data-parallel group via average_losses_across_data_parallel_group for logging.
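The batch preparation described above can be sketched in plain Python. This is an illustration under stated assumptions, not the library code: the helper names are hypothetical, lists stand in for tensors, and the mask uses True to mean "may attend", whereas the library helper may encode the mask differently and supports options that are omitted here.

```python
def split_inputs_and_labels(token_ids):
    # A sample of n + 1 tokens yields n inputs and n labels shifted by one,
    # so the label at position i is the token that follows input i.
    return token_ids[:-1], token_ids[1:]


def causal_mask(seq_len):
    # Left-to-right mask: position i may attend to positions j <= i.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def position_ids(seq_len):
    # Positions simply count up from zero within the sequence.
    return list(range(seq_len))


sample = [10, 11, 12, 13, 14]
inputs, labels = split_inputs_and_labels(sample)
mask = causal_mask(len(inputs))
positions = position_ids(len(inputs))
```

Here inputs is [10, 11, 12, 13] and labels is [11, 12, 13, 14], so each input position is trained to predict its successor.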

Usage

Use this script as the main entry point for GPT pretraining with DeepSpeed Domino. Launch it via the DeepSpeed distributed launcher with appropriate Megatron-LM and Domino configuration arguments. It is the top-level orchestrator that wires together model, data, and training loop components.

Code Reference

Source Location

Repository: Microsoft_DeepSpeedExamples
File: training/DeepSpeed-Domino/pretrain_gpt.py
Lines: 1-114

Signature

def is_rank_0() -> bool:
def model_builder(pre_process=True, post_process=True) -> GPTModel:
def dataset_builder(train_val_test_num_samples) -> tuple:
def forward_step(data_iterator, model) -> tuple:
def get_batch(data_iterator) -> tuple:
def loss_func(loss_mask, output_tensor) -> tuple:

Import

from pretrain_gpt import model_builder, dataset_builder, forward_step, loss_func

I/O Contract

Inputs

Name	Type	Required	Description
pre_process	bool	No	Include embedding layer in pipeline stage (default: True)
post_process	bool	No	Include output head in pipeline stage (default: True)
train_val_test_num_samples	list of int	Yes	Number of samples for train, validation, and test splits
data_iterator	iterator	Yes	Iterator over tokenized text batches
model	GPTModel	Yes	The GPT model instance for forward computation
loss_mask	Tensor	Yes	Binary mask indicating which tokens contribute to loss
output_tensor	Tensor	Yes	Raw per-token loss values from the model

Outputs

Name	Type	Description
model	GPTModel	Constructed GPT model instance from model_builder
train_ds, valid_ds, test_ds	tuple of Dataset	Training, validation, and test datasets from dataset_builder
output_tensor	Tensor	Model output from forward_step
loss_func_partial	partial	Partial function binding loss_mask for deferred loss computation
loss	Tensor	Scalar masked and averaged loss value
metrics	dict	Dictionary with 'lm loss' key containing the data-parallel averaged loss
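
The loss and metrics rows can be illustrated with a small sketch. It assumes the masked average is the sum of the masked per-token losses divided by the number of unmasked tokens, which is the usual Megatron-LM convention, and it simulates the data-parallel reduction by averaging a list of per-rank values instead of calling a collective. masked_mean_loss and average_across_ranks are hypothetical names, and plain floats stand in for tensors.

```python
def masked_mean_loss(per_token_losses, loss_mask):
    # Zero out masked positions, then divide by the count of unmasked tokens.
    masked_sum = sum(l * m for l, m in zip(per_token_losses, loss_mask))
    return masked_sum / sum(loss_mask)


def average_across_ranks(per_rank_losses):
    # Stand-in for a data-parallel all-reduce followed by division by group size.
    return sum(per_rank_losses) / len(per_rank_losses)


# Two simulated data-parallel ranks, each with its own microbatch.
rank0 = masked_mean_loss([2.0, 4.0, 9.0], [1, 1, 0])  # last token is masked out
rank1 = masked_mean_loss([1.0, 1.0, 1.0], [1, 1, 1])
metrics = {"lm loss": average_across_ranks([rank0, rank1])}
```

The per-rank loss is what would drive the backward pass, while the averaged value in metrics is intended for logging.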

Usage Examples

# Entry point usage (typically via command line)
# python pretrain_gpt.py --deepspeed --deepspeed_config ds_config.json ...

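# Programmatic usage of components
# Assumes the Megatron-LM/Domino arguments and distributed state are already initialized
from pretrain_gpt import model_builder, dataset_builder, forward_step

model = model_builder(pre_process=True, post_process=True)
train_ds, valid_ds, test_ds = dataset_builder([1000, 100, 100])

# In training loop; data_iterator is an iterator over batches (not constructed here)
output, loss_partial = forward_step(data_iterator, model)
# loss_partial already has loss_mask bound, so only the model output is passed
loss, metrics = loss_partial(output)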
