Implementation:Microsoft DeepSpeedExamples Domino Pretrain GPT
| Knowledge Sources | |
|---|---|
| Domains | Deep Learning, Language Modeling, Distributed Training |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Entry point script for GPT pretraining using the DeepSpeed Domino framework, providing model construction, dataset building, forward step logic, and loss computation.
Description
This module is the main pretraining entry point for GPT models under the DeepSpeed Domino distributed training system. It defines four key functions that the Domino pretrain function relies on: model_builder, dataset_builder, and forward_step act as callbacks, while loss_func is bound to a loss mask inside forward_step and returned alongside the model output.
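To make the callback pattern concrete, here is a minimal sketch of a driver that consumes these functions in order. The name `toy_pretrain` and every stub body are hypothetical stand-ins, not Domino's actual pretrain implementation; the sketch assumes the loss function reaches the loop as the callable returned by forward_step.

```python
def model_builder(pre_process=True, post_process=True):
    # Stub "model" that doubles every input value.
    # The real builder returns a GPTModel instance.
    return lambda tokens: [2.0 * t for t in tokens]


def dataset_builder(train_val_test_num_samples):
    # Stub datasets: one list of identical samples per split.
    return tuple([[1.0, 2.0, 3.0]] * n for n in train_val_test_num_samples)


def forward_step(data_iterator, model):
    tokens = next(data_iterator)
    output = model(tokens)

    def loss_fn(output_tensor):
        # Stub loss: plain mean of the output values.
        loss = sum(output_tensor) / len(output_tensor)
        return loss, {"lm loss": loss}

    return output, loss_fn


def toy_pretrain(model_builder, dataset_builder, forward_step, num_samples):
    # Hypothetical driver: build the model and datasets once,
    # then run one forward step per training sample.
    model = model_builder()
    train_ds, _valid_ds, _test_ds = dataset_builder(num_samples)
    data_iterator = iter(train_ds)
    history = []
    for _ in range(len(train_ds)):
        output, loss_fn = forward_step(data_iterator, model)
        _loss, metrics = loss_fn(output)
        history.append(metrics["lm loss"])
    return history


history = toy_pretrain(model_builder, dataset_builder, forward_step, [2, 1, 1])
```

The driver never needs to know how the model or datasets are built, which is why the entry point can stay small and delegate the training loop to the framework.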
The model_builder function constructs a GPTModel instance from the Domino-adapted Megatron-LM architecture using core_transformer_config_from_args, with configurable pre/post-processing flags for pipeline parallelism. The dataset_builder function calls Megatron's build_train_valid_test_datasets with the appropriate data paths, sequence length, seed, and cache configuration to produce training, validation, and test datasets.
The forward_step function implements the per-microbatch forward pass. It retrieves a batch via get_batch, which broadcasts data across tensor-parallel ranks using tensor_parallel.broadcast_data, splits the tokens into inputs and labels offset by one position for causal LM training, and generates attention masks and position IDs via get_ltor_masks_and_position_ids. It then returns the model output together with loss_func bound to the batch's loss_mask through functools.partial. The loss_func multiplies the raw per-token losses by loss_mask and averages them over the unmasked tokens, then averages the result across the data-parallel group via average_losses_across_data_parallel_group for logging.
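The batch preparation described above can be sketched with plain Python lists standing in for tensors. The helper names `split_inputs_and_labels` and `ltor_mask_and_positions` are hypothetical, and the "True means may attend" mask convention is an assumption made for readability; the real get_ltor_masks_and_position_ids works on batched tensors and may handle further cases that this sketch omits.

```python
def split_inputs_and_labels(tokens):
    # Causal LM: the input at position i is trained to predict token i + 1,
    # so labels are the same sequence shifted left by one.
    return tokens[:-1], tokens[1:]


def ltor_mask_and_positions(seq_len):
    # allowed[i][j] is True when position i may attend to position j,
    # which for a left-to-right model means j <= i.
    allowed = [[j <= i for j in range(seq_len)] for i in range(seq_len)]
    position_ids = list(range(seq_len))
    return allowed, position_ids


tokens = [11, 12, 13, 14, 15]  # one more token id than the model consumes
inputs, labels = split_inputs_and_labels(tokens)
allowed, position_ids = ltor_mask_and_positions(len(inputs))
```

Here `inputs` is `[11, 12, 13, 14]` and `labels` is `[12, 13, 14, 15]`, so each input position is paired with the token that follows it.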
Usage
Use this script as the main entry point for GPT pretraining with DeepSpeed Domino. Launch it via the DeepSpeed distributed launcher with appropriate Megatron-LM and Domino configuration arguments. It is the top-level orchestrator that wires together model, data, and training loop components.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: training/DeepSpeed-Domino/pretrain_gpt.py
- Lines: 1-114
Signature
def is_rank_0() -> bool:
def model_builder(pre_process=True, post_process=True) -> GPTModel:
def dataset_builder(train_val_test_num_samples) -> tuple:
def forward_step(data_iterator, model) -> tuple:
def get_batch(data_iterator) -> tuple:
def loss_func(loss_mask, output_tensor) -> tuple:
Import
from pretrain_gpt import model_builder, dataset_builder, forward_step, loss_func
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| pre_process | bool | No | Include embedding layer in pipeline stage (default: True) |
| post_process | bool | No | Include output head in pipeline stage (default: True) |
| train_val_test_num_samples | list of int | Yes | Number of samples for train, validation, and test splits |
| data_iterator | iterator | Yes | Iterator over tokenized text batches |
| model | GPTModel | Yes | The GPT model instance for forward computation |
| loss_mask | Tensor | Yes | Binary mask indicating which tokens contribute to loss |
| output_tensor | Tensor | Yes | Raw per-token loss values from the model |
Outputs
| Name | Type | Description |
|---|---|---|
| model | GPTModel | Constructed GPT model instance from model_builder |
| train_ds, valid_ds, test_ds | tuple of Dataset | Training, validation, and test datasets from dataset_builder |
| output_tensor | Tensor | Model output from forward_step |
| loss_func_partial | partial | Partial function binding loss_mask for deferred loss computation |
| loss | Tensor | Scalar masked and averaged loss value |
| metrics | dict | Dictionary with 'lm loss' key containing the data-parallel averaged loss |
Usage Examples
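# Entry point usage (typically via command line)
# python pretrain_gpt.py --deepspeed --deepspeed_config ds_config.json ...
# Programmatic usage of components.
# Assumes the Megatron/Domino arguments and distributed state are already
# initialized, since the builders read their configuration from them.
from pretrain_gpt import model_builder, dataset_builder, forward_step
model = model_builder(pre_process=True, post_process=True)
# Sample counts for the train, validation, and test splits.
train_ds, valid_ds, test_ds = dataset_builder([1000, 100, 100])
# In training loop: data_iterator is supplied by the loop and yields tokenized batches.
output, loss_partial = forward_step(data_iterator, model)
# loss_partial already has loss_mask bound, so only the model output is passed.
loss, metrics = loss_partial(output)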
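The final call works because the loss function already has loss_mask bound to it. A minimal sketch of that deferred call follows, using plain lists in place of tensors and assuming the average is the sum of masked losses divided by the mask sum. `average_across_ranks` is a hypothetical single-process stand-in for average_losses_across_data_parallel_group, which in real training is a collective operation.

```python
from functools import partial


def average_across_ranks(per_rank_losses):
    # Hypothetical stand-in: a plain mean over the losses of each rank.
    return sum(per_rank_losses) / len(per_rank_losses)


def loss_func(loss_mask, output_tensor):
    # Zero out masked positions, then average over the unmasked ones only.
    masked_sum = sum(l * m for l, m in zip(output_tensor, loss_mask))
    loss = masked_sum / sum(loss_mask)
    # Only one "rank" exists in this sketch, so the averaged value
    # equals the local loss.
    averaged = average_across_ranks([loss])
    return loss, {"lm loss": averaged}


per_token_losses = [2.0, 4.0, 9.0, 6.0]
loss_mask = [1.0, 1.0, 0.0, 1.0]  # third token does not contribute

# Bind the mask now; the training loop supplies the output later.
loss_partial = partial(loss_func, loss_mask)
loss, metrics = loss_partial(per_token_losses)
```

Binding the mask ahead of time lets the training loop treat every loss function as a single-argument callable, regardless of what per-batch state it needs.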