Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Implementation:Microsoft DeepSpeedExamples Domino GPT Model

From Leeroopedia
Revision as of 15:40, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_DeepSpeedExamples_Domino_GPT_Model.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)


Knowledge Sources
Domains Deep Learning, Language Modeling, Distributed Training
Last Updated 2026-02-07 12:00 GMT

Overview

A GPT-2 language model wrapper adapted from Megatron-LM for the DeepSpeed Domino distributed training framework, supporting pipeline parallelism with pre/post-processing stages.

Description

The GPTModel class extends MegatronModule to implement a GPT-2 language model for use with the DeepSpeed Domino distributed training system. It wraps the Megatron-LM get_language_model factory with causal attention masking (AttnMaskType.causal) and supports pipeline parallelism through configurable pre_process and post_process flags that control whether embedding and output-head layers are included in the current pipeline stage.

The forward method passes input tokens through the language model and, when post_process is enabled, calls post_language_model_processing to compute the final logits or loss. The post_language_model_processing function uses parallel_lm_logits for tensor-parallel logit computation and vocab_parallel_cross_entropy for efficient vocabulary-parallel loss calculation, supporting optional fp16_lm_cross_entropy for mixed-precision training. It handles the [s, b, h] to [b, s, h] tensor transpositions required by the Megatron layout convention.

The model supports both tied and untied embedding-output weight configurations via the untie_embeddings_and_output_weights argument. Custom state_dict_for_save_checkpoint and load_state_dict methods ensure correct serialization of the language model and optional word embedding weights across pipeline stages.

Usage

Use this model class when training GPT models with the DeepSpeed Domino framework, which provides optimized scheduling for hybrid data, tensor, and pipeline parallelism. It is instantiated by the model_builder function in pretrain_gpt.py and passed to the Domino pretrain function.

Code Reference

Source Location

Signature

def post_language_model_processing(lm_output, labels, logit_weights,
                                   parallel_output,
                                   fp16_lm_cross_entropy) -> Tensor:

class GPTModel(MegatronModule):
    def __init__(self,
                 config,
                 num_tokentypes=0,
                 parallel_output=True,
                 pre_process=True,
                 post_process=True):
    def set_input_tensor(self, input_tensor) -> None:
    def forward(self, input_ids, position_ids, attention_mask,
                retriever_input_ids=None, retriever_position_ids=None,
                retriever_attn_mask=None, labels=None,
                tokentype_ids=None, inference_params=None) -> Tensor:
    def state_dict_for_save_checkpoint(self, prefix='', keep_vars=False) -> dict:
    def load_state_dict(self, state_dict, strict=True) -> None:

Import

from domino.gpt_model import GPTModel

I/O Contract

Inputs

Name Type Required Description
config TransformerConfig Yes Megatron transformer configuration object
num_tokentypes int No Number of token types for embeddings (default: 0)
parallel_output bool No Whether to keep output tensor parallel (default: True)
pre_process bool No Include embedding layer in this pipeline stage (default: True)
post_process bool No Include output head in this pipeline stage (default: True)
input_ids Tensor Yes Input token IDs of shape [batch, seq_len]
position_ids Tensor Yes Position IDs of shape [batch, seq_len]
attention_mask Tensor Yes Attention mask tensor
labels Tensor No Target labels; if provided, returns loss instead of logits

Outputs

Name Type Description
output Tensor Logits of shape [batch, seq_len, vocab] when labels is None, or cross-entropy loss of shape [batch, seq_len] when labels are provided

Usage Examples

from domino.gpt_model import GPTModel
from megatron.arguments import core_transformer_config_from_args
from megatron import get_args

args = get_args()
config = core_transformer_config_from_args(args)

model = GPTModel(
    config=config,
    num_tokentypes=0,
    parallel_output=True,
    pre_process=True,
    post_process=True
)

# Forward pass with loss computation
loss = model(input_ids, position_ids, attention_mask, labels=labels)

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment