Implementation:Microsoft DeepSpeedExamples BingBert PytorchModeling

Knowledge Sources	Microsoft_DeepSpeedExamples
Domains	BERT Modeling, NLP
Last Updated	2026-02-07 12:00 GMT

Overview

The PyTorch BERT model implementation from the pytorch_pretrained_bert package, as used in the Bing BERT training pipeline.

Description

This module provides the standard PyTorch BERT model implementation based on the original Google AI and HuggingFace codebase. It includes the BertConfig class for model configuration, the complete transformer encoder architecture (embeddings, self-attention, feed-forward layers), and multiple task-specific model heads.

The implementation follows the original post-LayerNorm BERT architecture, in which layer normalization is applied after each sublayer's residual addition. It includes BertLayerNorm, a TensorFlow-style layer norm that adds epsilon inside the square root, GELU and Swish activation functions, and support for loading pretrained weights from both PyTorch and TensorFlow checkpoints.
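
The post-LayerNorm ordering described above can be illustrated with a minimal, framework-free sketch. This is not the module's code: layer_norm, gelu, and post_ln_sublayer are hypothetical names, the learnable scale and bias of the layer norm are left out, and the epsilon value of 1e-12 is an assumption rather than something stated on this page.

```python
import math

def gelu(x):
    # Exact (erf-based) GELU: x * Phi(x), where Phi is the standard normal CDF.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def layer_norm(vec, eps=1e-12):
    # TF-style layer norm: epsilon is added inside the square root.
    # The learnable scale and shift parameters are omitted for brevity.
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def post_ln_sublayer(x, sublayer):
    # Post-LayerNorm: normalize AFTER adding the residual, i.e. LN(x + f(x)).
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

# Toy hidden vector; the "sublayer" here is just an element-wise GELU.
hidden = [0.5, -1.0, 2.0, 0.0]
out = post_ln_sublayer(hidden, lambda v: [gelu(t) for t in v])
```

A pre-LayerNorm variant would instead compute x + f(LN(x)); only the position of the normalization relative to the residual addition differs.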

The module provides task-specific model classes including PreTrainedBertModel as the base class, BertForPreTraining (MLM + NSP), BertForMaskedLM, BertForNextSentencePrediction, BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification, and BertForQuestionAnswering. Pretrained model archives are available for base/large, cased/uncased, and multilingual variants.

Usage

Use this module for standard BERT modeling within the Bing BERT training framework when the original post-LayerNorm architecture is desired, or when loading pretrained HuggingFace BERT checkpoints.

Code Reference

Source Location

Repository: Microsoft_DeepSpeedExamples
File: training/bing_bert/pytorch_pretrained_bert/modeling.py
Lines: 1-1254

Signature

class BertConfig(object)
class BertLayerNorm(nn.Module)
class BertEmbeddings(nn.Module)
class BertSelfAttention(nn.Module)
class BertSelfOutput(nn.Module)
class BertAttention(nn.Module)
class BertIntermediate(nn.Module)
class BertOutput(nn.Module)
class BertLayer(nn.Module)
class BertEncoder(nn.Module)
class BertPooler(nn.Module)
class BertPredictionHeadTransform(nn.Module)
class BertLMPredictionHead(nn.Module)
class BertOnlyMLMHead(nn.Module)
class BertOnlyNSPHead(nn.Module)
class BertPreTrainingHeads(nn.Module)
class PreTrainedBertModel(nn.Module)
class BertModel(PreTrainedBertModel)
class BertForPreTraining(PreTrainedBertModel)
class BertForMaskedLM(PreTrainedBertModel)
class BertForNextSentencePrediction(PreTrainedBertModel)
class BertForSequenceClassification(PreTrainedBertModel)
class BertForMultipleChoice(PreTrainedBertModel)
class BertForTokenClassification(PreTrainedBertModel)
class BertForQuestionAnswering(PreTrainedBertModel)

Import

from pytorch_pretrained_bert.modeling import (
    BertConfig, BertModel, BertForPreTraining,
    BertForSequenceClassification, BertForQuestionAnswering,
    PreTrainedBertModel, WEIGHTS_NAME, CONFIG_NAME
)

I/O Contract

Inputs

Name	Type	Required	Description
vocab_size_or_config_json_file	int/str	Yes	Vocabulary size or path to JSON config file for BertConfig
input_ids	Tensor	Yes	Token IDs of shape (batch_size, seq_length)
token_type_ids	Tensor	No	Segment IDs of shape (batch_size, seq_length), defaults to zeros
attention_mask	Tensor	No	Attention mask of shape (batch_size, seq_length), defaults to ones
hidden_size	int	No	Encoder hidden dimension, default 768
num_hidden_layers	int	No	Number of transformer layers, default 12
num_attention_heads	int	No	Number of attention heads, default 12
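
The dual int/str behavior of the first argument can be sketched as follows. SketchBertConfig and attention_head_size are hypothetical names used only for illustration; the real BertConfig has more fields, and the divisibility check shown here is an assumption about how the attention layers constrain hidden_size, not a quotation of the module.

```python
import json

class SketchBertConfig:
    """Hypothetical stand-in for BertConfig; not the module's actual class."""

    def __init__(self, vocab_size_or_config_json_file, hidden_size=768,
                 num_hidden_layers=12, num_attention_heads=12):
        if isinstance(vocab_size_or_config_json_file, str):
            # A string is treated as a path to a JSON file of config fields.
            with open(vocab_size_or_config_json_file, "r", encoding="utf-8") as reader:
                for key, value in json.load(reader).items():
                    setattr(self, key, value)
        elif isinstance(vocab_size_or_config_json_file, int):
            # An int is treated as the vocabulary size; the rest use the defaults.
            self.vocab_size = vocab_size_or_config_json_file
            self.hidden_size = hidden_size
            self.num_hidden_layers = num_hidden_layers
            self.num_attention_heads = num_attention_heads
        else:
            raise ValueError("First argument must be an int (vocabulary size) "
                             "or a str (path to a JSON config file)")

    def attention_head_size(self):
        # Each head gets an equal slice of the hidden dimension.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be a multiple of num_attention_heads")
        return self.hidden_size // self.num_attention_heads

config = SketchBertConfig(30522)
head_size = config.attention_head_size()
```

With the defaults listed in the table, each of the 12 heads works on a 64-dimensional slice of the 768-dimensional hidden state.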

Outputs

Name	Type	Description
all_encoder_layers	list[Tensor]	Hidden states from each encoder layer, each of shape (batch_size, seq_length, hidden_size)
pooled_output	Tensor	Pooled [CLS] representation of shape (batch_size, hidden_size)
prediction_scores	Tensor	MLM logits of shape (batch_size, seq_length, vocab_size) for pretraining variants
seq_relationship_score	Tensor	NSP logits of shape (batch_size, 2) for pretraining variants
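
As a rough illustration of what "pooled [CLS] representation" means, the sketch below takes the hidden state at position 0 of one sequence and passes it through a dense layer followed by tanh. It assumes the pooler follows the usual BERT design; pool_first_token is a hypothetical helper, and the identity weights are toy values chosen for readability.

```python
import math

def pool_first_token(sequence_output, weight, bias):
    # sequence_output: seq_length rows, each a hidden_size-long list (one example).
    first = sequence_output[0]  # hidden state at position 0, the [CLS] token
    # Dense layer followed by tanh: tanh(W @ first + b)
    return [math.tanh(sum(w * h for w, h in zip(row, first)) + b)
            for row, b in zip(weight, bias)]

# Toy sizes: seq_length=3, hidden_size=2, with identity weights and zero bias.
seq_out = [[0.0, 1.0], [5.0, 5.0], [-3.0, 2.0]]
pooled = pool_first_token(seq_out, weight=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

Only the first row of seq_out influences the result, which is why pooled_output has shape (batch_size, hidden_size) with no seq_length dimension.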

Usage Examples

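import torch
from pytorch_pretrained_bert.modeling import BertConfig, BertModel

# Configuration matching BERT-base (uncased vocabulary size).
config = BertConfig(
    vocab_size_or_config_json_file=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072
)

# Randomly initialized model; no pretrained weights are loaded here.
model = BertModel(config)
model.eval()

# Dummy batch: 2 sequences of 8 tokens each.
input_ids = torch.randint(0, 30522, (2, 8))
token_type_ids = torch.zeros_like(input_ids)
attention_mask = torch.ones_like(input_ids)

with torch.no_grad():
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, attention_mask)
last_hidden = all_encoder_layers[-1]  # shape (2, 8, 768)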
