Implementation:Microsoft DeepSpeedExamples BingBert PytorchModeling
| Knowledge Sources | |
|---|---|
| Domains | BERT Modeling, NLP |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
A PyTorch BERT model implementation from the pytorch_pretrained_bert package, as used in the Bing BERT training pipeline.
Description
This module provides the standard PyTorch BERT implementation derived from the original Google AI and HuggingFace codebases. It includes the BertConfig class for model configuration, the complete transformer encoder (embeddings, self-attention, and feed-forward layers), and multiple task-specific model heads.
The implementation follows the original post-LayerNorm BERT architecture, in which layer normalization is applied to the sum of each sublayer's output and its residual input. It includes BertLayerNorm with custom epsilon handling, GELU and Swish activation functions, and support for loading pretrained weights from both PyTorch and TensorFlow checkpoints.
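As a rough illustration of the pieces named above, the following pure-Python sketch (standard library only, not the module's tensor code) shows an erf-based GELU, Swish, a layer normalization with epsilon added inside the square root, and the post-LayerNorm residual pattern. The helper names `layer_norm` and `post_ln_block`, the default `eps=1e-12`, the placement of epsilon, and the omission of the learned scale and bias are assumptions made for illustration, not confirmed details of this file.

```python
import math

def gelu(x):
    """Erf-based GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    """Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def layer_norm(vec, eps=1e-12):
    """Normalize one hidden vector; eps sits inside the square root (assumed).
    The learned scale and bias are omitted (treated as 1 and 0)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def post_ln_block(vec, sublayer):
    """Post-LayerNorm: normalize the sum of the input and the sublayer output."""
    out = sublayer(vec)
    return layer_norm([a + b for a, b in zip(vec, out)])

# Toy hidden vector passed through a GELU "sublayer" (hypothetical stand-in
# for attention or the feed-forward network).
hidden = [0.5, -1.0, 2.0, 0.0]
normalized = post_ln_block(hidden, lambda v: [gelu(x) for x in v])
```

In a pre-LayerNorm variant the normalization would instead be applied to the sublayer input, leaving the residual path unnormalized; the sketch above covers only the post-LayerNorm ordering this module describes.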
PreTrainedBertModel serves as the base class for BertModel and for the task-specific classes: BertForPreTraining (masked language modeling, MLM, plus next-sentence prediction, NSP), BertForMaskedLM, BertForNextSentencePrediction, BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification, and BertForQuestionAnswering. Pretrained model archives are available for base/large, cased/uncased, and multilingual variants.
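To make the MLM + NSP pairing concrete, here is a minimal standard-library sketch of how a pretraining loss could be formed from the two heads' logits. It assumes the total loss is the plain sum of the two cross-entropy terms and that a label of -1 marks token positions excluded from the MLM loss, as in the upstream pytorch_pretrained_bert package; neither detail is confirmed for this file. The helper names and the label variable names are hypothetical.

```python
import math

def cross_entropy(logits, label):
    """Negative log-softmax of the target class for one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[label]

def mean_cross_entropy(all_logits, labels, ignore_index=-1):
    """Average cross-entropy over positions whose label is not ignore_index."""
    losses = [cross_entropy(lg, lb)
              for lg, lb in zip(all_logits, labels) if lb != ignore_index]
    return sum(losses) / len(losses) if losses else 0.0

# Toy example: 3 token positions over a 4-word vocabulary; only position 1 is masked.
prediction_scores = [[0.1, 0.2, 0.3, 0.4], [2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
masked_lm_labels = [-1, 0, -1]   # -1 = position ignored (assumed convention)
# One sentence pair with two NSP logits.
seq_relationship_score = [[0.0, 0.0]]
next_sentence_label = [1]

masked_lm_loss = mean_cross_entropy(prediction_scores, masked_lm_labels)
next_sentence_loss = mean_cross_entropy(seq_relationship_score, next_sentence_label)
total_loss = masked_lm_loss + next_sentence_loss  # assumed: unweighted sum
```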
Usage
Use this module for standard BERT modeling within the Bing BERT training framework when the original post-LayerNorm architecture is desired, or when loading pretrained HuggingFace BERT checkpoints.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: training/bing_bert/pytorch_pretrained_bert/modeling.py
- Lines: 1-1254
Signature
class BertConfig(object)
class BertLayerNorm(nn.Module)
class BertEmbeddings(nn.Module)
class BertSelfAttention(nn.Module)
class BertSelfOutput(nn.Module)
class BertAttention(nn.Module)
class BertIntermediate(nn.Module)
class BertOutput(nn.Module)
class BertLayer(nn.Module)
class BertEncoder(nn.Module)
class BertPooler(nn.Module)
class BertPredictionHeadTransform(nn.Module)
class BertLMPredictionHead(nn.Module)
class BertOnlyMLMHead(nn.Module)
class BertOnlyNSPHead(nn.Module)
class BertPreTrainingHeads(nn.Module)
class PreTrainedBertModel(nn.Module)
class BertModel(PreTrainedBertModel)
class BertForPreTraining(PreTrainedBertModel)
class BertForMaskedLM(PreTrainedBertModel)
class BertForNextSentencePrediction(PreTrainedBertModel)
class BertForSequenceClassification(PreTrainedBertModel)
class BertForMultipleChoice(PreTrainedBertModel)
class BertForTokenClassification(PreTrainedBertModel)
class BertForQuestionAnswering(PreTrainedBertModel)
Import
from pytorch_pretrained_bert.modeling import (
    BertConfig, BertModel, BertForPreTraining,
    BertForSequenceClassification, BertForQuestionAnswering,
    PreTrainedBertModel, WEIGHTS_NAME, CONFIG_NAME
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vocab_size_or_config_json_file | int/str | Yes | Vocabulary size or path to JSON config file for BertConfig |
| input_ids | Tensor | Yes | Forward input: token IDs of shape (batch_size, seq_length) |
| token_type_ids | Tensor | No | Forward input: segment IDs of shape (batch_size, seq_length); defaults to all zeros |
| attention_mask | Tensor | No | Forward input: attention mask of shape (batch_size, seq_length); defaults to all ones |
| hidden_size | int | No | BertConfig argument: encoder hidden dimension, default 768 |
| num_hidden_layers | int | No | BertConfig argument: number of transformer layers, default 12 |
| num_attention_heads | int | No | BertConfig argument: number of attention heads, default 12 |
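The defaults in the table can be sketched with plain Python lists in place of tensors. The example below fills in a missing `token_type_ids` and `attention_mask`, converts a 0/1 mask into an additive mask for attention scores, and derives the per-head dimension from `hidden_size` and `num_attention_heads`. The function names, the -10000.0 penalty, and the divisibility check follow common BERT implementations and are assumptions about this file rather than confirmed details.

```python
def check_head_size(hidden_size=768, num_attention_heads=12):
    """Return the per-head dimension; hidden_size must split evenly across heads."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be a multiple of num_attention_heads")
    return hidden_size // num_attention_heads

def default_inputs(input_ids, token_type_ids=None, attention_mask=None):
    """Fill in the documented defaults: zeros for segments, ones for the mask."""
    if token_type_ids is None:
        token_type_ids = [[0] * len(row) for row in input_ids]
    if attention_mask is None:
        attention_mask = [[1] * len(row) for row in input_ids]
    return token_type_ids, attention_mask

def additive_mask(attention_mask, penalty=-10000.0):
    """Map 1 to zero (attend) and 0 to penalty (ignore); added to raw attention scores."""
    return [[(1.0 - m) * penalty for m in row] for row in attention_mask]

# One sequence of four positions; the IDs are illustrative and the last one is padding.
input_ids = [[101, 7592, 102, 0]]
token_type_ids, attention_mask = default_inputs(input_ids)
padded_mask = additive_mask([[1, 1, 1, 0]])
head_size = check_head_size()
```

When a batch contains padding, passing an explicit `attention_mask` matters: the all-ones default treats padding tokens as real input.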
Outputs
| Name | Type | Description |
|---|---|---|
| all_encoder_layers | list[Tensor] | Hidden states from each encoder layer, each of shape (batch_size, seq_length, hidden_size) |
| pooled_output | Tensor | Pooled [CLS] representation of shape (batch_size, hidden_size) |
| prediction_scores | Tensor | MLM logits of shape (batch_size, seq_length, vocab_size) for pretraining variants |
| seq_relationship_score | Tensor | NSP logits of shape (batch_size, 2) for pretraining variants |
Usage Examples
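import torch
from pytorch_pretrained_bert.modeling import BertConfig, BertModel

# BERT-base configuration
config = BertConfig(
    vocab_size_or_config_json_file=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072
)
model = BertModel(config)
model.eval()  # disable dropout for inference

# Dummy batch: 2 sequences of 8 tokens each
input_ids = torch.randint(0, 30522, (2, 8))
token_type_ids = torch.zeros_like(input_ids)
attention_mask = torch.ones_like(input_ids)

all_encoder_layers, pooled_output = model(input_ids, token_type_ids, attention_mask)
last_hidden = all_encoder_layers[-1]  # (batch_size, seq_length, hidden_size)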
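Because `vocab_size_or_config_json_file` also accepts a path, a configuration can be kept in a JSON file instead of being spelled out in code. The sketch below writes such a file using only the standard library. The key names and values mirror the publicly released BERT-base `bert_config.json`; that this module reads exactly these keys is an assumption, and the temporary file location is purely illustrative.

```python
import json
import os
import tempfile

# Keys and values follow the published BERT-base config (assumed to match
# what BertConfig expects when given a JSON path).
bert_base_config = {
    "attention_probs_dropout_prob": 0.1,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "hidden_size": 768,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "max_position_embeddings": 512,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "type_vocab_size": 2,
    "vocab_size": 30522,
}

# Write the config to a temporary directory (illustrative location).
config_path = os.path.join(tempfile.mkdtemp(), "bert_config.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(bert_base_config, f, indent=2)

# The path could then replace the integer vocabulary size:
# config = BertConfig(vocab_size_or_config_json_file=config_path)

# Read the file back to confirm it round-trips.
with open(config_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

Listing every field in the file is the safer choice, since a JSON-driven config may not fall back to the constructor defaults for omitted keys.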