Implementation:Microsoft DeepSpeedExamples BingBert PytorchModeling

From Leeroopedia


Knowledge Sources
Domains: BERT Modeling, NLP
Last Updated: 2026-02-07 12:00 GMT

Overview

PyTorch BERT model implementation from the pytorch_pretrained_bert package used in the Bing BERT training pipeline.

Description

This module provides the standard PyTorch BERT model implementation based on the original Google AI and HuggingFace codebase. It includes the BertConfig class for model configuration, the complete transformer encoder architecture (embeddings, self-attention, feed-forward layers), and multiple task-specific model heads.
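As a rough illustration of what BertConfig carries, the sketch below writes a BERT-base style configuration to a JSON file of the kind that can be passed as vocab_size_or_config_json_file. The field names beyond those listed on this page (for example "vocab_size", the dropout rates, and "max_position_embeddings") are assumptions based on the upstream pytorch_pretrained_bert package and may differ in this fork. The validate_config helper is hypothetical and is not part of the module.

```python
import json
import os
import tempfile

# Field names mirror the BertConfig keyword arguments documented on this page.
# "vocab_size" and the dropout / position / initializer fields are assumptions
# taken from the upstream pytorch_pretrained_bert package.
bert_base_config = {
    "vocab_size": 30522,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "intermediate_size": 3072,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.1,
    "attention_probs_dropout_prob": 0.1,
    "max_position_embeddings": 512,
    "type_vocab_size": 2,
    "initializer_range": 0.02,
}


def validate_config(cfg):
    """Hypothetical helper: return the per-head size, or raise if the
    hidden size cannot be split evenly across the attention heads."""
    if cfg["hidden_size"] % cfg["num_attention_heads"] != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return cfg["hidden_size"] // cfg["num_attention_heads"]


head_size = validate_config(bert_base_config)

# Write the config to a temporary file; the resulting path is the kind of
# value that could be passed instead of an integer vocabulary size.
config_path = os.path.join(tempfile.mkdtemp(), "bert_config.json")
with open(config_path, "w", encoding="utf-8") as f:
    json.dump(bert_base_config, f, indent=2)
```

Checking divisibility up front is a design convenience: multi-head attention splits the hidden dimension evenly across heads, so a mismatch is cheaper to catch before any model is constructed.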

The implementation follows the original post-LayerNorm BERT architecture: each sub-layer's output is added to its residual input, and layer normalization is applied to that sum. It includes BertLayerNorm (a TensorFlow-style layer normalization that adds epsilon inside the square root), GELU and Swish activation functions, and support for loading pretrained weights from both PyTorch and TensorFlow checkpoints.
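The pieces named in this paragraph can be sketched in plain Python on a single feature vector. This is a minimal illustration, not the module's code: the real classes operate on PyTorch tensors, the default epsilon of 1e-12 is an assumption taken from the upstream package, and post_ln_block is a hypothetical name used only here.

```python
import math


def layer_norm(x, weight, bias, eps=1e-12):
    """Normalize one feature vector; eps is added inside the square root."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    denom = math.sqrt(var + eps)
    return [w * (v - mean) / denom + b for v, w, b in zip(x, weight, bias)]


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def swish(x):
    """Swish: x * sigmoid(x)."""
    return x * (1.0 / (1.0 + math.exp(-x)))


def post_ln_block(x, sublayer, weight, bias):
    """Post-LayerNorm residual: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    summed = [a + b for a, b in zip(x, y)]
    return layer_norm(summed, weight, bias)


# Toy usage: a stand-in "sublayer" that applies GELU element-wise.
features = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
normalized = post_ln_block(features, lambda v: [gelu(t) for t in v], ones, zeros)
```

In a pre-LayerNorm variant the normalization would instead be applied to x before the sub-layer, with the residual added afterwards; the ordering above is what distinguishes the post-LayerNorm design described here.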

PreTrainedBertModel is the base class for all models in the module. On top of the BertModel encoder, the module provides task-specific classes: BertForPreTraining (MLM + NSP), BertForMaskedLM, BertForNextSentencePrediction, BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification, and BertForQuestionAnswering. Pretrained model archives are available for base/large, cased/uncased, and multilingual variants.

Usage

Use this module for standard BERT modeling within the Bing BERT training framework when the original post-LayerNorm architecture is desired, or when loading pretrained HuggingFace BERT checkpoints.

Code Reference

Source Location

Signature

class BertConfig(object)
class BertLayerNorm(nn.Module)
class BertEmbeddings(nn.Module)
class BertSelfAttention(nn.Module)
class BertSelfOutput(nn.Module)
class BertAttention(nn.Module)
class BertIntermediate(nn.Module)
class BertOutput(nn.Module)
class BertLayer(nn.Module)
class BertEncoder(nn.Module)
class BertPooler(nn.Module)
class BertPredictionHeadTransform(nn.Module)
class BertLMPredictionHead(nn.Module)
class BertOnlyMLMHead(nn.Module)
class BertOnlyNSPHead(nn.Module)
class BertPreTrainingHeads(nn.Module)
class PreTrainedBertModel(nn.Module)
class BertModel(PreTrainedBertModel)
class BertForPreTraining(PreTrainedBertModel)
class BertForMaskedLM(PreTrainedBertModel)
class BertForNextSentencePrediction(PreTrainedBertModel)
class BertForSequenceClassification(PreTrainedBertModel)
class BertForMultipleChoice(PreTrainedBertModel)
class BertForTokenClassification(PreTrainedBertModel)
class BertForQuestionAnswering(PreTrainedBertModel)

Import

from pytorch_pretrained_bert.modeling import (
    BertConfig, BertModel, BertForPreTraining,
    BertForSequenceClassification, BertForQuestionAnswering,
    PreTrainedBertModel, WEIGHTS_NAME, CONFIG_NAME
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| vocab_size_or_config_json_file | int/str | Yes | Vocabulary size or path to JSON config file for BertConfig |
| input_ids | Tensor | Yes | Token IDs of shape (batch_size, seq_length) |
| token_type_ids | Tensor | No | Segment IDs of shape (batch_size, seq_length), defaults to zeros |
| attention_mask | Tensor | No | Attention mask of shape (batch_size, seq_length), defaults to ones |
| hidden_size | int | No | Encoder hidden dimension, default 768 |
| num_hidden_layers | int | No | Number of transformer layers, default 12 |
| num_attention_heads | int | No | Number of attention heads, default 12 |
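The defaults for the optional inputs can be sketched with nested lists in place of tensors. The conversion of the 0/1 attention mask into an additive bias, and the value -10000.0, are assumptions based on how the upstream pytorch_pretrained_bert package treats padding; both function names are hypothetical and used only for illustration.

```python
def default_inputs(input_ids, token_type_ids=None, attention_mask=None):
    """Fill in the documented defaults: segment IDs of zeros, mask of ones."""
    if token_type_ids is None:
        token_type_ids = [[0] * len(row) for row in input_ids]
    if attention_mask is None:
        attention_mask = [[1] * len(row) for row in input_ids]
    return token_type_ids, attention_mask


def additive_attention_bias(attention_mask, masked_value=-10000.0):
    """Map mask 1 -> 0.0 (attend) and mask 0 -> masked_value (ignore).

    The result is meant to be added to raw attention scores before softmax,
    so masked positions receive a probability close to zero.
    """
    return [[(1.0 - m) * masked_value for m in row] for row in attention_mask]


# Toy batch: two sequences padded to length 4 (0 marks padding here).
input_ids = [[101, 7592, 102, 0], [101, 102, 0, 0]]
segment_ids, full_mask = default_inputs(input_ids)
padding_mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
bias = additive_attention_bias(padding_mask)
```

Note that the default mask of ones attends to every position, including padding, so padded batches generally need an explicit attention_mask.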

Outputs

| Name | Type | Description |
|------|------|-------------|
| all_encoder_layers | list[Tensor] | Hidden states from each encoder layer, each of shape (batch_size, seq_length, hidden_size) |
| pooled_output | Tensor | Pooled [CLS] representation of shape (batch_size, hidden_size) |
| prediction_scores | Tensor | MLM logits of shape (batch_size, seq_length, vocab_size) for pretraining variants |
| seq_relationship_score | Tensor | NSP logits of shape (batch_size, 2) for pretraining variants |
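When wiring the model into a training loop, it can help to compute the expected output shapes up front and compare them against what the model returns. The helper below is a hypothetical bookkeeping sketch derived only from the table above; it does not call the model.

```python
def expected_output_shapes(batch_size, seq_length, hidden_size,
                           vocab_size, num_hidden_layers):
    """Return the output shapes documented in the I/O contract as tuples."""
    return {
        # One hidden-state tensor per encoder layer.
        "all_encoder_layers": [(batch_size, seq_length, hidden_size)]
        * num_hidden_layers,
        # Pooled [CLS] representation.
        "pooled_output": (batch_size, hidden_size),
        # Masked-LM logits over the vocabulary (pretraining variants).
        "prediction_scores": (batch_size, seq_length, vocab_size),
        # Next-sentence-prediction logits (pretraining variants).
        "seq_relationship_score": (batch_size, 2),
    }


# Shapes for a BERT-base sized model on a batch of 2 sequences of length 16.
shapes = expected_output_shapes(
    batch_size=2, seq_length=16, hidden_size=768,
    vocab_size=30522, num_hidden_layers=12,
)
```

With real tensors, each entry could be compared against tuple(tensor.shape) as a quick sanity check.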

Usage Examples

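import torch
from pytorch_pretrained_bert.modeling import BertConfig, BertModel

# BERT-base sized configuration.
config = BertConfig(
    vocab_size_or_config_json_file=30522,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072
)

model = BertModel(config)
model.eval()  # disable dropout for inference

# Dummy batch for illustration: 2 sequences of 16 random token IDs.
input_ids = torch.randint(0, 30522, (2, 16))
token_type_ids = torch.zeros_like(input_ids)  # single-segment input
attention_mask = torch.ones_like(input_ids)   # no padding

with torch.no_grad():
    all_encoder_layers, pooled_output = model(input_ids, token_type_ids, attention_mask)
last_hidden = all_encoder_layers[-1]  # hidden states of the final encoder layer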
