Implementation:Microsoft DeepSpeedExamples BingBert TuringDataset

Knowledge Sources	Microsoft_DeepSpeedExamples
Domains	Data Loading, BERT Pretraining
Last Updated	2026-02-07 12:00 GMT

Overview

PyTorch Dataset classes from the Bing BERT / Turing pipeline, covering query-passage (QA), ranking, and masked language model pretraining tasks.

Description

This module defines multiple PyTorch Dataset implementations used in the Bing BERT / Turing training pipeline. It includes QADataset for query-passage pair prediction, QAFinetuningDataset for fine-tuning on query-passage data, RankingDataset for query-instance ranking tasks, and PreTrainingDataset for masked language modeling (MLM) and next sentence prediction (NSP) pretraining.

PreTrainingDataset is the primary dataset class for BERT pretraining. It loads either NumPy-based training data (via NumpyPretrainingDataCreator) or validation data, selected by the data_type argument. It applies masked language modeling dynamically, with a configurable masking probability (default 0.15) and a cap on the number of predictions per sequence. Masking follows the original BERT strategy: of the tokens selected for prediction, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.
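The 80/10/10 masking rule can be illustrated with a small standalone sketch. This is not the code from dataset.py: the function name mask_tokens, its parameters, and the toy vocabulary are hypothetical, and the real class works with a BertTokenizer rather than plain string lists.

```python
import random


def mask_tokens(tokens, vocab, masked_lm_prob=0.15,
                max_predictions_per_seq=20, rng=None):
    """Hypothetical helper illustrating BERT-style dynamic masking."""
    rng = rng or random.Random()
    # Special tokens are never selected for prediction.
    candidates = [i for i, t in enumerate(tokens)
                  if t not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)
    # Select roughly masked_lm_prob of the tokens, capped per sequence.
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    output = list(tokens)
    positions, labels = [], []
    for idx in sorted(candidates[:num_to_predict]):
        roll = rng.random()
        if roll < 0.8:
            output[idx] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            output[idx] = rng.choice(vocab)  # 10%: replace with a random token
        # Remaining 10%: keep the original token in place.
        positions.append(idx)
        labels.append(tokens[idx])           # the target is always the original
    return output, positions, labels
```

Because the mask is drawn each time the function runs, the same sentence can be masked differently on every pass over the data, which is what "dynamic" refers to here.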

The module also provides utility functions and enums including BatchType for distinguishing between ranking, QP, and pretrain batches, PretrainDataType for NumPy vs validation data modes, BertJobType for task enumeration, and helper functions for encoding sequences with [CLS]/[SEP] tokens, truncating input sequences, and converting data to PyTorch tensors.
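The sequence-encoding helpers can be pictured with the following sketch. It assumes a plain dict in place of the tokenizer, returns Python lists instead of tensors, and truncates from the end of the longer sequence; the names truncate_pair and encode_pair are hypothetical and the module's own encode_sequence and truncate_input_sequence may differ in these details.

```python
def truncate_pair(tokens_a, tokens_b, max_num_tokens):
    """Trim the longer sequence in place until the pair fits the budget."""
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()


def encode_pair(tokens_a, tokens_b, max_seq_len, token_to_id, pad_id=0):
    """Build input IDs, attention mask, and segment IDs for a sentence pair."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    # Reserve three slots for [CLS] and the two [SEP] markers.
    truncate_pair(tokens_a, tokens_b, max_seq_len - 3)
    seq_a = ["[CLS]"] + tokens_a + ["[SEP]"]
    seq_b = tokens_b + ["[SEP]"]
    input_ids = [token_to_id[t] for t in seq_a + seq_b]
    sequence_ids = [0] * len(seq_a) + [1] * len(seq_b)
    input_mask = [1] * len(input_ids)
    # Pad all three lists to the fixed sequence length.
    pad = max_seq_len - len(input_ids)
    input_ids += [pad_id] * pad
    input_mask += [0] * pad
    sequence_ids += [0] * pad
    return input_ids, input_mask, sequence_ids
```

The three returned lists correspond to the input_ids, input_mask, and sequence_ids fields that the dataset classes emit after conversion to tensors.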

Usage

Use these dataset classes when setting up the data pipeline for BERT pretraining or fine-tuning within the Bing BERT / Turing framework. The main training script uses PreTrainingDataset for MLM/NSP pretraining.

Code Reference

Source Location

Repository: Microsoft_DeepSpeedExamples
File: training/bing_bert/turing/dataset.py
Lines: 1-390

Signature

class BatchType(IntEnum)
class PretrainDataType(IntEnum)
class BertJobType(IntEnum)

def get_random_partition(data_directory, index)
def map_to_torch(encoding)
def map_to_torch_float(encoding)
def map_to_torch_half(encoding)
def encode_sequence(seqA, seqB, max_seq_len, tokenizer)
def truncate_input_sequence(tokens_a, tokens_b, max_num_tokens)

class QADataset(Dataset)
class QAFinetuningDataset(QADataset)
class RankingDataset(Dataset)
class PreTrainingDataset(Dataset)

Import

from turing.dataset import (
    PreTrainingDataset, PretrainBatch, PretrainDataType,
    QADataset, RankingDataset, BatchType, BertJobType
)

I/O Contract

Inputs

Name	Type	Required	Description
tokenizer	BertTokenizer	Yes	BERT tokenizer for converting text to token IDs
folder	str	Yes	Path to the data directory containing partitioned data files
logger	Logger	Yes	Logger instance for status messages during data loading
max_seq_length	int	Yes	Maximum sequence length for input encoding
index	int	Yes	Epoch or shard index for selecting data partitions
data_type	PretrainDataType	No	Data source type: NUMPY (default) or VALIDATION
max_predictions_per_seq	int	No	Maximum masked tokens per sequence, default 20

Outputs

Name	Type	Description
batch_type	Tensor	Integer indicating batch type (QP, RANKING, or PRETRAIN)
input_ids	Tensor	Token IDs of shape (seq_length,)
input_mask	Tensor	Attention mask of shape (seq_length,)
sequence_ids	Tensor	Segment IDs of shape (seq_length,)
label	Tensor	Task label (float for QA/ranking, int for NSP)
masked_lm_output	Tensor	Masked LM target positions and labels for pretraining
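One common way to carry masked LM targets in a single per-sequence field is a dense label vector with an ignore value at every unmasked position. The sketch below assumes that layout and an ignore value of -1; neither is confirmed by this page, and the helper name dense_mlm_labels is hypothetical.

```python
def dense_mlm_labels(seq_length, positions, label_ids, ignore_index=-1):
    """Scatter masked-token label IDs into a vector of length seq_length."""
    labels = [ignore_index] * seq_length
    for pos, label_id in zip(positions, label_ids):
        labels[pos] = label_id  # only masked positions carry a real target
    return labels
```

With this layout, a loss function configured to skip the ignore value only scores the masked positions.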

Usage Examples

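import logging

from pytorch_pretrained_bert.tokenization import BertTokenizer
from torch.utils.data import DataLoader
from turing.dataset import PreTrainingDataset, PretrainDataType

# PreTrainingDataset requires a logger for status messages during loading
logger = logging.getLogger(__name__)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

dataset = PreTrainingDataset(
    tokenizer=tokenizer,
    folder="/data/pretraining/",  # directory containing the partitioned data files
    logger=logger,
    max_seq_length=512,
    index=0,  # epoch or shard index used to select a data partition
    data_type=PretrainDataType.NUMPY,
    max_predictions_per_seq=80,  # overrides the default of 20
)

# Wrap the dataset in a DataLoader for batched, shuffled iteration
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)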
