Implementation:Microsoft DeepSpeedExamples BingBert TuringDataset
| Knowledge Sources | |
|---|---|
| Domains | Data Loading, BERT Pretraining |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Turing dataset classes for BERT pretraining, providing PyTorch Dataset implementations for QA, ranking, and masked language model pretraining tasks.
Description
This module defines multiple PyTorch Dataset implementations used in the Bing BERT / Turing training pipeline. It includes QADataset for query-passage pair prediction, QAFinetuningDataset for fine-tuning on query-passage data, RankingDataset for query-instance ranking tasks, and PreTrainingDataset for masked language modeling (MLM) and next sentence prediction (NSP) pretraining.
PreTrainingDataset is the primary dataset class for BERT pretraining. It loads either NumPy-based training data (via NumpyPretrainingDataCreator) or validation data. It applies masked language modeling dynamically, with a configurable masking probability (default 0.15) and a cap on the number of predictions per sequence. Masking follows the original BERT strategy: of the positions selected for prediction, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% keep the original token.
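The 80/10/10 masking rule can be illustrated with a minimal, standard-library sketch. This is not the code in turing/dataset.py: the function name `mask_tokens`, the string-token representation, the `vocab` list, and the rounding rule for the number of predictions are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, masked_lm_prob=0.15,
                max_predictions_per_seq=20, rng=None):
    """Hypothetical helper: return (masked_tokens, positions, labels)."""
    rng = rng or random.Random()
    # Only ordinary tokens are candidates; special tokens are never masked.
    candidates = [i for i, tok in enumerate(tokens)
                  if tok not in SPECIAL_TOKENS]
    rng.shuffle(candidates)

    # Assumed rule: about masked_lm_prob of the sequence, at least one
    # position, and never more than max_predictions_per_seq.
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    chosen = sorted(candidates[:num_to_predict])

    output = list(tokens)
    labels = []
    for pos in chosen:
        labels.append(tokens[pos])            # original token is the target
        draw = rng.random()
        if draw < 0.8:
            output[pos] = MASK_TOKEN          # 80%: replace with [MASK]
        elif draw < 0.9:
            output[pos] = rng.choice(vocab)   # 10%: replace with random token
        # remaining 10%: leave the original token in place
    return output, chosen, labels
```

Because the positions are drawn each time the function is called, the same sequence can receive a different mask on every pass, which is what makes the masking dynamic rather than fixed at preprocessing time.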
The module also provides supporting enums and helper functions. The enums are BatchType (distinguishes ranking, QP, and pretraining batches), PretrainDataType (selects NumPy or validation data), and BertJobType (enumerates the supported tasks). The helper functions encode sequence pairs with [CLS]/[SEP] tokens, truncate input sequences to a maximum length, and convert data to PyTorch tensors.
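As a rough illustration of pair truncation and [CLS]/[SEP] encoding, the sketch below uses plain Python lists. The names `sketch_truncate_pair` and `sketch_encode_pair`, the dictionary `vocab`, and the random front-or-back trimming are assumptions; the module's own `truncate_input_sequence` and `encode_sequence` take a tokenizer and may differ in detail.

```python
import random


def sketch_truncate_pair(tokens_a, tokens_b, max_num_tokens, rng=None):
    """Trim the longer list in place until the pair fits max_num_tokens."""
    rng = rng or random.Random()
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        # Drop from the front or the back at random to avoid positional bias.
        if rng.random() < 0.5:
            del longer[0]
        else:
            longer.pop()


def sketch_encode_pair(tokens_a, tokens_b, max_seq_len, vocab):
    """Build [CLS] A [SEP] B [SEP] and pad ids, mask, and segment ids."""
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    # Reserve three slots for [CLS] and the two [SEP] tokens.
    sketch_truncate_pair(tokens_a, tokens_b, max_seq_len - 3)

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_ids = [vocab[t] for t in tokens]
    input_mask = [1] * len(input_ids)      # 1 marks real tokens

    pad = max_seq_len - len(input_ids)
    input_ids += [vocab["[PAD]"]] * pad
    input_mask += [0] * pad                # 0 marks padding
    segment_ids += [0] * pad
    return input_ids, input_mask, segment_ids
```

The three returned lists all have length `max_seq_len`, matching the fixed-shape input_ids, input_mask, and sequence_ids described in the I/O contract.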
Usage
Use these dataset classes when setting up the data pipeline for BERT pretraining or fine-tuning within the Bing BERT / Turing framework. The main training script uses PreTrainingDataset for MLM/NSP pretraining.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: training/bing_bert/turing/dataset.py
- Lines: 1-390
Signature
class BatchType(IntEnum)
class PretrainDataType(IntEnum)
class BertJobType(IntEnum)
def get_random_partition(data_directory, index)
def map_to_torch(encoding)
def map_to_torch_float(encoding)
def map_to_torch_half(encoding)
def encode_sequence(seqA, seqB, max_seq_len, tokenizer)
def truncate_input_sequence(tokens_a, tokens_b, max_num_tokens)
class QADataset(Dataset)
class QAFinetuningDataset(QADataset)
class RankingDataset(Dataset)
class PreTrainingDataset(Dataset)
Import
from turing.dataset import (
    PreTrainingDataset, PretrainBatch, PretrainDataType,
    QADataset, RankingDataset, BatchType, BertJobType
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer | BertTokenizer | Yes | BERT tokenizer for converting text to token IDs |
| folder | str | Yes | Path to the data directory containing partitioned data files |
| logger | Logger | Yes | Logger instance for status messages during data loading |
| max_seq_length | int | Yes | Maximum sequence length for input encoding |
| index | int | Yes | Epoch or shard index for selecting data partitions |
| data_type | PretrainDataType | No | Data source type: NUMPY (default) or VALIDATION |
| max_predictions_per_seq | int | No | Maximum number of masked tokens per sequence (default 20) |
Outputs
| Name | Type | Description |
|---|---|---|
| batch_type | Tensor | Integer indicating batch type (QP, RANKING, or PRETRAIN) |
| input_ids | Tensor | Token IDs of shape (seq_length,) |
| input_mask | Tensor | Attention mask of shape (seq_length,) |
| sequence_ids | Tensor | Segment IDs of shape (seq_length,) |
| label | Tensor | Task label (float for QA/ranking, int for NSP) |
| masked_lm_output | Tensor | Masked LM target positions and labels for pretraining |
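
Since every sample leads with a batch type, a consumer can branch on it to interpret the remaining fields. The sketch below is illustrative only: `SketchBatchType`, its integer values, and `describe_sample` are hypothetical stand-ins, and it assumes that only pretraining samples carry the trailing masked_lm_output field. Plain Python values stand in for tensors.

```python
from enum import IntEnum


class SketchBatchType(IntEnum):
    # Hypothetical values; check turing/dataset.py for the real members.
    RANKING = 0
    QP = 1
    PRETRAIN = 2


def describe_sample(sample):
    """Name the fields of one sample based on its leading batch type."""
    batch_type = SketchBatchType(int(sample[0]))
    names = ["batch_type", "input_ids", "input_mask", "sequence_ids", "label"]
    if batch_type is SketchBatchType.PRETRAIN:
        # Assumption: only pretraining samples include MLM targets.
        names.append("masked_lm_output")
    return dict(zip(names, sample))
```

Naming the fields this way keeps downstream code from depending on positional indices into the sample.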
Usage Examples
import logging

from pytorch_pretrained_bert.tokenization import BertTokenizer
from torch.utils.data import DataLoader
from turing.dataset import PreTrainingDataset, PretrainDataType

# Standard-library logger; the dataset reports loading status through it
logger = logging.getLogger(__name__)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
dataset = PreTrainingDataset(
    tokenizer=tokenizer,
    folder="/data/pretraining/",
    logger=logger,
    max_seq_length=512,
    index=0,
    data_type=PretrainDataType.NUMPY,
    max_predictions_per_seq=80
)

# Use with DataLoader
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)