Implementation:Microsoft DeepSpeedExamples BingBert TuringDataset

From Leeroopedia
Revision as of 15:40, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Microsoft_DeepSpeedExamples_BingBert_TuringDataset.md)


Knowledge Sources
Domains: Data Loading, BERT Pretraining
Last Updated: 2026-02-07 12:00 GMT

Overview

Turing dataset classes for BERT pretraining, providing PyTorch Dataset implementations for QA, ranking, and masked language model pretraining tasks.

Description

This module defines multiple PyTorch Dataset implementations used in the Bing BERT / Turing training pipeline. It includes QADataset for query-passage pair prediction, QAFinetuningDataset for fine-tuning on query-passage data, RankingDataset for query-instance ranking tasks, and PreTrainingDataset for masked language modeling (MLM) and next sentence prediction (NSP) pretraining.

PreTrainingDataset is the primary dataset class for BERT pretraining. It loads either NumPy-based training data (via NumpyPretrainingDataCreator) or validation data, selected by PretrainDataType. It implements dynamic masked language modeling with a configurable masking probability (default 0.15) and a cap on the number of predictions per sequence. Masking follows the original BERT strategy: of the positions selected for prediction, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% keep the original token.
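The 80/10/10 rule described above can be illustrated with a minimal, standard-library sketch. The function name mask_tokens, its parameters, and the use of string tokens are assumptions made for illustration; this is not the code used by PreTrainingDataset, which works with its own tokenizer and data structures.

```python
import random


def mask_tokens(tokens, vocab, masked_lm_prob=0.15,
                max_predictions_per_seq=20, rng=None):
    """Hypothetical helper: return (masked_tokens, positions, labels)."""
    rng = rng or random.Random()

    # Special tokens are never candidates for masking.
    candidates = [i for i, tok in enumerate(tokens)
                  if tok not in ("[CLS]", "[SEP]")]
    rng.shuffle(candidates)

    # Predict at least one token, but never more than the configured cap.
    num_to_predict = min(max_predictions_per_seq,
                         max(1, int(round(len(tokens) * masked_lm_prob))))
    positions = sorted(candidates[:num_to_predict])

    masked = list(tokens)
    labels = []
    for pos in positions:
        labels.append(tokens[pos])  # the label is always the original token
        roll = rng.random()
        if roll < 0.8:
            masked[pos] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[pos] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return masked, positions, labels
```

Because the selection is random each time the function is called, the same sequence can receive a different mask on every pass, which is the point of masking dynamically rather than once at preprocessing time.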

The module also provides enums and utility functions. BatchType distinguishes ranking, QP (query-passage), and pretraining batches; PretrainDataType selects NumPy or validation data; BertJobType enumerates the task types. Helper functions encode sequence pairs with [CLS]/[SEP] tokens, truncate input sequences to a maximum length, and convert data to PyTorch tensors.
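As a rough illustration of what pair encoding and truncation involve, the sketch below builds the token list, attention mask, and segment IDs for two token lists. The names truncate_pair and encode_pair are hypothetical, the sketch works on string tokens rather than token IDs, and it always trims from the end of the longer sequence; the module's own encode_sequence and truncate_input_sequence may behave differently.

```python
def truncate_pair(tokens_a, tokens_b, max_num_tokens):
    """Trim the longer list from its end, in place, until the pair fits."""
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()


def encode_pair(tokens_a, tokens_b, max_seq_len):
    """Return (tokens, input_mask, segment_ids), each of length max_seq_len."""
    if max_seq_len < 3:
        raise ValueError("max_seq_len must leave room for [CLS] and two [SEP]")

    # Copy so the caller's lists are not modified.
    tokens_a, tokens_b = list(tokens_a), list(tokens_b)
    truncate_pair(tokens_a, tokens_b, max_seq_len - 3)

    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sequence A, and the first [SEP];
    # segment 1 covers sequence B and the final [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    input_mask = [1] * len(tokens)  # 1 marks real tokens

    # Pad all three lists to the fixed length; padding gets mask 0.
    pad = max_seq_len - len(tokens)
    tokens += ["[PAD]"] * pad
    segment_ids += [0] * pad
    input_mask += [0] * pad
    return tokens, input_mask, segment_ids
```

For example, encode_pair(["q1", "q2"], ["p1", "p2", "p3", "p4", "p5"], 8) trims the second sequence to three tokens so that the encoded pair fits in eight positions.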

Usage

Use these dataset classes when setting up the data pipeline for BERT pretraining or finetuning within the Bing BERT / Turing framework. The PreTrainingDataset is used by the main training script for MLM/NSP pretraining.

Code Reference

Source Location

Signature

class BatchType(IntEnum)
class PretrainDataType(IntEnum)
class BertJobType(IntEnum)

def get_random_partition(data_directory, index)
def map_to_torch(encoding)
def map_to_torch_float(encoding)
def map_to_torch_half(encoding)
def encode_sequence(seqA, seqB, max_seq_len, tokenizer)
def truncate_input_sequence(tokens_a, tokens_b, max_num_tokens)

class QADataset(Dataset)
class QAFinetuningDataset(QADataset)
class RankingDataset(Dataset)
class PreTrainingDataset(Dataset)

Import

from turing.dataset import (
    PreTrainingDataset, PretrainBatch, PretrainDataType,
    QADataset, RankingDataset, BatchType, BertJobType
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| tokenizer | BertTokenizer | Yes | BERT tokenizer for converting text to token IDs |
| folder | str | Yes | Path to the data directory containing partitioned data files |
| logger | Logger | Yes | Logger instance for status messages during data loading |
| max_seq_length | int | Yes | Maximum sequence length for input encoding |
| index | int | Yes | Epoch or shard index for selecting data partitions |
| data_type | PretrainDataType | No | Data source type: NUMPY (default) or VALIDATION |
| max_predictions_per_seq | int | No | Maximum masked tokens per sequence, default 20 |
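The index input selects which partition file in folder is loaded. One plausible way to map an index to a file is to sort the directory listing and wrap the index with a modulo, as sketched below. The function pick_partition is hypothetical and only illustrates the idea; it is an assumption about the behavior and not a copy of get_random_partition.

```python
import os


def pick_partition(data_directory, index):
    """Hypothetical helper: choose a partition file deterministically."""
    # Sorting makes the choice reproducible across runs and machines.
    partitions = sorted(
        os.path.join(data_directory, name)
        for name in os.listdir(data_directory)
    )
    if not partitions:
        raise FileNotFoundError("no partitions found in " + data_directory)
    # Wrap around so any non-negative index maps to an existing file.
    return partitions[index % len(partitions)]
```

With this scheme, passing the epoch number as index cycles through the partitions in order and starts over once every file has been used.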

Outputs

| Name | Type | Description |
|------|------|-------------|
| batch_type | Tensor | Integer indicating batch type (QP, RANKING, or PRETRAIN) |
| input_ids | Tensor | Token IDs of shape (seq_length,) |
| input_mask | Tensor | Attention mask of shape (seq_length,) |
| sequence_ids | Tensor | Segment IDs of shape (seq_length,) |
| label | Tensor | Task label (float for QA/ranking, int for NSP) |
| masked_lm_output | Tensor | Masked LM target positions and labels for pretraining |

Usage Examples

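import logging

from pytorch_pretrained_bert.tokenization import BertTokenizer
from torch.utils.data import DataLoader
from turing.dataset import PreTrainingDataset, PretrainDataType

# The dataset requires a logger; a standard-library logger is used here.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

dataset = PreTrainingDataset(
    tokenizer=tokenizer,
    folder="/data/pretraining/",
    logger=logger,
    max_seq_length=512,
    index=0,  # selects the data partition
    data_type=PretrainDataType.NUMPY,
    max_predictions_per_seq=80,  # overrides the default of 20
)

# Wrap the dataset in a DataLoader for batching and shuffling
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)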
