Implementation:Microsoft DeepSpeedExamples BingBert TuringSources
| Knowledge Sources | |
|---|---|
| Domains | Data Preprocessing, BERT Pretraining |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
Turing data source classes for creating BERT pretraining instances from raw text corpora, including Wikipedia, BookCorpus, and NumPy-serialized datasets.
Description
This module provides data source creator classes that transform raw text documents into BERT pretraining instances: pairs of token segments, each carrying a next-sentence prediction label. The base PretrainingDataCreator reads documents separated by "<sep>" delimiters, tokenizes them, and splits each document into segment pairs, producing both real next-sentence and random next-sentence examples.
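The pairing idea can be illustrated with a small sketch. It is a simplification and not the module's actual code: `make_nsp_pairs` is a hypothetical helper, it pairs single sentences rather than packing several sentences into each segment, and the 50/50 split between real and random pairs is an assumption borrowed from the usual BERT recipe. Labels follow the convention in the Outputs table (0 for a real continuation, 1 for a random one).

```python
import random


def make_nsp_pairs(documents, doc_index, rng):
    """Build (tokens_a, tokens_b, is_next) tuples from one tokenized document.

    `documents` is a list of documents, each document is a list of
    sentences, and each sentence is a list of tokens.
    """
    document = documents[doc_index]
    pairs = []
    for i in range(len(document) - 1):
        tokens_a = list(document[i])
        if len(documents) > 1 and rng.random() < 0.5:
            # Random next: sample a sentence from a different document.
            other_index = rng.choice(
                [j for j in range(len(documents)) if j != doc_index])
            tokens_b = list(rng.choice(documents[other_index]))
            is_next = 1  # 1 = tokens_b is random
        else:
            # Real next: the sentence that actually follows tokens_a.
            tokens_b = list(document[i + 1])
            is_next = 0  # 0 = tokens_b follows tokens_a
        pairs.append((tokens_a, tokens_b, is_next))
    return pairs


# Toy, already-tokenized corpus (hypothetical data).
documents = [
    [["the", "cat", "sat"], ["it", "purred"], ["then", "it", "slept"]],
    [["stocks", "rose"], ["bonds", "fell"]],
]
pairs = make_nsp_pairs(documents, doc_index=0, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sketch reproducible; a real pipeline would typically also cap the combined length of the two segments.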
Several specialized data creators extend the base class: CleanBodyDataCreator for cleaned web body text, WikiNBookCorpusPretrainingDataCreator for combined Wikipedia and BookCorpus data, and WikiPretrainingDataCreator for Wikipedia-specific preprocessing. Each handles the unique formatting and structure of its source corpus while producing the same TokenInstance output format.
The module also includes NumpyPretrainingDataCreator and NumpyByteInstances for efficient loading of pre-serialized training data stored in NumPy binary format. Additionally, QueryPassageDataset, QueryPassageFineTuningDataset, and QueryInstanceDataset classes handle loading of query-passage pair data for QA and ranking tasks from tab-separated text files.
Usage
Use these data source classes when preparing pretraining data for BERT from various text corpora. They are consumed by the dataset classes in turing/dataset.py to create PyTorch-compatible data loading pipelines.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: training/bing_bert/turing/sources.py
- Lines: 1-509
Signature
def truncate_input_sequence(tokens_a, tokens_b, max_num_tokens)

class TokenInstance:
    def __init__(self, tokens_a, tokens_b, is_next, lang="en")
    def get_values(self)
    def get_lang(self)

class QueryPassageDataset
class QueryPassageFineTuningDataset
class QueryInstanceDataset

class PretrainingDataCreator:
    def __init__(self, path, tokenizer, max_seq_length, readin=2000000, dupe_factor=5, small_seq_prob=0.1)
    def create_training_instance(self, index)
    def save(self, filename)
    @staticmethod
    def load(filename)

class CleanBodyDataCreator(PretrainingDataCreator)
class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator)
class WikiPretrainingDataCreator(PretrainingDataCreator)

class NumpyByteInstances
class NumpyPretrainingDataCreator
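The signature of truncate_input_sequence suggests the pair-truncation step familiar from the original BERT preprocessing code. The sketch below shows that general technique under the assumption that this module behaves similarly; `truncate_pair` is a hypothetical name used here so the sketch is not mistaken for the module's own function.

```python
import random


def truncate_pair(tokens_a, tokens_b, max_num_tokens, rng=random):
    """Trim both lists in place until their combined length fits.

    Assumes max_num_tokens is non-negative.
    """
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        # Always shorten the longer segment so the pair stays balanced.
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        # Drop from the front or the back at random to avoid positional bias.
        if rng.random() < 0.5:
            del longer[0]
        else:
            longer.pop()


a = list("abcdefgh")  # 8 tokens
b = list("xyz")       # 3 tokens
truncate_pair(a, b, max_num_tokens=6, rng=random.Random(0))
```

Because only the longer list is ever shortened, the shorter segment here is left untouched and the pair ends up with three tokens on each side.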
Import
from turing.sources import (
    PretrainingDataCreator, WikiPretrainingDataCreator,
    TokenInstance, QueryPassageDataset,
    NumpyPretrainingDataCreator, CleanBodyDataCreator
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the raw text file or data directory |
| tokenizer | BertTokenizer | Yes | BERT tokenizer for converting text to subword tokens |
| max_seq_length | int | Yes | Maximum total sequence length including special tokens |
| readin | int | No | Maximum number of lines to read from file, default 2000000 |
| dupe_factor | int | No | Number of times instance generation is repeated over the documents, so that each pass can draw different random pairs, default 5 |
| small_seq_prob | float | No | Probability of generating shorter sequences, default 0.1 |
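How max_seq_length and small_seq_prob typically interact can be sketched as follows. This follows the convention of the original BERT preprocessing code and is an assumption about this module, not a copy of it: `pick_target_length` and its `min_length` parameter are hypothetical, and three positions are reserved for the `[CLS]`, `[SEP]`, `[SEP]` special tokens.

```python
import random


def pick_target_length(max_seq_length, small_seq_prob, rng, min_length=2):
    """Choose how many real tokens a single training instance should hold."""
    # Reserve three positions for [CLS], [SEP], [SEP].
    max_num_tokens = max_seq_length - 3
    if rng.random() < small_seq_prob:
        # Occasionally build a shorter instance so the model also sees
        # inputs well below the maximum length.
        return rng.randint(min_length, max_num_tokens)
    return max_num_tokens


rng = random.Random(0)
lengths = [pick_target_length(128, 0.1, rng) for _ in range(1000)]
# Fraction of instances that came out shorter than the full budget.
short_fraction = sum(n < 125 for n in lengths) / len(lengths)
```

With small_seq_prob=0.1, roughly one instance in ten gets a randomly chosen shorter budget; the rest use the full max_seq_length - 3 tokens.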
Outputs
| Name | Type | Description |
|---|---|---|
| instances | list[TokenInstance] | List of pretraining instances, each holding tokens_a, tokens_b, and an is_next label |
| tokens_a | list[str] | First segment tokens for a training instance |
| tokens_b | list[str] | Second segment tokens for a training instance |
| is_next | int | Label: 0 if tokens_b follows tokens_a, 1 if random |
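To show how these outputs are usually consumed, here is a minimal sketch that turns one instance into the standard BERT input layout. The `TokenInstance` below is a stand-in that only mirrors the documented signature, and `to_bert_input` is a hypothetical helper; the actual conversion performed by the dataset classes may differ.

```python
class TokenInstance:
    """Stand-in mirroring the documented signature (not the real class)."""

    def __init__(self, tokens_a, tokens_b, is_next, lang="en"):
        self.tokens_a = tokens_a
        self.tokens_b = tokens_b
        self.is_next = is_next  # 0 = real continuation, 1 = random
        self.lang = lang

    def get_values(self):
        return (self.tokens_a, self.tokens_b, self.is_next)

    def get_lang(self):
        return self.lang


def to_bert_input(instance):
    """Lay out one instance as [CLS] A [SEP] B [SEP] with segment ids."""
    tokens_a, tokens_b, is_next = instance.get_values()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], A, and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids, is_next


inst = TokenInstance(["hello", "world"], ["good", "bye"], is_next=0)
tokens, segment_ids, label = to_bert_input(inst)
```

The three special tokens added here are the reason max_seq_length is described as including special tokens.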
Usage Examples
from turing.sources import PretrainingDataCreator, WikiPretrainingDataCreator
from pytorch_pretrained_bert.tokenization import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Create pretraining data from raw text
creator = WikiPretrainingDataCreator(
    path="/data/wikipedia/wiki_text.txt",
    tokenizer=tokenizer,
    max_seq_length=512,
    dupe_factor=5
)

# Access instances
for i in range(len(creator)):
    tokens_a, tokens_b, is_next = creator.instances[i].get_values()

# Save and reload processed data
creator.save("/data/processed/wiki_instances.pkl")
loaded_creator = PretrainingDataCreator.load("/data/processed/wiki_instances.pkl")