Implementation:Microsoft DeepSpeedExamples BingBert TuringSources

From Leeroopedia


Knowledge Sources
Domains Data Preprocessing, BERT Pretraining
Last Updated 2026-02-07 12:00 GMT

Overview

Turing data source classes for creating BERT pretraining instances from raw text corpora, including Wikipedia, BookCorpus, and NumPy-serialized datasets.

Description

This module provides data creator classes that transform raw text documents into BERT pretraining instances: pairs of token segments, each carrying a next-sentence prediction label. The base class, PretrainingDataCreator, reads documents separated by "<sep>" delimiters, tokenizes them, and splits each document into segment pairs, producing both real next-sentence and random next-sentence examples.
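
The pairing of real and random next segments described above can be sketched as follows. This is a simplified, sentence-level illustration and not the module's code: the function name make_nsp_pairs, the 50% sampling rate, and the toy documents are assumptions, and the label convention (0 for a real continuation, 1 for a random segment) follows the Outputs table on this page.

```python
import random


def make_nsp_pairs(documents, doc_index, rng):
    """Pair each sentence with its successor or, about half the time, a random sentence.

    `documents` is a list of documents; each document is a list of token lists.
    Hypothetical helper for illustration only.
    """
    document = documents[doc_index]
    pairs = []
    for i in range(len(document) - 1):
        tokens_a = document[i]
        if len(documents) > 1 and rng.random() < 0.5:
            # Random next: draw a sentence from a different document.
            others = [j for j in range(len(documents)) if j != doc_index]
            tokens_b = rng.choice(documents[rng.choice(others)])
            is_next = 1
        else:
            # Real next: the sentence that actually follows tokens_a.
            tokens_b = document[i + 1]
            is_next = 0
        pairs.append((tokens_a, tokens_b, is_next))
    return pairs


docs = [
    [["the", "cat"], ["sat", "down"], ["then", "slept"]],
    [["stocks", "rose"], ["bonds", "fell"]],
]
pairs = make_nsp_pairs(docs, 0, random.Random(0))
```

A full pretraining pipeline would typically pack several sentences into each segment up to a target length; the sketch keeps one sentence per segment so the labeling logic stays visible.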

Several specialized data creators extend the base class: CleanBodyDataCreator for cleaned web body text, WikiNBookCorpusPretrainingDataCreator for combined Wikipedia and BookCorpus data, and WikiPretrainingDataCreator for Wikipedia-specific preprocessing. Each handles the unique formatting and structure of its source corpus while producing the same TokenInstance output format.

The module also includes NumpyPretrainingDataCreator and NumpyByteInstances for efficient loading of pre-serialized training data stored in NumPy binary format. Additionally, QueryPassageDataset, QueryPassageFineTuningDataset, and QueryInstanceDataset classes handle loading of query-passage pair data for QA and ranking tasks from tab-separated text files.
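
As a rough illustration of reading query-passage pairs from tab-separated text, the sketch below parses lines into (query, passage) tuples. The two-column layout, the helper name read_query_passage_pairs, and the policy of skipping malformed lines are all assumptions; the actual column order and error handling used by QueryPassageDataset are not documented on this page.

```python
def read_query_passage_pairs(lines):
    """Parse tab-separated lines into (query, passage) tuples.

    Assumes column 0 is the query and column 1 is the passage (hypothetical layout).
    Blank lines and lines with fewer than two fields are skipped.
    """
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        pairs.append((fields[0], fields[1]))
    return pairs


sample = [
    "what is bert\tBERT is a language model.\n",
    "\n",
    "malformed line without a tab\n",
]
qp_pairs = read_query_passage_pairs(sample)
```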

Usage

Use these data source classes when preparing pretraining data for BERT from various text corpora. They are consumed by the dataset classes in turing/dataset.py to create PyTorch-compatible data loading pipelines.

Code Reference

Source Location

Signature

def truncate_input_sequence(tokens_a, tokens_b, max_num_tokens)

class TokenInstance:
    def __init__(self, tokens_a, tokens_b, is_next, lang="en")
    def get_values(self)
    def get_lang(self)

class QueryPassageDataset
class QueryPassageFineTuningDataset
class QueryInstanceDataset

class PretrainingDataCreator:
    def __init__(self, path, tokenizer, max_seq_length, readin=2000000, dupe_factor=5, small_seq_prob=0.1)
    def create_training_instance(self, index)
    def save(self, filename)
    @staticmethod
    def load(filename)

class CleanBodyDataCreator(PretrainingDataCreator)
class WikiNBookCorpusPretrainingDataCreator(PretrainingDataCreator)
class WikiPretrainingDataCreator(PretrainingDataCreator)
class NumpyByteInstances
class NumpyPretrainingDataCreator
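
To make the role of truncate_input_sequence concrete, here is a minimal sketch of pair truncation, modeled on the truncation helper in the original BERT pretraining script. It is an assumption that the module follows the same strategy; the name truncate_pair and the rng parameter are hypothetical.

```python
import random


def truncate_pair(tokens_a, tokens_b, max_num_tokens, rng=random):
    """Trim the two token lists in place until their combined length fits.

    The longer list loses one token per step, taken at random from its front
    or back so that truncation does not always discard the same region.
    """
    while len(tokens_a) + len(tokens_b) > max_num_tokens:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        if rng.random() < 0.5:
            del longer[0]
        else:
            longer.pop()


seg_a = list("abcdefgh")  # 8 tokens
seg_b = list("xyz")       # 3 tokens
truncate_pair(seg_a, seg_b, 6, random.Random(0))
```

In this example only seg_a is shortened, because it stays the longer list until the combined length reaches the limit.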

Import

from turing.sources import (
    PretrainingDataCreator, WikiPretrainingDataCreator,
    TokenInstance, QueryPassageDataset,
    NumpyPretrainingDataCreator, CleanBodyDataCreator
)

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| path | str | Yes | Path to the raw text file or data directory |
| tokenizer | BertTokenizer | Yes | BERT tokenizer for converting text to subword tokens |
| max_seq_length | int | Yes | Maximum total sequence length including special tokens |
| readin | int | No | Maximum number of lines to read from file, default 2000000 |
| dupe_factor | int | No | Number of times to duplicate training instances for variety, default 5 |
| small_seq_prob | float | No | Probability of generating shorter sequences, default 0.1 |
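
The interaction between max_seq_length and small_seq_prob can be sketched as below. This follows the recipe of the original BERT pretraining script and is an assumption about this module: three positions are reserved for [CLS] and two [SEP] tokens, and with probability small_seq_prob a shorter target length is drawn uniformly. The helper name choose_target_length is hypothetical.

```python
import random


def choose_target_length(max_seq_length, small_seq_prob, rng):
    """Pick the token budget for one training instance."""
    # Reserve room for the [CLS], [SEP], [SEP] special tokens.
    max_num_tokens = max_seq_length - 3
    if rng.random() < small_seq_prob:
        # Occasionally target a shorter sequence.
        return rng.randint(2, max_num_tokens)
    return max_num_tokens


rng = random.Random(0)
lengths = [choose_target_length(128, 0.1, rng) for _ in range(1000)]
short_fraction = sum(n < 125 for n in lengths) / len(lengths)
```

Mixing in shorter sequences is commonly motivated by reducing the mismatch between pretraining, where inputs are packed to full length, and fine-tuning, where inputs are often short.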

Outputs

| Name | Type | Description |
|---|---|---|
| instances | list[TokenInstance] | List of pretraining instances, each holding tokens_a, tokens_b, and an is_next label |
| tokens_a | list[str] | First segment tokens for a training instance |
| tokens_b | list[str] | Second segment tokens for a training instance |
| is_next | int | Label: 0 if tokens_b is the actual continuation of tokens_a, 1 if tokens_b was chosen at random |
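
The sketch below shows how a consumer might turn one instance into a BERT input sequence. TokenInstanceSketch is a hypothetical stand-in that mirrors the TokenInstance signature listed in the Signature section; to_bert_input is also hypothetical and only illustrates the standard [CLS] A [SEP] B [SEP] layout, not the code in turing/dataset.py.

```python
class TokenInstanceSketch:
    """Hypothetical stand-in with the same methods as the listed TokenInstance."""

    def __init__(self, tokens_a, tokens_b, is_next, lang="en"):
        self.tokens_a = tokens_a
        self.tokens_b = tokens_b
        self.is_next = is_next  # 0: real continuation, 1: random segment
        self.lang = lang

    def get_values(self):
        return self.tokens_a, self.tokens_b, self.is_next

    def get_lang(self):
        return self.lang


def to_bert_input(instance):
    """Build the token sequence and segment ids for one instance."""
    tokens_a, tokens_b, is_next = instance.get_values()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], tokens_a and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids, is_next


inst = TokenInstanceSketch(["hello", "world"], ["good", "##bye"], is_next=0)
tokens, segment_ids, label = to_bert_input(inst)
```

The three added special tokens are why the token budget for the two segments is smaller than max_seq_length.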

Usage Examples

from turing.sources import PretrainingDataCreator, WikiPretrainingDataCreator
from pytorch_pretrained_bert.tokenization import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Create pretraining data from raw text
creator = WikiPretrainingDataCreator(
    path="/data/wikipedia/wiki_text.txt",
    tokenizer=tokenizer,
    max_seq_length=512,
    dupe_factor=5
)

# Access instances
for i in range(len(creator)):
    tokens_a, tokens_b, is_next = creator.instances[i].get_values()

# Save and reload processed data
creator.save("/data/processed/wiki_instances.pkl")
loaded_creator = PretrainingDataCreator.load("/data/processed/wiki_instances.pkl")
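
A save/load round trip of this kind can be sketched with the standard pickle module. The use of pickle is an assumption suggested by the .pkl extension in the example above, and InstanceStore is a hypothetical container, not the module's class. The sketch uses a classmethod and pickles only the object's plain state so that it runs regardless of where the class is defined.

```python
import os
import pickle
import tempfile


class InstanceStore:
    """Hypothetical container that persists its instances with pickle."""

    def __init__(self, instances):
        self.instances = instances

    def __len__(self):
        return len(self.instances)

    def save(self, filename):
        # Write the object's state (a plain dict) to disk.
        with open(filename, "wb") as f:
            pickle.dump(self.__dict__, f)

    @classmethod
    def load(cls, filename):
        # Rebuild an object from the saved state without re-running __init__.
        with open(filename, "rb") as f:
            state = pickle.load(f)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj


store = InstanceStore([(["a"], ["b"], 0), (["c"], ["d"], 1)])
with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "instances.pkl")
    store.save(pkl_path)
    restored = InstanceStore.load(pkl_path)
```

Caching processed instances this way avoids re-tokenizing the corpus on every run. Pickle files can execute arbitrary code when loaded, so only load files from trusted sources.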
