Implementation:Microsoft DeepSpeedExamples BingBert Tokenization

Knowledge Sources	Microsoft_DeepSpeedExamples
Domains	Tokenization, NLP
Last Updated	2026-02-07 12:00 GMT

Overview

BERT tokenization classes for the Bing BERT pipeline. They provide end-to-end text tokenization by combining basic tokenization with WordPiece subword splitting.

Description

This module provides the BertTokenizer class that performs end-to-end tokenization by combining punctuation-based basic tokenization with WordPiece subword splitting. It supports loading vocabulary files from local paths or downloading pretrained vocabularies from HuggingFace model archives for various BERT variants (base/large, cased/uncased, multilingual, Chinese).

The BasicTokenizer handles Unicode normalization, whitespace tokenization, punctuation splitting, accent stripping, and optional lowercasing. It respects a configurable set of never-split tokens (such as [UNK], [SEP], [PAD], [CLS], [MASK]) to preserve special tokens through the tokenization pipeline.
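
The steps listed above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the module's code: the function names (strip_accents, is_punctuation, split_on_punctuation, basic_tokenize) are hypothetical, accent stripping is assumed to apply only when lowercasing is enabled, and any additional text cleanup performed by the real BasicTokenizer is omitted.

```python
import unicodedata

NEVER_SPLIT = ("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")


def strip_accents(text):
    # NFD decomposition separates base characters from combining marks ("Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def is_punctuation(ch):
    cp = ord(ch)
    # Treat non-alphanumeric printable ASCII symbols as punctuation,
    # plus anything in a Unicode "P*" category.
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def split_on_punctuation(token):
    # Emit each punctuation character as its own piece.
    pieces, current = [], ""
    for ch in token:
        if is_punctuation(ch):
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def basic_tokenize(text, do_lower_case=True, never_split=NEVER_SPLIT):
    output = []
    for token in text.strip().split():
        if token in never_split:
            # Special tokens pass through untouched.
            output.append(token)
            continue
        if do_lower_case:
            # Assumption: accents are stripped only on the lowercasing path.
            token = strip_accents(token.lower())
        output.extend(split_on_punctuation(token))
    return output


print(basic_tokenize("H\u00e9llo, world! [SEP]"))
# ['hello', ',', 'world', '!', '[SEP]']
```

Note that in this sketch a special token is preserved only when it stands alone between whitespace; a special token with punctuation attached to it would be split like any other word.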

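The WordpieceTokenizer implements the WordPiece algorithm as a greedy longest-match-first search: starting at the beginning of each token, it repeatedly takes the longest prefix of the remaining characters that appears in the vocabulary, prefixing every piece after the first with "##". This lets the tokenizer handle out-of-vocabulary words by decomposing them into known subword units; a token that cannot be decomposed this way is replaced by unk_token.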
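
A minimal sketch of greedy longest-match-first splitting for a single token is shown below. The function name wordpiece_tokenize and the toy vocabulary are hypothetical; the parameter names mirror the WordpieceTokenizer signature, but the behavior for over-long tokens (returning the unknown token) is an assumption, and the real tokenize() method operates on whitespace-separated text rather than on one token.

```python
def wordpiece_tokenize(token, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    # Assumption: tokens longer than the limit map directly to the unknown token.
    if len(token) > max_input_chars_per_word:
        return [unk_token]

    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No prefix of the remainder is known: the whole token is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only.
toy_vocab = {"un", "##aff", "##able", "##a", "hello"}

print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

Because the search is greedy, "##aff" is chosen over the shorter "##a" even though both are in the toy vocabulary.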

Usage

Use this module for tokenizing text inputs before feeding them into any BERT model within the Bing BERT training or inference pipeline. It is the standard tokenization component required by all BERT model variants in this codebase.

Code Reference

Source Location

Repository: Microsoft_DeepSpeedExamples
File: training/bing_bert/pytorch_pretrained_bert/tokenization.py
Lines: 1-386

Signature

def load_vocab(vocab_file)
def whitespace_tokenize(text)

class BertTokenizer(object):
    def __init__(self, vocab_file, do_lower_case=True, max_len=None, never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"))
    def tokenize(self, text)
    def convert_tokens_to_ids(self, tokens)
    def convert_ids_to_tokens(self, ids)
    @classmethod
    def from_pretrained(cls, pretrained_model_name, cache_dir=None, *inputs, **kwargs)

class BasicTokenizer(object):
    def __init__(self, do_lower_case=True, never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"))
    def tokenize(self, text)

class WordpieceTokenizer(object):
    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200)
    def tokenize(self, text)

Import

from pytorch_pretrained_bert.tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer

I/O Contract

Inputs

Name	Type	Required	Description
vocab_file	str	Yes	Path to a vocabulary file for the BertTokenizer constructor; to load by pretrained model name, call from_pretrained() instead
text	str	Yes	Raw text string to tokenize
do_lower_case	bool	No	Whether to lowercase input text, default True
max_len	int	No	Maximum sequence length, raises error if exceeded
never_split	tuple	No	Special tokens to never split during tokenization

Outputs

Name	Type	Description
tokens	list[str]	List of WordPiece tokens from tokenize()
ids	list[int]	List of vocabulary indices from convert_tokens_to_ids()
vocab	OrderedDict	Token-to-index mapping loaded from vocabulary file
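
The relationship between the vocab mapping and the ids output can be illustrated with a small self-contained sketch. It assumes the vocabulary file holds one token per line, with the line number serving as the index; the helper names (load_vocab_sketch, tokens_to_ids, ids_to_tokens) and the use of ValueError for the max_len check are assumptions made for illustration, not the module's confirmed behavior.

```python
import collections
import os
import tempfile


def load_vocab_sketch(vocab_file):
    # Assumption: one token per line; the zero-based line number is the index.
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab


def tokens_to_ids(tokens, vocab, max_len=None):
    ids = [vocab[token] for token in tokens]
    # Assumption: exceeding max_len is reported with a ValueError.
    if max_len is not None and len(ids) > max_len:
        raise ValueError("sequence length {} exceeds max_len {}".format(len(ids), max_len))
    return ids


def ids_to_tokens(ids, vocab):
    # Invert the token-to-index mapping for the reverse lookup.
    inverse = {index: token for token, index in vocab.items()}
    return [inverse[i] for i in ids]


# Write a toy vocabulary to a temporary file and round-trip a token list.
with tempfile.TemporaryDirectory() as tmp_dir:
    vocab_path = os.path.join(tmp_dir, "vocab.txt")
    with open(vocab_path, "w", encoding="utf-8") as handle:
        handle.write("[PAD]\n[UNK]\nhello\n,\nworld\n")
    vocab = load_vocab_sketch(vocab_path)

ids = tokens_to_ids(["hello", ",", "world"], vocab)
print(ids)                        # [2, 3, 4]
print(ids_to_tokens(ids, vocab))  # ['hello', ',', 'world']
```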

Usage Examples

from pytorch_pretrained_bert.tokenization import BertTokenizer

# Load from pretrained model name
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize text
tokens = tokenizer.tokenize("Hello, how are you?")
# ['hello', ',', 'how', 'are', 'you', '?']

# Convert to IDs
ids = tokenizer.convert_tokens_to_ids(tokens)

# Convert back to tokens
recovered_tokens = tokenizer.convert_ids_to_tokens(ids)
