Implementation:Microsoft DeepSpeedExamples BingBert Tokenization
| Knowledge Sources | |
|---|---|
| Domains | Tokenization, NLP |
| Last Updated | 2026-02-07 12:00 GMT |
Overview
This module implements the BERT tokenization classes used by the Bing BERT pipeline. Tokenization is end-to-end: basic tokenization followed by WordPiece subword splitting.
Description
This module provides the BertTokenizer class that performs end-to-end tokenization by combining punctuation-based basic tokenization with WordPiece subword splitting. It supports loading vocabulary files from local paths or downloading pretrained vocabularies from HuggingFace model archives for various BERT variants (base/large, cased/uncased, multilingual, Chinese).
The BasicTokenizer handles Unicode normalization, whitespace tokenization, and punctuation splitting. When do_lower_case is enabled, it also lowercases tokens and strips accents. A configurable set of never-split tokens (by default [UNK], [SEP], [PAD], [CLS], [MASK]) is left untouched so that special tokens pass through the tokenization pipeline intact.
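The steps above can be illustrated with a minimal sketch. It is not the module's code: `basic_tokenize`, `strip_accents`, and `split_on_punctuation` are hypothetical names, and the sketch assumes punctuation means Unicode categories starting with "P", which may be narrower than what the real BasicTokenizer treats as punctuation.

```python
import unicodedata

# Default never-split tokens, taken from the signature documented on this page.
NEVER_SPLIT = ("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")


def strip_accents(text):
    # NFD decomposition separates base characters from combining marks,
    # which carry the Unicode category "Mn" and are dropped here.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def split_on_punctuation(token):
    # Each punctuation character becomes its own token.
    # Assumption: "punctuation" is any character in a Unicode "P*" category.
    pieces, current = [], ""
    for ch in token:
        if unicodedata.category(ch).startswith("P"):
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def basic_tokenize(text, do_lower_case=True, never_split=NEVER_SPLIT):
    output = []
    for token in text.split():  # whitespace tokenization
        if token in never_split:
            output.append(token)  # special tokens pass through unchanged
            continue
        if do_lower_case:
            token = strip_accents(token.lower())
        output.extend(split_on_punctuation(token))
    return output


print(basic_tokenize("Héllo, world! [SEP]"))
# ['hello', ',', 'world', '!', '[SEP]']
```

Without the never-split check, "[SEP]" would be broken into "[", "sep", and "]" by the lowercasing and punctuation steps.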
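The WordpieceTokenizer implements the WordPiece algorithm: scanning each token from left to right, it greedily takes the longest prefix of the remaining characters that appears in the vocabulary, prefixing every piece after the first with "##". Out-of-vocabulary words are thereby decomposed into known subword units. A token that cannot be fully decomposed, or that is longer than max_input_chars_per_word characters, is replaced by unk_token.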
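A minimal sketch of greedy longest-match-first splitting follows. The function name `wordpiece_tokenize` and the toy vocabulary are hypothetical. Unlike WordpieceTokenizer.tokenize(), which accepts text, this sketch assumes its input is a single token that has already been split on whitespace.

```python
def wordpiece_tokenize(token, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    """Split one whitespace-free token into WordPiece pieces (illustrative sketch)."""
    if len(token) > max_input_chars_per_word:
        return [unk_token]

    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole token is unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary for illustration only.
toy_vocab = {"un", "##aff", "##able", "hello"}

print(wordpiece_tokenize("unaffable", toy_vocab))  # ['un', '##aff', '##able']
print(wordpiece_tokenize("xyz", toy_vocab))        # ['[UNK]']
```

Because the match is greedy, the result depends only on the vocabulary contents. No scores or merge rules are consulted at tokenization time.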
Usage
Use this module for tokenizing text inputs before feeding them into any BERT model within the Bing BERT training or inference pipeline. It is the standard tokenization component required by all BERT model variants in this codebase.
Code Reference
Source Location
- Repository: Microsoft_DeepSpeedExamples
- File: training/bing_bert/pytorch_pretrained_bert/tokenization.py
- Lines: 1-386
Signature
def load_vocab(vocab_file)
def whitespace_tokenize(text)

class BertTokenizer(object):
    def __init__(self, vocab_file, do_lower_case=True, max_len=None, never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"))
    def tokenize(self, text)
    def convert_tokens_to_ids(self, tokens)
    def convert_ids_to_tokens(self, ids)
    @classmethod
    def from_pretrained(cls, pretrained_model_name, cache_dir=None, *inputs, **kwargs)

class BasicTokenizer(object):
    def __init__(self, do_lower_case=True, never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"))
    def tokenize(self, text)

class WordpieceTokenizer(object):
    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200)
    def tokenize(self, text)
Import
from pytorch_pretrained_bert.tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| vocab_file | str | Yes | Path to the vocabulary file passed to BertTokenizer(); from_pretrained() takes a pretrained model name (pretrained_model_name) instead |
| text | str | Yes | Raw text string to tokenize |
| do_lower_case | bool | No | Whether to lowercase input text (default: True) |
| max_len | int | No | Maximum sequence length (default: None); an error is raised if it is exceeded |
| never_split | tuple | No | Special tokens to never split during tokenization |
Outputs
| Name | Type | Description |
|---|---|---|
| tokens | list[str] | List of WordPiece tokens from tokenize() |
| ids | list[int] | List of vocabulary indices from convert_tokens_to_ids() |
| vocab | OrderedDict | Token-to-index mapping loaded from vocabulary file |
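The token-to-index mapping in the last row can be sketched as below. This assumes the common BERT vocabulary layout of one token per line, with the zero-based line number serving as the index. `load_vocab_sketch` is a hypothetical stand-in, not the module's load_vocab, and the four-token vocabulary is made up for the example.

```python
import collections
import os
import tempfile


def load_vocab_sketch(vocab_file):
    """Read a one-token-per-line file into an ordered token -> index mapping."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab


# Write a tiny vocabulary to a temporary file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "vocab.txt")
    with open(path, "w", encoding="utf-8") as writer:
        writer.write("[PAD]\n[UNK]\nhello\n##s\n")
    vocab = load_vocab_sketch(path)

# The inverse mapping supports converting ids back to tokens.
ids_to_tokens = collections.OrderedDict((index, token) for token, index in vocab.items())

ids = [vocab[token] for token in ["hello", "##s"]]
print(ids)                              # [2, 3]
print([ids_to_tokens[i] for i in ids])  # ['hello', '##s']
```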
Usage Examples
from pytorch_pretrained_bert.tokenization import BertTokenizer
# Load from pretrained model name
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize text
tokens = tokenizer.tokenize("Hello, how are you?")
# ['hello', ',', 'how', 'are', 'you', '?']
# Convert to IDs
ids = tokenizer.convert_tokens_to_ids(tokens)
# Convert back to tokens
recovered_tokens = tokenizer.convert_ids_to_tokens(ids)