

Implementation:Microsoft DeepSpeedExamples BingBert Tokenization

From Leeroopedia


Knowledge Sources
Domains: Tokenization, NLP
Last Updated: 2026-02-07 12:00 GMT

Overview

BERT tokenization classes for the Bing BERT pipeline. They tokenize text end to end: basic tokenization (whitespace and punctuation splitting) followed by WordPiece subword splitting.

Description

This module provides the BertTokenizer class that performs end-to-end tokenization by combining punctuation-based basic tokenization with WordPiece subword splitting. It supports loading vocabulary files from local paths or downloading pretrained vocabularies from HuggingFace model archives for various BERT variants (base/large, cased/uncased, multilingual, Chinese).

The BasicTokenizer handles Unicode normalization, whitespace tokenization, punctuation splitting, accent stripping, and optional lowercasing. It respects a configurable set of never-split tokens (such as [UNK], [SEP], [PAD], [CLS], [MASK]) to preserve special tokens through the tokenization pipeline.
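The steps listed above can be sketched in a few lines. This is a simplified illustration, not the module's code: the helper names (`strip_accents`, `split_on_punctuation`, `basic_tokenize`) are hypothetical, punctuation is detected only through Unicode category `P*`, and the real tokenizer may treat additional characters (for example some ASCII symbols or CJK characters) specially.

```python
import unicodedata

# Special tokens that pass through unchanged (same defaults as in the signature).
NEVER_SPLIT = ("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]")


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def split_on_punctuation(token):
    """Emit each punctuation character as its own token."""
    pieces, current = [], ""
    for ch in token:
        if unicodedata.category(ch).startswith("P"):
            if current:
                pieces.append(current)
                current = ""
            pieces.append(ch)
        else:
            current += ch
    if current:
        pieces.append(current)
    return pieces


def basic_tokenize(text, do_lower_case=True, never_split=NEVER_SPLIT):
    """Whitespace-split, then lowercase, strip accents, and split punctuation."""
    output = []
    for token in text.strip().split():
        if token in never_split:
            output.append(token)  # keep special tokens intact
            continue
        if do_lower_case:
            token = strip_accents(token.lower())
        output.extend(split_on_punctuation(token))
    return output


basic_tokens = basic_tokenize("Héllo, world! [SEP]")
# ['hello', ',', 'world', '!', '[SEP]']
```

Without the never-split check, "[SEP]" would be broken into "[", "SEP", and "]" because the brackets count as punctuation, which is why the special tokens are excluded before any splitting happens.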

The WordpieceTokenizer implements the WordPiece algorithm that greedily splits tokens into the longest matching subword pieces from the vocabulary, prefixing continuation pieces with "##". This enables handling of out-of-vocabulary words by decomposing them into known subword units.
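A minimal sketch of the greedy longest-match-first procedure described above may help. The function name `wordpiece_split` and the toy vocabulary are made up for illustration; the parameters `unk_token` and `max_input_chars_per_word` mirror the signature listed under Code Reference, but the module's actual implementation may differ in detail.

```python
def wordpiece_split(token, vocab, unk_token="[UNK]", max_input_chars_per_word=200):
    """Split one token into the longest matching vocabulary pieces, left to right."""
    if len(token) > max_input_chars_per_word:
        return [unk_token]

    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches at this position: the whole token becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, not taken from any real BERT vocabulary file.
toy_vocab = {"un", "##aff", "##able", "play", "##ing"}

pieces_unaffable = wordpiece_split("unaffable", toy_vocab)  # ['un', '##aff', '##able']
pieces_playing = wordpiece_split("playing", toy_vocab)      # ['play', '##ing']
pieces_unknown = wordpiece_split("xyz", toy_vocab)          # ['[UNK]']
```

In this sketch a token that cannot be fully covered by vocabulary pieces maps to a single unknown token rather than a partial split, so the output never mixes real pieces with leftovers.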

Usage

Use this module for tokenizing text inputs before feeding them into any BERT model within the Bing BERT training or inference pipeline. It is the standard tokenization component required by all BERT model variants in this codebase.

Code Reference

Source Location

Signature

def load_vocab(vocab_file)
def whitespace_tokenize(text)

class BertTokenizer(object):
    def __init__(self, vocab_file, do_lower_case=True, max_len=None, never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"))
    def tokenize(self, text)
    def convert_tokens_to_ids(self, tokens)
    def convert_ids_to_tokens(self, ids)
    @classmethod
    def from_pretrained(cls, pretrained_model_name, cache_dir=None, *inputs, **kwargs)

class BasicTokenizer(object):
    def __init__(self, do_lower_case=True, never_split=("[UNK]", "[SEP]", "[PAD]", "[CLS]", "[MASK]"))
    def tokenize(self, text)

class WordpieceTokenizer(object):
    def __init__(self, vocab, unk_token="[UNK]", max_input_chars_per_word=200)
    def tokenize(self, text)

Import

from pytorch_pretrained_bert.tokenization import BertTokenizer, BasicTokenizer, WordpieceTokenizer

I/O Contract

Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| vocab_file | str | Yes | Path to the vocabulary file; to load by pretrained model name, use from_pretrained() instead |
| text | str | Yes | Raw text string to tokenize |
| do_lower_case | bool | No | Whether to lowercase input text; default True |
| max_len | int | No | Maximum sequence length; raises an error if exceeded |
| never_split | tuple | No | Special tokens that are never split during tokenization |

Outputs

| Name | Type | Description |
|---|---|---|
| tokens | list[str] | List of WordPiece tokens from tokenize() |
| ids | list[int] | List of vocabulary indices from convert_tokens_to_ids() |
| vocab | OrderedDict | Token-to-index mapping loaded from vocabulary file |
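To illustrate the `vocab` output, the sketch below builds a token-to-index mapping from a vocabulary file, assuming the usual BERT layout of one token per line with the index equal to the zero-based line number. The name `load_vocab_sketch` and the four-token file are hypothetical; the module's own `load_vocab` may differ in detail.

```python
import collections
import os
import tempfile


def load_vocab_sketch(vocab_file):
    """Map each token to its zero-based line number in the file."""
    vocab = collections.OrderedDict()
    with open(vocab_file, "r", encoding="utf-8") as reader:
        for index, line in enumerate(reader):
            vocab[line.strip()] = index
    return vocab


# Write a tiny, made-up vocabulary file for the demonstration.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as handle:
    handle.write("[PAD]\n[UNK]\nhello\n##ing\n")
    toy_vocab_path = handle.name

toy_vocab = load_vocab_sketch(toy_vocab_path)
os.remove(toy_vocab_path)

# The inverse mapping is what an ids-to-tokens conversion needs.
ids_to_tokens = collections.OrderedDict(
    (index, token) for token, index in toy_vocab.items()
)

toy_ids = [toy_vocab[token] for token in ["hello", "##ing"]]  # [2, 3]
round_trip = [ids_to_tokens[i] for i in toy_ids]              # ['hello', '##ing']
```

Keeping insertion order in an `OrderedDict` means iterating over the mapping yields tokens in file order, which makes the inverse mapping straightforward to build.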

Usage Examples

from pytorch_pretrained_bert.tokenization import BertTokenizer

# Load from pretrained model name
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize text
tokens = tokenizer.tokenize("Hello, how are you?")
# ['hello', ',', 'how', 'are', 'you', '?']

# Convert to IDs
ids = tokenizer.convert_tokens_to_ids(tokens)

# Convert back to tokens
recovered_tokens = tokenizer.convert_ids_to_tokens(ids)
