
Implementation:Tensorflow Tfjs GPT2Tokenizer Constructor

From Leeroopedia


Summary

GPT2Tokenizer is a byte-level BPE tokenizer for GPT-2 models in TensorFlow.js. It extends BytePairTokenizer and provides tokenization and detokenization of text using a vocabulary map and ordered merge rules. The constructor accepts a vocabulary (token-to-ID mapping) and a list of BPE merge rules.

API

new GPT2Tokenizer(args: GPT2TokenizerArgs) (extends BytePairTokenizer)

Source

  • tfjs-layers/src/layers/nlp/models/gpt2/gpt2_tokenizer.ts:L66-116 (GPT2Tokenizer)
  • tfjs-layers/src/layers/nlp/tokenizers.ts:L233-571 (BytePairTokenizer)

Type

API Doc

Signatures

GPT2Tokenizer

// GPT2TokenizerArgs extends LayerArgs
interface GPT2TokenizerArgs extends LayerArgs {
  vocabulary: Map<string, number>;  // token-to-id mapping
  merges: string[];                 // BPE merge rules ordered by priority
}

class GPT2Tokenizer extends BytePairTokenizer {
  constructor(args: GPT2TokenizerArgs)
  get endTokenId(): number
  get startTokenId(): number
  get padTokenId(): number
}

BytePairTokenizer (Parent Class)

interface BytePairTokenizerArgs extends LayerArgs {
  vocabulary: Map<string, number>;
  merges: string[];
  sequenceLength?: number;
  addPrefixSpace?: boolean;
  unsplittableTokens?: string[];
}

class BytePairTokenizer extends Tokenizer {
  tokenize(inputs: Tensor): Tensor[]
  detokenize(inputs: Tensor[]): Tensor
  get vocabulary(): string[]
  get vocabularySize(): number
  idToToken(id: number): string | undefined
  tokenToId(token: string): number | undefined
}

Constructor Parameters

  • vocabulary (Map<string, number>, required): Token-to-ID mapping for the full vocabulary
  • merges (string[], required): BPE merge rules ordered by priority (most frequent pairs first)
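The relationship between the two arguments can be sketched with toy data: each merge rule is a space-separated token pair, applied in priority order to a word split into pieces. This is a simplified illustration with an invented vocabulary and one linear pass per rule, not the actual byte-level implementation, which ranks pairs and operates on UTF-8 bytes:

```typescript
// Toy illustration of the two constructor inputs. The vocabulary and
// merge rules here are invented for the example, not real GPT-2 data.
const vocabulary = new Map<string, number>([
  ['l', 0], ['o', 1], ['w', 2], ['lo', 3], ['low', 4],
]);
const merges = ['l o', 'lo w'];  // highest-priority pair first

// Apply each merge rule in order, fusing adjacent matching pieces.
function bpe(word: string, merges: string[]): string[] {
  let pieces = word.split('');
  for (const rule of merges) {
    const [a, b] = rule.split(' ');
    const next: string[] = [];
    let i = 0;
    while (i < pieces.length) {
      if (pieces[i] === a && pieces[i + 1] === b) {
        next.push(a + b);  // merge the pair into one piece
        i += 2;
      } else {
        next.push(pieces[i]);
        i += 1;
      }
    }
    pieces = next;
  }
  return pieces;
}

const tokens = bpe('low', merges);               // ['low']
const ids = tokens.map(t => vocabulary.get(t));  // [4]
```

After all merges apply, each remaining piece is looked up in the vocabulary map to produce token IDs, which is why every string reachable by the merge rules must have a vocabulary entry.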

Properties

  • endTokenId (number): The token ID for the end-of-text token (<|endoftext|>)
  • startTokenId (number): The token ID for the start token (same as the end token in GPT-2)
  • padTokenId (number): The token ID for the padding token
  • vocabulary (string[], inherited): List of all tokens in the vocabulary
  • vocabularySize (number, inherited): Total number of tokens in the vocabulary
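Because GPT-2 has no dedicated start token, the start and end getters resolve to the same <|endoftext|> entry. A sketch of that lookup, using the ID this token has in the published GPT-2 vocabulary (50256); the one-entry map stands in for the full vocabulary:

```typescript
// Illustrative lookup mirroring the special-token getters. In the
// published GPT-2 vocabulary, <|endoftext|> has ID 50256.
const vocab = new Map<string, number>([['<|endoftext|>', 50256]]);

const endTokenId = vocab.get('<|endoftext|>')!;
const startTokenId = endTokenId;  // GPT-2 reuses end-of-text as the start token
```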

Methods (Inherited from BytePairTokenizer)

  • tokenize(inputs: Tensor): Tensor[]: Tokenizes an input text tensor into arrays of token-ID tensors
  • detokenize(inputs: Tensor[]): Tensor: Converts token-ID tensors back to a text tensor
  • idToToken(id: number): string | undefined: Looks up the token string for a given ID
  • tokenToId(token: string): number | undefined: Looks up the ID for a given token string
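The two lookup helpers are inverse maps over the vocabulary, returning undefined on a miss. A minimal sketch with a two-entry map (the IDs happen to match the published GPT-2 vocabulary for these strings, but treat them as illustrative):

```typescript
// Toy vocabulary standing in for the tokenizer's full vocab map.
const gpt2Vocab = new Map<string, number>([['Hello', 15496], [',', 11]]);
const reverse = new Map([...gpt2Vocab].map(([t, id]) => [id, t]));

function tokenToId(token: string): number | undefined {
  return gpt2Vocab.get(token);   // undefined for out-of-vocabulary strings
}
function idToToken(id: number): string | undefined {
  return reverse.get(id);        // undefined for unknown IDs
}

tokenToId('Hello');    // 15496
idToToken(11);         // ','
tokenToId('missing');  // undefined
```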

Import

GPT2Tokenizer is constructed internally by GPT2Preprocessor or loaded from configuration. It is not typically imported directly by end users.

I/O

  • Inputs: vocabulary as Map<string, number>, merges as string[]
  • Outputs: A GPT2Tokenizer instance with tokenize and detokenize methods
    • tokenize(Tensor) → Tensor[] of token IDs
    • detokenize(Tensor[]) → Tensor of text

Example

import * as tf from '@tensorflow/tfjs';

// vocabMap (Map<string, number>) and mergesList (string[]) must be
// loaded beforehand, e.g. from GPT-2's vocab.json and merges.txt files.
const tokenizer = new GPT2Tokenizer({
  vocabulary: vocabMap,  // Map<string, number>
  merges: mergesList,    // string[]
});

// Tokenize: 1-D string tensor in, one token-ID tensor per input string out
const inputText = tf.tensor1d(['Hello, world!'], 'string');
const tokenIds = tokenizer.tokenize(inputText);

// Detokenize: token-ID tensors back to a string tensor
const decoded = tokenizer.detokenize(tokenIds);

Implements

Principle:Tensorflow_Tfjs_BPE_Tokenization

Environment:Tensorflow_Tfjs_Browser_Runtime

Domains

NLP Tokenization

Sources

TensorFlow.js

Metadata

2026-02-10 00:00 GMT
