Implementation:Tensorflow Tfjs GPT2Tokenizer Constructor
Summary
GPT2Tokenizer is a byte-level BPE tokenizer for GPT-2 models in TensorFlow.js. It extends BytePairTokenizer and provides tokenization and detokenization of text using a vocabulary map and ordered merge rules. The constructor accepts a vocabulary (token-to-ID mapping) and a list of BPE merge rules.
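To illustrate how ordered merge rules drive byte-pair encoding, here is a toy sketch (not the tfjs-layers implementation): adjacent symbol pairs are repeatedly merged, always choosing the pair whose rule appears earliest in the merge list.

```typescript
// Toy sketch of BPE merging (illustration only, not the tfjs-layers
// implementation). Merge rules are applied in priority order:
// earlier rules (more frequent pairs) win.
function applyMerges(symbols: string[], merges: string[]): string[] {
  // Map each rule like "l l" to its priority (lower = applied first).
  const rank = new Map<string, number>();
  merges.forEach((rule, i) => rank.set(rule, i));

  while (true) {
    // Find the adjacent pair with the best (lowest) merge rank.
    let bestIdx = -1;
    let bestRank = Infinity;
    for (let i = 0; i < symbols.length - 1; i++) {
      const r = rank.get(`${symbols[i]} ${symbols[i + 1]}`);
      if (r !== undefined && r < bestRank) {
        bestRank = r;
        bestIdx = i;
      }
    }
    if (bestIdx < 0) break;  // no applicable rule left
    // Merge the winning pair into a single symbol.
    symbols = [
      ...symbols.slice(0, bestIdx),
      symbols[bestIdx] + symbols[bestIdx + 1],
      ...symbols.slice(bestIdx + 2),
    ];
  }
  return symbols;
}

// "hello" split into single characters, merged with a tiny rule set.
const out = applyMerges(['h', 'e', 'l', 'l', 'o'], ['l l', 'h e', 'll o']);
console.log(out);  // → ['he', 'llo']
```

Note that `l l` merges before `h e` even though `h e` occurs earlier in the string, because rule order, not position, decides priority.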
API
new GPT2Tokenizer(args: GPT2TokenizerArgs) extending BytePairTokenizer
Source
tfjs-layers/src/layers/nlp/models/gpt2/gpt2_tokenizer.ts:L66-116 (GPT2Tokenizer)
tfjs-layers/src/layers/nlp/tokenizers.ts:L233-571 (BytePairTokenizer)
Type
API Doc
Signatures
GPT2Tokenizer
```typescript
// GPT2TokenizerArgs extends LayerArgs
interface GPT2TokenizerArgs extends LayerArgs {
  vocabulary: Map<string, number>;  // token-to-ID mapping
  merges: string[];                 // BPE merge rules ordered by priority
}

class GPT2Tokenizer extends BytePairTokenizer {
  constructor(args: GPT2TokenizerArgs);
  get endTokenId(): number;
  get startTokenId(): number;
  get padTokenId(): number;
}
```
BytePairTokenizer (Parent Class)
```typescript
interface BytePairTokenizerArgs extends LayerArgs {
  vocabulary: Map<string, number>;
  merges: string[];
  sequenceLength?: number;
  addPrefixSpace?: boolean;
  unsplittableTokens?: string[];
}

class BytePairTokenizer extends Tokenizer {
  tokenize(inputs: Tensor): Tensor[];
  detokenize(inputs: Tensor[]): Tensor;
  get vocabulary(): string[];
  get vocabularySize(): number;
  idToToken(id: number): string | undefined;
  tokenToId(token: string): number | undefined;
}
```
Constructor Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| vocabulary | Map<string, number> | Yes | Token-to-ID mapping for the full vocabulary |
| merges | string[] | Yes | BPE merge rules ordered by priority (most frequent pairs first) |
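In practice the vocabulary and merges come from GPT-2's `vocab.json` and `merges.txt` assets. A hedged sketch of converting that raw data into the constructor's expected shapes (the tiny JSON and merge strings below stand in for the real ~50k-entry assets):

```typescript
// Illustrative only: converting raw GPT-2 asset data into the shapes
// the constructor expects. The inline strings are tiny stand-ins for
// the real vocab.json / merges.txt contents.
const vocabJson = '{"hello": 31373, ",": 11, "<|endoftext|>": 50256}';
const mergesText = 'h e\nl l\nhe ll\nhell o';

// vocabulary: Map<string, number> (token -> ID)
const vocabMap = new Map<string, number>(
  Object.entries(JSON.parse(vocabJson)) as Array<[string, number]>
);

// merges: string[], one rule per line; line order is merge priority
const mergesList = mergesText.split('\n').filter(line => line.length > 0);

console.log(vocabMap.get('hello'));  // 31373
console.log(mergesList.length);      // 4
```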
Properties
| Property | Return Type | Description |
|---|---|---|
| endTokenId | number | The token ID for the end-of-text token (<\|endoftext\|>) |
| startTokenId | number | The token ID for the start token (same as the end token in GPT-2) |
| padTokenId | number | The token ID for the padding token |
| vocabulary | string[] | List of all tokens in the vocabulary (inherited) |
| vocabularySize | number | Total number of tokens in the vocabulary (inherited) |
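The special-token getters resolve IDs from the vocabulary; in GPT-2 the start token is the same `<|endoftext|>` token as the end token. A minimal sketch of that lookup against a toy vocabulary (in the real GPT-2 vocabulary, `<|endoftext|>` maps to ID 50256):

```typescript
// Sketch of the special-token lookups against a toy vocabulary
// (illustration only; token IDs here are arbitrary).
const vocab = new Map<string, number>([
  ['hello', 0],
  ['world', 1],
  ['<|endoftext|>', 2],
]);

const END_TOKEN = '<|endoftext|>';
const endTokenId = vocab.get(END_TOKEN)!;  // end-of-text ID
const startTokenId = endTokenId;           // GPT-2 reuses the same token

console.log(endTokenId === startTokenId);  // true
```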
Methods (Inherited from BytePairTokenizer)
| Method | Signature | Description |
|---|---|---|
| tokenize | tokenize(inputs: Tensor): Tensor[] | Tokenizes an input text tensor into arrays of token ID tensors |
| detokenize | detokenize(inputs: Tensor[]): Tensor | Converts token ID tensors back to a text tensor |
| idToToken | idToToken(id: number): string \| undefined | Looks up the token string for a given ID |
| tokenToId | tokenToId(token: string): number \| undefined | Looks up the ID for a given token string |
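The two lookup methods are inverses of each other and return undefined for unknown inputs. A plain-Map sketch of that behavior (the real methods are backed by the tokenizer's vocabulary; the entries here are illustrative):

```typescript
// Sketch of the inherited ID/token lookup pair using a plain Map
// (illustrative entries; not the tokenizer's internal storage).
const tokenToIdMap = new Map<string, number>([
  ['hello', 31373],
  ['world', 995],
]);
const idToTokenMap = new Map<number, string>(
  [...tokenToIdMap].map(([tok, id]) => [id, tok] as [number, string])
);

// Both lookups return undefined for unknown inputs, mirroring the
// string | undefined and number | undefined return types above.
function tokenToId(token: string): number | undefined {
  return tokenToIdMap.get(token);
}
function idToToken(id: number): string | undefined {
  return idToTokenMap.get(id);
}

console.log(tokenToId('hello'));    // 31373
console.log(idToToken(31373));      // 'hello'
console.log(tokenToId('missing'));  // undefined
```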
Import
GPT2Tokenizer is constructed internally by GPT2Preprocessor or loaded from configuration. It is not typically imported directly by end users.
I/O
- Inputs: vocabulary as Map<string, number>, merges as string[]
- Outputs: a GPT2Tokenizer instance with tokenize and detokenize methods
  - tokenize(Tensor) → Tensor[] of token IDs
  - detokenize(Tensor[]) → Tensor of text
Example
```typescript
const tokenizer = new GPT2Tokenizer({
  vocabulary: vocabMap,  // Map<string, number>
  merges: mergesList,    // string[]
});

// Tokenize
const inputText = tf.tensor1d(['Hello, world!'], 'string');
const tokenIds = tokenizer.tokenize(inputText);

// Detokenize
const decoded = tokenizer.detokenize(tokenIds);
```
Implements
Principle:Tensorflow_Tfjs_BPE_Tokenization
Environment:Tensorflow_Tfjs_Browser_Runtime
Related Pages
Environments
- Environment:Tensorflow_Tfjs_Browser_Runtime -- Browser runtime (WebGL / WebGPU / WASM / CPU backends)