Implementation:Tensorflow Tfjs GPT2Preprocessor Constructor
Summary
GPT2Preprocessor prepares raw text for GPT-2 model input by tokenizing, adding start/end tokens, padding, and generating attention masks. It combines GPT2Tokenizer for tokenization with StartEndPacker for sequence packing into fixed-length tensors.
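The packing step described above can be sketched in plain TypeScript, without TensorFlow.js. This is a minimal illustration, not the tfjs-layers implementation: `packSequence`, `PackOptions`, and `Packed` are hypothetical names, the token IDs are illustrative, and truncating after the start/end values are attached is an assumption about overflow handling.

```typescript
// Hypothetical stand-in for the pack step; NOT the tfjs-layers code.
interface PackOptions {
  sequenceLength: number; // fixed output length (assumed non-negative integer)
  startValue?: number;    // prepended when defined
  endValue?: number;      // appended when defined
  padValue?: number;      // fills the tail; defaults to 0 in this sketch
}

interface Packed {
  tokenIds: number[];
  paddingMask: number[]; // 1 = real token, 0 = padding
}

function packSequence(tokens: number[], opts: PackOptions): Packed {
  const { sequenceLength, startValue, endValue, padValue = 0 } = opts;
  const body = [...tokens];
  if (startValue !== undefined) body.unshift(startValue);
  if (endValue !== undefined) body.push(endValue);

  // Assumption: overflow is handled by simple truncation to sequenceLength.
  const kept = body.slice(0, sequenceLength);
  const padCount = sequenceLength - kept.length;

  return {
    tokenIds: [...kept, ...new Array<number>(padCount).fill(padValue)],
    paddingMask: [...kept.map(() => 1), ...new Array<number>(padCount).fill(0)],
  };
}

const packed = packSequence([7454, 2402, 257, 640], {
  sequenceLength: 8,
  startValue: 1,
  endValue: 2,
  padValue: 0,
});
console.log(JSON.stringify(packed.tokenIds));    // [1,7454,2402,257,640,2,0,0]
console.log(JSON.stringify(packed.paddingMask)); // [1,1,1,1,1,1,0,0]
```

The mask is built from the number of kept tokens rather than by comparing against the pad value, so a real token that happens to equal the pad value is still marked as 1.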
API
new GPT2Preprocessor(args: GPT2PreprocessorArgs) + StartEndPacker
Source
- tfjs-layers/src/layers/nlp/models/gpt2/gpt2_preprocessor.ts:L126-218 (GPT2Preprocessor)
- tfjs-layers/src/layers/nlp/preprocessing/start_end_packer.ts:L86-198 (StartEndPacker)
Type
API Doc
Signatures
GPT2Preprocessor
interface GPT2PreprocessorArgs extends LayerArgs {
  tokenizer: GPT2Tokenizer;
  sequenceLength?: number;  // default 1024
  addStartToken?: boolean;  // default true
  addEndToken?: boolean;    // default true
}
class GPT2Preprocessor extends Preprocessor {
  constructor(args: GPT2PreprocessorArgs)
  call(inputs: Tensor|Tensor[], kwargs: GPT2PreprocessorOptions): Tensor|Tensor[]
  callAndPackArgs(inputs, kwargs): NamedTensorMap | [NamedTensorMap, Tensor] | [NamedTensorMap, Tensor, Tensor]
}
StartEndPacker
interface StartEndPackerArgs extends LayerArgs {
  sequenceLength: number;
  startValue?: number|string;
  endValue?: number|string;
  padValue?: number|string;
}
class StartEndPacker extends Layer {
  call(inputs: Tensor|Tensor[], kwargs?: StartEndPackerOptions): Tensor|Tensor2D
  callAndReturnPaddingMask(inputs, kwargs?): [Tensor1D|Tensor2D, Tensor1D|Tensor2D]
}
Constructor Parameters
GPT2PreprocessorArgs
| Parameter | Type | Default | Description |
|---|---|---|---|
| tokenizer | GPT2Tokenizer | (required) | The GPT-2 tokenizer instance used to encode text |
| sequenceLength | number | 1024 | Target length for all output sequences |
| addStartToken | boolean | true | Whether to prepend the start-of-sequence token |
| addEndToken | boolean | true | Whether to append the end-of-sequence token |
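As a small illustration of how the optional parameters and their documented defaults fit together, the sketch below resolves a partial argument object. `PreprocessorArgsSketch` and `resolveArgs` are hypothetical names introduced here; the required `tokenizer` field is left out so the snippet stays self-contained, and this is not how the library itself is known to apply its defaults.

```typescript
// Hypothetical mirror of the optional fields in GPT2PreprocessorArgs.
interface PreprocessorArgsSketch {
  sequenceLength?: number;
  addStartToken?: boolean;
  addEndToken?: boolean;
}

// Fill in the defaults listed in the table above.
function resolveArgs(
  args: PreprocessorArgsSketch,
): Required<PreprocessorArgsSketch> {
  return {
    // `??` only falls back on null/undefined, so an explicit `false` survives.
    sequenceLength: args.sequenceLength ?? 1024,
    addStartToken: args.addStartToken ?? true,
    addEndToken: args.addEndToken ?? true,
  };
}

const resolved = resolveArgs({ addStartToken: false });
console.log(JSON.stringify(resolved));
// {"sequenceLength":1024,"addStartToken":false,"addEndToken":true}
```

Using `??` rather than `||` matters for the boolean flags: with `||`, passing `addStartToken: false` would silently revert to `true`.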
StartEndPackerArgs
| Parameter | Type | Default | Description |
|---|---|---|---|
| sequenceLength | number | (required) | Fixed output length for packed sequences |
| startValue | number \| string | undefined | Value to prepend as start token |
| endValue | number \| string | undefined | Value to append as end token |
| padValue | number \| string | undefined | Value to use for padding |
Methods
| Method | Signature | Description |
|---|---|---|
| call | (inputs: Tensor \| Tensor[], kwargs: GPT2PreprocessorOptions): Tensor \| Tensor[] | Preprocesses input text tensors |
| callAndPackArgs | (inputs, kwargs): NamedTensorMap \| [NamedTensorMap, Tensor] \| [NamedTensorMap, Tensor, Tensor] | Preprocesses and packs into a named tensor map with token IDs and padding mask |
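The three-way return type of callAndPackArgs suggests that the feature map is optionally bundled with further tensors. The sketch below shows one plausible reading, assuming the extra elements are a label and a sample weight supplied by the caller. `packArgsSketch` and its parameter names are hypothetical, and this interpretation is an assumption rather than something the signature alone confirms.

```typescript
// Hypothetical illustration of a union return like
// X | [X, Y] | [X, Y, Y]; not the library's packing helper.
type PackedArgs<X, Y> = X | [X, Y] | [X, Y, Y];

function packArgsSketch<X, Y>(
  x: X,
  y?: Y,
  sampleWeight?: Y,
): PackedArgs<X, Y> {
  if (y === undefined) return x;                 // features only
  if (sampleWeight === undefined) return [x, y]; // features + labels
  return [x, y, sampleWeight];                   // features + labels + weights
}

// Plain arrays stand in for tensors so the snippet runs without tfjs.
const features = { tokenIds: [[1, 2, 0]], paddingMask: [[1, 1, 0]] };
const onlyFeatures = packArgsSketch(features);
const withLabels = packArgsSketch(features, [1]);
const withWeights = packArgsSketch(features, [1], [0.5]);
```

Callers of an API shaped like this usually need to check `Array.isArray(result)` before reading fields such as `tokenIds`, since the bare map is only returned in the first case.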
I/O
- Inputs: GPT2Tokenizer instance + raw text strings as a Tensor of dtype 'string'
- Outputs: NamedTensorMap containing:
  - tokenIds: Tensor2D of padded token IDs
  - paddingMask: Tensor2D where 1 = real token, 0 = padding
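To show how the two outputs relate, the following sketch consumes a `tokenIds`/`paddingMask` pair using plain nested arrays in place of Tensor2D values. `realTokenCounts` and `stripPadding` are hypothetical helpers written for this page, and the IDs are illustrative only.

```typescript
// Count the real (non-padding) tokens in each row of the mask.
function realTokenCounts(paddingMask: number[][]): number[] {
  return paddingMask.map((row) => row.reduce((sum, m) => sum + m, 0));
}

// Drop padded positions, using the mask rather than the pad value itself.
function stripPadding(
  tokenIds: number[][],
  paddingMask: number[][],
): number[][] {
  return tokenIds.map((row, i) =>
    row.filter((_, j) => paddingMask[i]?.[j] === 1),
  );
}

const tokenIds = [
  [1, 7454, 2402, 257, 640, 2, 0, 0],
  [1, 640, 2, 0, 0, 0, 0, 0],
];
const paddingMask = [
  [1, 1, 1, 1, 1, 1, 0, 0],
  [1, 1, 1, 0, 0, 0, 0, 0],
];

console.log(JSON.stringify(realTokenCounts(paddingMask))); // [6,3]
console.log(JSON.stringify(stripPadding(tokenIds, paddingMask)));
// [[1,7454,2402,257,640,2],[1,640,2]]
```

Filtering by the mask instead of by `id !== padValue` keeps the result correct even when the pad value coincides with a valid vocabulary ID.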
Example
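// Assumes `tf` is the TensorFlow.js namespace and `tokenizer` is an
// existing GPT2Tokenizer instance.
const preprocessor = new GPT2Preprocessor({
  tokenizer: tokenizer,
  sequenceLength: 128,
  addStartToken: true,
  addEndToken: true,
});
// A batch containing one raw string, dtype 'string'.
const text = tf.tensor1d(['Once upon a time'], 'string');
const processed = preprocessor.callAndPackArgs(text, {});
// Illustrative values only; actual IDs depend on the tokenizer's vocabulary.
// processed.tokenIds:    [[1, 7454, 2402, 257, 640, 2, 0, 0, ...]]  (shape [1, 128])
// processed.paddingMask: [[1, 1, 1, 1, 1, 1, 0, 0, ...]]           (shape [1, 128])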
Implements
Principle:Tensorflow_Tfjs_Sequence_Preprocessing
Environment:Tensorflow_Tfjs_Browser_Runtime
Related Pages
Environments
- Environment:Tensorflow_Tfjs_Browser_Runtime -- Browser runtime (WebGL / WebGPU / WASM / CPU backends)