Implementation:Tensorflow Tfjs GPT2Preprocessor Constructor

Summary

GPT2Preprocessor prepares raw text for GPT-2 model input: it tokenizes the text, optionally adds start and end tokens, pads each sequence to a fixed length, and generates a padding mask. It combines GPT2Tokenizer for tokenization with StartEndPacker, which packs the sequences into fixed-length tensors.
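
The packing step described above can be sketched with plain arrays standing in for tensors. This is a minimal illustration under stated assumptions, not the library's implementation: `packSequence` and its option names are hypothetical, token IDs are passed in directly rather than produced by a tokenizer, and the truncation rule (trim the body so the end value still fits) is an assumption. The start, end, and pad values (1, 2, 0) mirror the illustrative values in this page's Example.

```typescript
// Hypothetical sketch: plain number arrays stand in for tensors.
interface PackOptions {
  sequenceLength: number;
  startValue?: number;  // omitted means "do not prepend"
  endValue?: number;    // omitted means "do not append"
  padValue: number;
}

interface PackedSequence {
  tokenIds: number[];
  paddingMask: number[];  // 1 = real token, 0 = padding
}

function packSequence(tokens: number[], opts: PackOptions): PackedSequence {
  const {sequenceLength, startValue, endValue, padValue} = opts;
  // Assumption: trim the body so the start/end values still fit.
  const reserved =
      (startValue === undefined ? 0 : 1) + (endValue === undefined ? 0 : 1);
  const body = tokens.slice(0, Math.max(0, sequenceLength - reserved));

  const real: number[] = [];
  if (startValue !== undefined) real.push(startValue);
  real.push(...body);
  if (endValue !== undefined) real.push(endValue);
  // Guard for a sequenceLength smaller than the number of reserved slots.
  real.length = Math.min(real.length, sequenceLength);

  const padCount = sequenceLength - real.length;
  return {
    tokenIds: [...real, ...new Array<number>(padCount).fill(padValue)],
    paddingMask: [
      ...new Array<number>(real.length).fill(1),
      ...new Array<number>(padCount).fill(0),
    ],
  };
}

const packed = packSequence(
    [7454, 2402, 257, 640],
    {sequenceLength: 8, startValue: 1, endValue: 2, padValue: 0});
console.log(packed.tokenIds, packed.paddingMask);
```

The mask is built from the count of real tokens rather than by comparing against the pad value, so a real token that happens to share the pad ID is still marked as real.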

API

new GPT2Preprocessor(args: GPT2PreprocessorArgs), which uses StartEndPacker for sequence packing

Source

tfjs-layers/src/layers/nlp/models/gpt2/gpt2_preprocessor.ts:L126-218 (GPT2Preprocessor)
tfjs-layers/src/layers/nlp/preprocessing/start_end_packer.ts:L86-198 (StartEndPacker)

Type

API Doc

Signatures

GPT2Preprocessor

interface GPT2PreprocessorArgs extends LayerArgs {
  tokenizer: GPT2Tokenizer;
  sequenceLength?: number;  // default 1024
  addStartToken?: boolean;  // default true
  addEndToken?: boolean;    // default true
}

class GPT2Preprocessor extends Preprocessor {
  constructor(args: GPT2PreprocessorArgs)
  call(inputs: Tensor|Tensor[], kwargs: GPT2PreprocessorOptions): Tensor|Tensor[]
  callAndPackArgs(inputs, kwargs): NamedTensorMap | [NamedTensorMap, Tensor] | [NamedTensorMap, Tensor, Tensor]
}

StartEndPacker

interface StartEndPackerArgs extends LayerArgs {
  sequenceLength: number;
  startValue?: number|string;
  endValue?: number|string;
  padValue?: number|string;
}

class StartEndPacker extends Layer {
  call(inputs: Tensor|Tensor[], kwargs?: StartEndPackerOptions): Tensor|Tensor2D
  callAndReturnPaddingMask(inputs, kwargs?): [Tensor1D|Tensor2D, Tensor1D|Tensor2D]
}

Constructor Parameters

GPT2PreprocessorArgs

Parameter	Type	Default	Description
`tokenizer`	`GPT2Tokenizer`	(required)	The GPT-2 tokenizer instance for encoding text
`sequenceLength`	`number`	1024	Target length for all output sequences
`addStartToken`	`boolean`	true	Whether to prepend the start-of-sequence token
`addEndToken`	`boolean`	true	Whether to append the end-of-sequence token
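
One way the defaults in this table could be applied is sketched below. It is an assumption-laden illustration, not the constructor's actual code: `PreprocessorArgsSketch` and `resolveArgs` are hypothetical names, and the required `tokenizer` argument is left out so the snippet stays self-contained.

```typescript
// Hypothetical sketch of default resolution; not the library's constructor.
interface PreprocessorArgsSketch {
  sequenceLength?: number;
  addStartToken?: boolean;
  addEndToken?: boolean;
}

function resolveArgs(
    args: PreprocessorArgsSketch): Required<PreprocessorArgsSketch> {
  return {
    // `??` only falls back on null/undefined, so an explicit `false` is kept.
    sequenceLength: args.sequenceLength ?? 1024,
    addStartToken: args.addStartToken ?? true,
    addEndToken: args.addEndToken ?? true,
  };
}

console.log(resolveArgs({sequenceLength: 128, addEndToken: false}));
```

Using `??` instead of `||` matters for the boolean flags: with `||`, passing `addEndToken: false` would silently fall back to `true`.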

StartEndPackerArgs

Parameter	Type	Default	Description
`sequenceLength`	`number`	(required)	Fixed output length for packed sequences
`startValue`	`number \| string`	undefined	Value to prepend as start token
`endValue`	`number \| string`	undefined	Value to append as end token
`padValue`	`number \| string`	undefined	Value to use for padding

Methods

Method	Signature	Description
`call`	(inputs: Tensor \| Tensor[], kwargs: GPT2PreprocessorOptions): Tensor \| Tensor[]	Preprocesses input text tensors
`callAndPackArgs`	(inputs, kwargs): NamedTensorMap \| [NamedTensorMap, Tensor] \| [NamedTensorMap, Tensor, Tensor]	Preprocesses and packs into named tensor map with token IDs and padding mask
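
Because `callAndPackArgs` returns either a bare map or a tuple whose first element is the map, callers typically need to narrow the result before reading `tokenIds` or `paddingMask`. The sketch below shows one way to do that. The types are simplified stand-ins (nested number arrays instead of tensors), and `featuresOf` is a hypothetical helper. This page does not say what the extra tuple elements contain, so the sketch ignores them.

```typescript
// Stand-in types for illustration; the real types come from TensorFlow.js.
type TensorLike = number[][];
type FeatureMap = {[name: string]: TensorLike};
type PackResult =
    FeatureMap | [FeatureMap, TensorLike] | [FeatureMap, TensorLike, TensorLike];

// Returns the feature map whether or not extra tensors were packed with it.
function featuresOf(result: PackResult): FeatureMap {
  return Array.isArray(result) ? result[0] : result;
}

const bare: PackResult = {tokenIds: [[1, 7454, 2]], paddingMask: [[1, 1, 1]]};
const withExtra: PackResult = [bare, [[0]]];
console.log(featuresOf(bare).tokenIds, featuresOf(withExtra).paddingMask);
```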

I/O

Inputs: a GPT2Tokenizer instance (passed to the constructor) and raw text strings as a Tensor of dtype 'string'
Outputs: NamedTensorMap containing:
- tokenIds: Tensor2D of padded token IDs
- paddingMask: Tensor2D where 1 = real token, 0 = padding
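
To illustrate how the two outputs relate, the sketch below uses the mask to count real tokens per row and to recover the unpadded IDs. Nested number arrays stand in for the Tensor2D values, and both helper names are hypothetical rather than part of TensorFlow.js.

```typescript
// Hypothetical helpers; number[][] stands in for a Tensor2D.
function realTokenCounts(paddingMask: number[][]): number[] {
  // Each mask entry is 1 or 0, so summing a row counts its real tokens.
  return paddingMask.map(row => row.reduce((sum, m) => sum + m, 0));
}

function stripPadding(
    tokenIds: number[][], paddingMask: number[][]): number[][] {
  // Keep only the positions where the mask marks a real token.
  return tokenIds.map(
      (row, i) => row.filter((_, j) => paddingMask[i][j] === 1));
}

const tokenIds = [[1, 7454, 2402, 2, 0, 0], [1, 640, 2, 0, 0, 0]];
const paddingMask = [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]];
console.log(realTokenCounts(paddingMask), stripPadding(tokenIds, paddingMask));
```

Filtering by the mask rather than by the pad value avoids dropping a real token whose ID happens to equal the pad ID.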

Example

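// Assumes `tf` is the imported TensorFlow.js namespace and `tokenizer`
// is an already constructed GPT2Tokenizer instance.
const preprocessor = new GPT2Preprocessor({
  tokenizer: tokenizer,
  sequenceLength: 128,
  addStartToken: true,
  addEndToken: true,
});

const text = tf.tensor1d(['Once upon a time'], 'string');
const processed = preprocessor.callAndPackArgs(text, {});
// Illustrative values only; the actual IDs depend on the tokenizer's vocabulary.
// processed.tokenIds: [1, 7454, 2402, 257, 640, 2, 0, 0, ...]
// processed.paddingMask: [1, 1, 1, 1, 1, 1, 0, 0, ...]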

Implements

Principle:Tensorflow_Tfjs_Sequence_Preprocessing

Environment:Tensorflow_Tfjs_Browser_Runtime

Domains

NLP Data_Preprocessing

Sources

TensorFlow.js

Related Pages

Environments

Environment:Tensorflow_Tfjs_Browser_Runtime -- Browser runtime (WebGL / WebGPU / WASM / CPU backends)

Metadata

2026-02-10 00:00 GMT
