Implementation: Gretel.ai gretel-synthetics Tokenizer Training Pipeline
| Knowledge Sources | Details |
|---|---|
| Domains | Synthetic_Data, Natural_Language_Processing, Tokenization |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
A concrete tool, provided by the gretel-synthetics library, for annotating raw training data and building tokenizer vocabularies.
Description
The tokenizer training pipeline is implemented through a class hierarchy rooted at BaseTokenizerTrainer. This abstract base class defines the two-phase workflow: annotate_data() reads the raw input, applies per-line transformations, and writes an annotated training file; train() then builds the vocabulary model and saves it alongside its settings to the checkpoint directory.
Two concrete implementations are provided (a minimal sketch of each core step appears after this list):
- CharTokenizerTrainer: Scans the annotated data to collect all unique characters, sorts them, and builds a simple char-to-index / index-to-char mapping serialized via cloudpickle.
- SentencePieceTokenizerTrainer: Replaces field delimiters with special tokens (e.g., <d>) and appends newline tokens (<n>) during annotation, then delegates to Google SentencePiece for vocabulary learning with configurable vocab size, character coverage, and sentence limits.
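The sketch below is illustrative only: the helper names are hypothetical and not part of the gretel-synthetics API, but the steps mirror the two descriptions above (unique-character vocabulary for the char trainer, delimiter/newline annotation for the SentencePiece trainer).
import cloudpickle

def build_char_vocab(lines):
    # CharTokenizerTrainer-style vocabulary: collect unique characters,
    # sort them, and build the two mappings that get cloudpickled.
    chars = sorted(set("".join(lines)))
    char2idx = {c: i for i, c in enumerate(chars)}
    idx2char = {i: c for c, i in char2idx.items()}
    return char2idx, idx2char

def annotate_line(line, field_delimiter=","):
    # SentencePieceTokenizerTrainer-style annotation: replace the field
    # delimiter with the <d> token and append the <n> newline token.
    return line.strip().replace(field_delimiter, "<d>") + "<n>"

char2idx, idx2char = build_char_vocab(["abc,def"])
with open("char2idx.p", "wb") as fh:  # file name per the Outputs table below
    cloudpickle.dump(char2idx, fh)
print(annotate_line("abc,def"))  # abc<d>def<n>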
All trainers persist their settings to a JSON file (tokenizer_params.json) in the checkpoint directory so that the corresponding tokenizer class can reload them for encoding and decoding during generation.
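As a minimal sketch of that reload step (the directory path here is hypothetical, and the exact keys in the file depend on the trainer class):
import json
import os

checkpoint_dir = "/path/to/model"  # hypothetical checkpoint directory
with open(os.path.join(checkpoint_dir, "tokenizer_params.json")) as fh:
    settings = json.load(fh)
print(settings)  # e.g. {"vocab_size": ...} for CharTokenizerTrainer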
Usage
Use CharTokenizerTrainer when working with data that has a small, well-defined character set or when you want character-level generation (set vocab_size=0 in the config). Use SentencePieceTokenizerTrainer (the default) for natural language or structured data where subword tokenization improves generation quality and reduces sequence length.
Code Reference
Source Location
- Repository: gretel-synthetics
- File: src/gretel_synthetics/tokenizers.py
- Lines: BaseTokenizerTrainer L101--210, CharTokenizerTrainer L324--360, SentencePieceTokenizerTrainer L421--532
Signature
BaseTokenizerTrainer:
class BaseTokenizerTrainer(Base):
vocab_size: int
config: BaseConfig
num_lines: int = 0
def __init__(self, *, config: BaseConfig, vocab_size: Optional[int] = None):
...
def annotate_data(self) -> Iterator[str]:
...
def train(self):
...
def data_iterator(self) -> Iterator[str]:
...
CharTokenizerTrainer:
class CharTokenizerTrainer(BaseTokenizerTrainer):
newline_str: str = "\n"
def _train(self):
...
def _get_save_settings(self):
return {"vocab_size": self.vocab_size}
SentencePieceTokenizerTrainer:
class SentencePieceTokenizerTrainer(BaseTokenizerTrainer):
vocab_size: int
character_coverage: float
pretrain_sentence_count: int
max_line_len: int
newline_str: str = "<n>"
def __init__(
self,
*,
character_coverage: float = 1.0,
pretrain_sentence_count: int = 1000000,
max_line_len: int = 2048,
**kwargs,
):
...
Import
from gretel_synthetics.tokenizers import (
BaseTokenizerTrainer,
CharTokenizerTrainer,
SentencePieceTokenizerTrainer,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | BaseConfig | Yes | Configuration object providing input_data_path, training_data_path, checkpoint_dir, field_delimiter, and field_delimiter_token |
| vocab_size | int | No | Maximum vocabulary size. For CharTokenizerTrainer, None means use all unique characters. For SentencePieceTokenizerTrainer, defaults to 20000 |
| character_coverage | float | No | Fraction of characters to cover (SentencePiece only, default: 1.0) |
| pretrain_sentence_count | int | No | Number of input lines for SentencePiece to load (default: 1000000) |
| max_line_len | int | No | Maximum line length for SentencePiece input (default: 2048) |
Outputs
| Name | Type | Description |
|---|---|---|
| Annotated training file | file | A processed training data file written to config.training_data_path |
| Tokenizer model files | files | For CharTokenizerTrainer: char2idx.p and idx2char.p (cloudpickle). For SentencePieceTokenizerTrainer: m.model and m.vocab |
| tokenizer_params.json | file | JSON file containing tokenizer settings, saved in config.checkpoint_dir |
| data_iterator | Iterator[str] | A generator that yields lines from the annotated training data file (returned by annotate_data()) |
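A short sketch of consuming that iterator, assuming a trainer constructed as in the examples below:
import itertools

# Per the table above, annotate_data() writes the annotated training file
# and yields its lines, so a few can be previewed before calling train().
for line in itertools.islice(trainer.annotate_data(), 3):
    print(line)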
Usage Examples
Character Tokenizer Example
from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.tokenizers import CharTokenizerTrainer
config = TensorFlowConfig(
input_data_path="/path/to/data.txt",
checkpoint_dir="/path/to/model",
vocab_size=0,
)
trainer = CharTokenizerTrainer(config=config)
trainer.annotate_data()
trainer.train()
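To sanity-check the result, the cloudpickled mapping named in the Outputs table can be read back directly (a hedged sketch, not an official API):
import os
import cloudpickle

with open(os.path.join("/path/to/model", "char2idx.p"), "rb") as fh:
    char2idx = cloudpickle.load(fh)
print(len(char2idx), "characters in the vocabulary")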
SentencePiece Tokenizer Example
from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.tokenizers import SentencePieceTokenizerTrainer
config = TensorFlowConfig(
input_data_path="/path/to/data.csv",
checkpoint_dir="/path/to/model",
field_delimiter=",",
)
trainer = SentencePieceTokenizerTrainer(
vocab_size=20000,
character_coverage=1.0,
pretrain_sentence_count=1000000,
config=config,
)
trainer.annotate_data()
trainer.train()
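Since training delegates to Google SentencePiece, the resulting m.model file can be inspected with the sentencepiece package itself (a hedged sketch; file names per the Outputs table):
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/path/to/model/m.model")
print(sp.GetPieceSize())  # at most the configured vocab_size (20000 here)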