Implementation: Gretel.ai gretel-synthetics Tokenizer Training Pipeline
| Knowledge Sources | Details |
|---|---|
| Domains | Synthetic_Data, Natural_Language_Processing, Tokenization |
| Last Updated | 2026-02-14 19:00 GMT |
Overview
A concrete tool, provided by the gretel-synthetics library, for annotating raw training data and building tokenizer vocabularies.
Description
The tokenizer training pipeline is implemented through a class hierarchy rooted at BaseTokenizerTrainer. This abstract base class defines the two-phase workflow: annotate_data() reads the raw input, applies per-line transformations, and writes an annotated training file; train() then builds the vocabulary model and saves it alongside its settings to the checkpoint directory.
Two concrete implementations are provided (a minimal sketch of each core step appears after this list):
- CharTokenizerTrainer: Scans the annotated data to collect all unique characters, sorts them, and builds a simple char-to-index / index-to-char mapping serialized via cloudpickle.
- SentencePieceTokenizerTrainer: Replaces field delimiters with special tokens (e.g., <d>) and appends newline tokens (<n>) during annotation, then delegates to Google SentencePiece for vocabulary learning with configurable vocab size, character coverage, and sentence limits.
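The sketch below is illustrative only: the helper names are hypothetical and not part of the gretel-synthetics API, but the steps mirror the two descriptions above (unique-character vocabulary for the char trainer, delimiter/newline annotation for the SentencePiece trainer).
import cloudpickle

def build_char_vocab(lines):
    # CharTokenizerTrainer-style vocabulary: collect unique characters,
    # sort them, and build the two mappings that get cloudpickled.
    chars = sorted(set("".join(lines)))
    char2idx = {c: i for i, c in enumerate(chars)}
    idx2char = {i: c for c, i in char2idx.items()}
    return char2idx, idx2char

def annotate_line(line, field_delimiter=","):
    # SentencePieceTokenizerTrainer-style annotation: replace the field
    # delimiter with the <d> token and append the <n> newline token.
    return line.strip().replace(field_delimiter, "<d>") + "<n>"

char2idx, idx2char = build_char_vocab(["abc,def"])
with open("char2idx.p", "wb") as fh:  # file name per the Outputs table below
    cloudpickle.dump(char2idx, fh)
print(annotate_line("abc,def"))  # abc<d>def<n>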
All trainers persist their settings to a JSON file (tokenizer_params.json) in the checkpoint directory so that the corresponding tokenizer class can reload them for encoding and decoding during generation.
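As a minimal sketch of that reload step (the directory path here is hypothetical, and the exact keys in the file depend on the trainer class):
import json
import os

checkpoint_dir = "/path/to/model"  # hypothetical checkpoint directory
with open(os.path.join(checkpoint_dir, "tokenizer_params.json")) as fh:
    settings = json.load(fh)
print(settings)  # e.g. {"vocab_size": ...} for CharTokenizerTrainer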
Usage
Use CharTokenizerTrainer when working with data that has a small, well-defined character set or when you want character-level generation (set vocab_size=0 in the config). Use SentencePieceTokenizerTrainer (the default) for natural language or structured data where subword tokenization improves generation quality and reduces sequence length.
Code Reference
Source Location
- Repository: gretel-synthetics
- File: src/gretel_synthetics/tokenizers.py
- Lines: BaseTokenizerTrainer L101--210, CharTokenizerTrainer L324--360, SentencePieceTokenizerTrainer L421--532
Signature
BaseTokenizerTrainer:
class BaseTokenizerTrainer(Base):
vocab_size: int
config: BaseConfig
num_lines: int = 0
def __init__(self, *, config: BaseConfig, vocab_size: Optional[int] = None):
...
def annotate_data(self) -> Iterator[str]:
...
def train(self):
...
def data_iterator(self) -> Iterator[str]:
...
CharTokenizerTrainer:
class CharTokenizerTrainer(BaseTokenizerTrainer):
newline_str: str = "\n"
def _train(self):
...
def _get_save_settings(self):
return {"vocab_size": self.vocab_size}
SentencePieceTokenizerTrainer:
class SentencePieceTokenizerTrainer(BaseTokenizerTrainer):
vocab_size: int
character_coverage: float
pretrain_sentence_count: int
max_line_len: int
newline_str: str = "<n>"
def __init__(
self,
*,
character_coverage: float = 1.0,
pretrain_sentence_count: int = 1000000,
max_line_len: int = 2048,
**kwargs,
):
...
Import
from gretel_synthetics.tokenizers import (
BaseTokenizerTrainer,
CharTokenizerTrainer,
SentencePieceTokenizerTrainer,
)
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | BaseConfig | Yes | Configuration object providing input_data_path, training_data_path, checkpoint_dir, field_delimiter, and field_delimiter_token |
| vocab_size | int | No | Maximum vocabulary size. For CharTokenizerTrainer, None means use all unique characters. For SentencePieceTokenizerTrainer, defaults to 20000 |
| character_coverage | float | No | Fraction of characters to cover (SentencePiece only, default: 1.0) |
| pretrain_sentence_count | int | No | Number of input lines for SentencePiece to load (default: 1000000) |
| max_line_len | int | No | Maximum line length for SentencePiece input (default: 2048) |
Outputs
| Name | Type | Description |
|---|---|---|
| Annotated training file | file | A processed training data file written to config.training_data_path |
| Tokenizer model files | files | For CharTokenizerTrainer: char2idx.p and idx2char.p (cloudpickle). For SentencePieceTokenizerTrainer: m.model and m.vocab |
| tokenizer_params.json | file | JSON file containing tokenizer settings, saved in config.checkpoint_dir |
| data_iterator | Iterator[str] | A generator that yields lines from the annotated training data file (returned by annotate_data()) |
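A short sketch of consuming that iterator, assuming a trainer constructed as in the examples below:
import itertools

# Per the table above, annotate_data() writes the annotated training file
# and yields its lines, so a few can be previewed before calling train().
for line in itertools.islice(trainer.annotate_data(), 3):
    print(line)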
Usage Examples
Character Tokenizer Example
from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.tokenizers import CharTokenizerTrainer
config = TensorFlowConfig(
input_data_path="/path/to/data.txt",
checkpoint_dir="/path/to/model",
vocab_size=0,
)
trainer = CharTokenizerTrainer(config=config)
trainer.annotate_data()
trainer.train()
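To sanity-check the result, the cloudpickled mapping named in the Outputs table can be read back directly (a hedged sketch, not an official API):
import os
import cloudpickle

with open(os.path.join("/path/to/model", "char2idx.p"), "rb") as fh:
    char2idx = cloudpickle.load(fh)
print(len(char2idx), "characters in the vocabulary")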
SentencePiece Tokenizer Example
from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.tokenizers import SentencePieceTokenizerTrainer
config = TensorFlowConfig(
input_data_path="/path/to/data.csv",
checkpoint_dir="/path/to/model",
field_delimiter=",",
)
trainer = SentencePieceTokenizerTrainer(
vocab_size=20000,
character_coverage=1.0,
pretrain_sentence_count=1000000,
config=config,
)
trainer.annotate_data()
trainer.train()
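Since training delegates to Google SentencePiece, the resulting m.model file can be inspected with the sentencepiece package itself (a hedged sketch; file names per the Outputs table):
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("/path/to/model/m.model")
print(sp.GetPieceSize())  # at most the configured vocab_size (20000 here)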