Implementation: AllenAI Open Instruct TokenizerConfig
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A structured dataclass provided by the Open Instruct library that encapsulates all tokenizer settings needed for LLM fine-tuning.
Description
The TokenizerConfig dataclass holds all configuration needed to instantiate and use a tokenizer for LLM fine-tuning. It stores the tokenizer path, revision, chat template preference, BOS token handling, and the function used to initialize the tokenizer. The tokenizer itself is exposed as a @cached_property, meaning it is lazily loaded on first access and then cached for all subsequent uses. On initialization, it also computes file hashes of the tokenizer configuration files for cache invalidation and reproducibility tracking.
Usage
Create a TokenizerConfig instance at the start of the training pipeline and pass it to dataset preparation functions and the training loop. It is typically constructed from command-line arguments or a configuration file.
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/dataset_transformation.py (lines 870-908)
Signature
@dataclass
class TokenizerConfig:
    tokenizer_name_or_path: str | None = None
    tokenizer_revision: str | None = None
    trust_remote_code: bool = False
    use_fast: bool = True
    chat_template_name: str | None = None
    add_bos: bool = False
    get_tokenizer_fn: str = "get_tokenizer_tulu_v2_2"
    tokenizer_files_hash: list[str] | None = None
    use_slow_tokenizer: bool = False
    tokenizer_name: str | None = None
    ground_truths_key: str = "ground_truth"
    sft_messages_key: str = "messages"

    @cached_property
    def tokenizer(self) -> PreTrainedTokenizer:
        ...
Import
from open_instruct.dataset_transformation import TokenizerConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer_name_or_path | str or None | Yes (for tokenizer use) | Path to a pretrained tokenizer or HuggingFace model ID (e.g., "allenai/Llama-3.1-Tulu-3-8B"). |
| tokenizer_revision | str or None | No | Specific revision (branch, tag, or commit) of the tokenizer to use. |
| trust_remote_code | bool | No | Whether to trust remote code when loading the tokenizer. Defaults to False. |
| use_fast | bool | No | Whether to use the fast (Rust-based) tokenizer. Defaults to True. |
| chat_template_name | str or None | No | Name of the chat template to apply. If None, uses the tokenizer's built-in template. |
| add_bos | bool | No | Whether to explicitly add a BOS token. Defaults to False. |
| get_tokenizer_fn | str | No | Name of the function used to initialize the tokenizer. Defaults to "get_tokenizer_tulu_v2_2". |
| tokenizer_files_hash | list[str] or None | No | Hashes of the tokenizer files, auto-computed on first access. Used for cache key computation. |
| use_slow_tokenizer | bool | No | Backward-compatibility flag; ignored by the current code. |
| tokenizer_name | str or None | No | Deprecated alias for tokenizer_name_or_path. |
| ground_truths_key | str | No | Column name for ground truth data. Defaults to "ground_truth". |
| sft_messages_key | str | No | Column name for SFT message data. Defaults to "messages". |
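To illustrate how tokenizer_files_hash can drive cache invalidation, here is a hedged sketch (hash_tokenizer_files is a hypothetical helper, not the library's function): hashing the raw bytes of each tokenizer file means any edit to a file such as tokenizer_config.json yields a different cache key, so previously cached tokenized datasets are not reused with a changed tokenizer.

```python
import hashlib
import tempfile
from pathlib import Path

def hash_tokenizer_files(paths: list[str]) -> list[str]:
    # Hypothetical helper mirroring the idea: hash each file's bytes so
    # any change to the tokenizer files produces a different cache key.
    return [hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths]

# Demo on a throwaway file standing in for a tokenizer config
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write('{"bos_token": "<s>"}')
    path = f.name

h1 = hash_tokenizer_files([path])
h2 = hash_tokenizer_files([path])
assert h1 == h2  # deterministic: same bytes -> same cache key
```

Storing the hashes on the config (rather than recomputing them per use) also makes runs reproducible: two configs with identical hashes are guaranteed to have seen byte-identical tokenizer files.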
Outputs
| Name | Type | Description |
|---|---|---|
| tokenizer (property) | PreTrainedTokenizer | The fully initialized HuggingFace tokenizer instance with the configured chat template and special tokens. |
Usage Examples
Basic Usage
from open_instruct.dataset_transformation import TokenizerConfig
tc = TokenizerConfig(
tokenizer_name_or_path="allenai/Llama-3.1-Tulu-3-8B",
chat_template_name="tulu",
)
# Access the tokenizer (lazy-loaded on first access)
tokenizer = tc.tokenizer
print(tokenizer.vocab_size)
# Apply chat template
messages = [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"},
]
token_ids = tokenizer.apply_chat_template(messages)
With Custom Settings
tc = TokenizerConfig(
tokenizer_name_or_path="meta-llama/Llama-3.1-8B",
add_bos=True,
trust_remote_code=False,
chat_template_name=None, # use tokenizer's built-in template
)