Implementation: AllenAI Open Instruct TokenizerConfig
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A structured dataclass provided by the Open Instruct library that encapsulates all tokenizer settings needed for LLM fine-tuning.
Description
The TokenizerConfig dataclass holds all configuration needed to instantiate and use a tokenizer for LLM fine-tuning. It stores the tokenizer path, revision, chat template preference, BOS token handling, and the function used to initialize the tokenizer. The tokenizer itself is exposed as a @cached_property, meaning it is lazily loaded on first access and then cached for all subsequent uses. On initialization, it also computes file hashes of the tokenizer configuration files for cache invalidation and reproducibility tracking.
Usage
Create a TokenizerConfig instance at the start of the training pipeline and pass it to dataset preparation functions and the training loop. It is typically constructed from command-line arguments or a configuration file.
Code Reference
Source Location
- Repository: Open Instruct
- File: open_instruct/dataset_transformation.py (lines 870-908)
Signature
@dataclass
class TokenizerConfig:
    tokenizer_name_or_path: str | None = None
    tokenizer_revision: str | None = None
    trust_remote_code: bool = False
    use_fast: bool = True
    chat_template_name: str | None = None
    add_bos: bool = False
    get_tokenizer_fn: str = "get_tokenizer_tulu_v2_2"
    tokenizer_files_hash: list[str] | None = None
    use_slow_tokenizer: bool = False
    tokenizer_name: str | None = None
    ground_truths_key: str = "ground_truth"
    sft_messages_key: str = "messages"

    @cached_property
    def tokenizer(self) -> PreTrainedTokenizer:
        ...
Import
from open_instruct.dataset_transformation import TokenizerConfig
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| tokenizer_name_or_path | str or None | Yes (for tokenizer use) | Path to a pretrained tokenizer or HuggingFace model ID (e.g., "allenai/Llama-3.1-Tulu-3-8B"). |
| tokenizer_revision | str or None | No | Specific revision (branch, tag, or commit) of the tokenizer to use. |
| trust_remote_code | bool | No | Whether to trust remote code when loading the tokenizer. Defaults to False. |
| use_fast | bool | No | Whether to use the fast (Rust-based) tokenizer. Defaults to True. |
| chat_template_name | str or None | No | Name of the chat template to apply. If None, uses the tokenizer's built-in template. |
| add_bos | bool | No | Whether to explicitly add a BOS token. Defaults to False. |
| get_tokenizer_fn | str | No | Name of the function used to initialize the tokenizer. Defaults to "get_tokenizer_tulu_v2_2". |
| tokenizer_files_hash | list[str] or None | No | Hashes of the tokenizer files, auto-computed on first access. Used for cache key computation. |
| use_slow_tokenizer | bool | No | Backward-compatibility flag; ignored by the current code. |
| tokenizer_name | str or None | No | Deprecated alias for tokenizer_name_or_path. |
| ground_truths_key | str | No | Column name for ground truth data. Defaults to "ground_truth". |
| sft_messages_key | str | No | Column name for SFT message data. Defaults to "messages". |
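To illustrate how tokenizer_files_hash can drive cache invalidation, here is a hedged sketch (hash_tokenizer_files is a hypothetical helper, not the library's function): hashing the raw bytes of each tokenizer file means any edit to a file such as tokenizer_config.json yields a different cache key, so previously cached tokenized datasets are not reused with a changed tokenizer.

```python
import hashlib
import tempfile
from pathlib import Path

def hash_tokenizer_files(paths: list[str]) -> list[str]:
    # Hypothetical helper mirroring the idea: hash each file's bytes so
    # any change to the tokenizer files produces a different cache key.
    return [hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths]

# Demo on a throwaway file standing in for a tokenizer config
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write('{"bos_token": "<s>"}')
    path = f.name

h1 = hash_tokenizer_files([path])
h2 = hash_tokenizer_files([path])
assert h1 == h2  # deterministic: same bytes -> same cache key
```

Storing the hashes on the config (rather than recomputing them per use) also makes runs reproducible: two configs with identical hashes are guaranteed to have seen byte-identical tokenizer files.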
Outputs
| Name | Type | Description |
|---|---|---|
| tokenizer (property) | PreTrainedTokenizer | The fully initialized HuggingFace tokenizer instance with the configured chat template and special tokens. |
Usage Examples
Basic Usage
from open_instruct.dataset_transformation import TokenizerConfig
tc = TokenizerConfig(
tokenizer_name_or_path="allenai/Llama-3.1-Tulu-3-8B",
chat_template_name="tulu",
)
# Access the tokenizer (lazy-loaded on first access)
tokenizer = tc.tokenizer
print(tokenizer.vocab_size)
# Apply chat template
messages = [
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"},
]
token_ids = tokenizer.apply_chat_template(messages)
With Custom Settings
tc = TokenizerConfig(
tokenizer_name_or_path="meta-llama/Llama-3.1-8B",
add_bos=True,
trust_remote_code=False,
chat_template_name=None, # use tokenizer's built-in template
)