
Implementation:Allenai Open instruct TokenizerConfig

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Natural Language Processing
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete dataclass, provided by the Open Instruct library, that encapsulates tokenizer settings into a single structured configuration object.

Description

The TokenizerConfig dataclass holds all configuration needed to instantiate and use a tokenizer for LLM fine-tuning. It stores the tokenizer path, revision, chat template preference, BOS token handling, and the function used to initialize the tokenizer. The tokenizer itself is exposed as a @cached_property, meaning it is lazily loaded on first access and then cached for all subsequent uses. On initialization, it also computes file hashes of the tokenizer configuration files for cache invalidation and reproducibility tracking.

Usage

Create a TokenizerConfig instance at the start of the training pipeline and pass it to dataset preparation functions and the training loop. It is typically constructed from command-line arguments or a configuration file.
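Construction from command-line arguments can be sketched with argparse and a local stand-in dataclass (field names mirror TokenizerConfig for illustration; the real pipeline may use a different parser):

```python
import argparse
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenizerConfigArgs:
    # Local stand-in mirroring a few TokenizerConfig fields (illustrative).
    tokenizer_name_or_path: Optional[str] = None
    chat_template_name: Optional[str] = None
    add_bos: bool = False

def parse_tokenizer_config(argv: list) -> TokenizerConfigArgs:
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer_name_or_path", type=str, default=None)
    parser.add_argument("--chat_template_name", type=str, default=None)
    parser.add_argument("--add_bos", action="store_true")
    ns = parser.parse_args(argv)
    return TokenizerConfigArgs(**vars(ns))

tc = parse_tokenizer_config(
    ["--tokenizer_name_or_path", "allenai/Llama-3.1-Tulu-3-8B",
     "--chat_template_name", "tulu"]
)
```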

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/dataset_transformation.py
  • Lines: L870-908

Signature

@dataclass
class TokenizerConfig:
    tokenizer_name_or_path: str | None = None
    tokenizer_revision: str | None = None
    trust_remote_code: bool = False
    use_fast: bool = True
    chat_template_name: str | None = None
    add_bos: bool = False
    get_tokenizer_fn: str = "get_tokenizer_tulu_v2_2"
    tokenizer_files_hash: list[str] | None = None
    use_slow_tokenizer: bool = False
    tokenizer_name: str | None = None
    ground_truths_key: str = "ground_truth"
    sft_messages_key: str = "messages"

    @cached_property
    def tokenizer(self) -> PreTrainedTokenizer:
        ...

Import

from open_instruct.dataset_transformation import TokenizerConfig

I/O Contract

Inputs

Name Type Required Description
tokenizer_name_or_path str or None Yes (for tokenizer use) Path to a pretrained tokenizer or HuggingFace model ID (e.g., "allenai/Llama-3.1-Tulu-3-8B").
tokenizer_revision str or None No Specific revision (branch, tag, or commit) of the tokenizer to use.
trust_remote_code bool No Whether to trust remote code when loading the tokenizer. Defaults to False.
use_fast bool No Whether to use the fast (Rust-based) tokenizer. Defaults to True.
chat_template_name str or None No Name of the chat template to apply. If None, uses the tokenizer's built-in template.
add_bos bool No Whether to explicitly add a BOS token. Defaults to False.
get_tokenizer_fn str No Name of the function used to initialize the tokenizer. Defaults to "get_tokenizer_tulu_v2_2".
tokenizer_files_hash list[str] or None No Hash of tokenizer files, auto-computed on first access. Used for cache key computation.
use_slow_tokenizer bool No Backward-compatibility flag; ignored by the current code.
tokenizer_name str or None No Deprecated alias for tokenizer_name_or_path.
ground_truths_key str No Column name for ground truth data. Defaults to "ground_truth".
sft_messages_key str No Column name for SFT message data. Defaults to "messages".
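The tokenizer_files_hash field can be understood as a content hash over the tokenizer's configuration files; a change in any file yields a different hash, invalidating cached datasets. A minimal sketch with a hypothetical helper (not the library's actual implementation):

```python
import hashlib
import tempfile
from pathlib import Path

def hash_tokenizer_files(paths):
    # Hypothetical helper: SHA-256 digest of each config file's raw bytes.
    # Any edit to a file changes its digest, so caches keyed on these
    # hashes are invalidated automatically.
    return [hashlib.sha256(Path(p).read_bytes()).hexdigest() for p in paths]

with tempfile.TemporaryDirectory() as d:
    f = Path(d) / "tokenizer_config.json"
    f.write_text('{"add_bos_token": false}')
    before = hash_tokenizer_files([f])
    f.write_text('{"add_bos_token": true}')
    after = hash_tokenizer_files([f])
    assert before != after  # edited file produces a new hash
```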

Outputs

Name Type Description
tokenizer (property) PreTrainedTokenizer The fully initialized HuggingFace tokenizer instance with the configured chat template and special tokens.

Usage Examples

Basic Usage

from open_instruct.dataset_transformation import TokenizerConfig

tc = TokenizerConfig(
    tokenizer_name_or_path="allenai/Llama-3.1-Tulu-3-8B",
    chat_template_name="tulu",
)

# Access the tokenizer (lazy-loaded on first access)
tokenizer = tc.tokenizer
print(tokenizer.vocab_size)

# Apply chat template
messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "4"},
]
token_ids = tokenizer.apply_chat_template(messages)

With Custom Settings

tc = TokenizerConfig(
    tokenizer_name_or_path="meta-llama/Llama-3.1-8B",
    add_bos=True,
    trust_remote_code=False,
    chat_template_name=None,  # use tokenizer's built-in template
)

Related Pages

Implements Principle
