Implementation:Vibrantlabsai Ragas Tokenizers

From Leeroopedia
Revision as of 11:56, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Vibrantlabsai_Ragas_Tokenizers.md)
Knowledge Sources
Domains Tokenization, NLP, Evaluation
Last Updated 2026-02-12 00:00 GMT

Overview

The tokenizers module provides a unified tokenizer abstraction supporting both tiktoken (OpenAI) and HuggingFace tokenizer backends, with lazy initialization and a factory function for easy instantiation.

Description

This module defines a BaseTokenizer abstract base class and two concrete implementations. BaseTokenizer declares encode and decode as abstract methods and also provides count_tokens. TiktokenWrapper wraps OpenAI's tiktoken library and can be initialized from a pre-built encoding, a model name, or an encoding name; if none is given, it defaults to o200k_base. HuggingFaceTokenizer wraps HuggingFace's AutoTokenizer and imports the transformers package lazily, raising a clear error if it is not installed.

The module also provides a lazy default tokenizer through the _LazyTokenizer class, which defers tokenizer creation until the first attribute access and so avoids network calls at import time. The module-level DEFAULT_TOKENIZER constant uses this lazy pattern for backwards compatibility. The get_tokenizer factory function creates the appropriate tokenizer from a tokenizer_type parameter ("tiktoken" or "huggingface"), which makes it easy to switch backends.
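The interface described above can be illustrated with a minimal, self-contained sketch. It does not reproduce the Ragas source: SketchBaseTokenizer and WhitespaceTokenizer are hypothetical names, and the assumption that count_tokens is simply the length of the encoded sequence is ours, not something the page confirms.

```python
import typing as t
from abc import ABC, abstractmethod


class SketchBaseTokenizer(ABC):
    """Stand-in for the BaseTokenizer interface described above."""

    @abstractmethod
    def encode(self, text: str) -> t.List[int]:
        """Turn text into a list of token IDs."""

    @abstractmethod
    def decode(self, tokens: t.List[int]) -> str:
        """Turn token IDs back into text."""

    def count_tokens(self, text: str) -> int:
        # Assumption: the count is the length of the encoded sequence.
        return len(self.encode(text))


class WhitespaceTokenizer(SketchBaseTokenizer):
    """Hypothetical toy backend: one token per whitespace-separated word."""

    def __init__(self) -> None:
        self._word_to_id: t.Dict[str, int] = {}
        self._id_to_word: t.List[str] = []

    def encode(self, text: str) -> t.List[int]:
        ids = []
        for word in text.split():
            # Assign the next free ID the first time a word is seen.
            if word not in self._word_to_id:
                self._word_to_id[word] = len(self._id_to_word)
                self._id_to_word.append(word)
            ids.append(self._word_to_id[word])
        return ids

    def decode(self, tokens: t.List[int]) -> str:
        return " ".join(self._id_to_word[i] for i in tokens)


toy = WhitespaceTokenizer()
ids = toy.encode("the cat saw the dog")
print(ids)
print(toy.decode(ids))
print(toy.count_tokens("the cat saw the dog"))
```

Because only encode and decode are abstract in this sketch, a new backend needs to implement just those two methods to inherit a working count_tokens.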

Usage

Use this module whenever you need to count tokens, encode text to token IDs, or decode token IDs back to text within the Ragas framework. It is used internally by metrics that need token-level operations, such as text splitting or token-based scoring.
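As a rough illustration of the text-splitting use case, the sketch below chunks a string by token count using only the encode and decode methods. CharTokenizer and split_by_tokens are hypothetical helpers written for this example; they are not part of Ragas, and any object exposing the same two methods could presumably be substituted.

```python
import typing as t


class CharTokenizer:
    """Hypothetical stand-in exposing encode/decode; one token per character."""

    def encode(self, text: str) -> t.List[int]:
        return [ord(ch) for ch in text]

    def decode(self, tokens: t.List[int]) -> str:
        return "".join(chr(tok) for tok in tokens)


def split_by_tokens(text: str, tokenizer: t.Any, max_tokens: int) -> t.List[str]:
    """Split text into consecutive chunks of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    token_ids = tokenizer.encode(text)
    # Slice the ID sequence into windows and decode each window back to text.
    return [
        tokenizer.decode(token_ids[start:start + max_tokens])
        for start in range(0, len(token_ids), max_tokens)
    ]


chunks = split_by_tokens("abcdefghij", CharTokenizer(), max_tokens=4)
print(chunks)
```

With a byte-level subword tokenizer, a chunk boundary can fall inside a multi-byte character, so decoded chunks may not concatenate back to the exact original text. The toy tokenizer here avoids that by using one token per character.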

Code Reference

Source Location

Signature

import typing as t
from abc import ABC, abstractmethod

import tiktoken


class BaseTokenizer(ABC):
    @abstractmethod
    def encode(self, text: str) -> t.List[int]:
        ...

    @abstractmethod
    def decode(self, tokens: t.List[int]) -> str:
        ...

    def count_tokens(self, text: str) -> int:
        ...


class TiktokenWrapper(BaseTokenizer):
    def __init__(
        self,
        encoding: t.Optional[tiktoken.Encoding] = None,
        model_name: t.Optional[str] = None,
        encoding_name: t.Optional[str] = None,
    ):
        ...


class HuggingFaceTokenizer(BaseTokenizer):
    def __init__(
        self,
        tokenizer: t.Optional[t.Any] = None,
        model_name: t.Optional[str] = None,
    ):
        ...


def get_tokenizer(
    tokenizer_type: str = "tiktoken",
    model_name: t.Optional[str] = None,
    encoding_name: t.Optional[str] = None,
) -> BaseTokenizer:
    ...


def get_default_tokenizer() -> TiktokenWrapper:
    ...

Import

from ragas.tokenizers import (
    BaseTokenizer,
    TiktokenWrapper,
    HuggingFaceTokenizer,
    get_tokenizer,
    get_default_tokenizer,
    DEFAULT_TOKENIZER,
)

I/O Contract

Inputs (TiktokenWrapper.__init__)

Name | Type | Required | Description
encoding | tiktoken.Encoding | No | A pre-initialized tiktoken encoding object
model_name | str | No | Model name to get encoding for (e.g., "gpt-4", "gpt-3.5-turbo")
encoding_name | str | No | Encoding name (e.g., "cl100k_base", "o200k_base"); if none provided, defaults to "o200k_base"
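The three optional inputs imply some resolution order when more than one is supplied. The sketch below shows one plausible order (pre-built encoding first, then model name, then encoding name, then the default). This precedence is an assumption made for illustration and is not confirmed by this page; resolve_encoding_source is a hypothetical helper, not a Ragas function.

```python
import typing as t

DEFAULT_ENCODING_NAME = "o200k_base"  # default named in the inputs table


def resolve_encoding_source(
    encoding: t.Optional[object] = None,
    model_name: t.Optional[str] = None,
    encoding_name: t.Optional[str] = None,
) -> t.Tuple[str, object]:
    """Return (source_kind, value) describing which argument would be used.

    The precedence order below is an assumption for illustration only.
    """
    if encoding is not None:
        return ("encoding", encoding)
    if model_name is not None:
        return ("model_name", model_name)
    if encoding_name is not None:
        return ("encoding_name", encoding_name)
    # Nothing supplied: fall back to the documented default encoding name.
    return ("encoding_name", DEFAULT_ENCODING_NAME)


print(resolve_encoding_source())
print(resolve_encoding_source(model_name="gpt-4"))
print(resolve_encoding_source(model_name="gpt-4", encoding_name="cl100k_base"))
```

In tiktoken itself, a model name is typically resolved with tiktoken.encoding_for_model and an encoding name with tiktoken.get_encoding; whether the wrapper calls them in exactly this order is not stated here.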

Inputs (HuggingFaceTokenizer.__init__)

Name | Type | Required | Description
tokenizer | Any | No | A pre-initialized HuggingFace tokenizer instance
model_name | str | No | Model name or path to load tokenizer from (e.g., "meta-llama/Llama-2-7b"); one of tokenizer or model_name must be provided
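The lazy import of an optional dependency, together with the rule that one of the two inputs must be given, can be sketched as follows. This is a generic pattern rather than the Ragas implementation: import_optional and check_hf_args are hypothetical helpers, and the exception types and messages are assumptions.

```python
import importlib
import types
import typing as t


def import_optional(module_name: str, install_hint: str) -> types.ModuleType:
    """Import a module on demand, raising a descriptive error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this tokenizer but is not installed. "
            f"Try: {install_hint}"
        ) from exc


def check_hf_args(tokenizer: t.Optional[t.Any], model_name: t.Optional[str]) -> None:
    """Hypothetical validation: at least one of the two inputs must be given."""
    if tokenizer is None and model_name is None:
        raise ValueError("Provide either a tokenizer instance or a model_name")


# A module that exists is returned as usual.
json_module = import_optional("json", "(part of the standard library)")
print(json_module.dumps({"ok": True}))

# A missing module produces the descriptive error instead of a bare traceback.
try:
    import_optional("surely_not_an_installed_package_xyz", "pip install <package>")
except ImportError as err:
    print(err)
```

Deferring the import to construction time means that importing the tokenizers module does not fail for users who only need the tiktoken backend.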

Inputs (get_tokenizer)

Name | Type | Required | Description
tokenizer_type | str | No | Type of tokenizer: "tiktoken" or "huggingface"; defaults to "tiktoken"
model_name | str | No | Model name for the tokenizer
encoding_name | str | No | Encoding name (only for tiktoken)
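A factory that dispatches on tokenizer_type might look like the sketch below. The stub classes stand in for the real wrappers so the example runs without tiktoken or transformers installed. make_tokenizer, StubTiktoken, and StubHuggingFace are hypothetical names, and raising ValueError for an unknown type is an assumption rather than documented behavior.

```python
import typing as t
from dataclasses import dataclass


@dataclass
class StubTiktoken:
    """Stand-in for the tiktoken-backed wrapper."""
    model_name: t.Optional[str] = None
    encoding_name: t.Optional[str] = None


@dataclass
class StubHuggingFace:
    """Stand-in for the HuggingFace-backed wrapper."""
    model_name: t.Optional[str] = None


def make_tokenizer(
    tokenizer_type: str = "tiktoken",
    model_name: t.Optional[str] = None,
    encoding_name: t.Optional[str] = None,
) -> t.Union[StubTiktoken, StubHuggingFace]:
    """Create a tokenizer stub for the requested backend."""
    if tokenizer_type == "tiktoken":
        return StubTiktoken(model_name=model_name, encoding_name=encoding_name)
    if tokenizer_type == "huggingface":
        # encoding_name is not forwarded: the table marks it as tiktoken-only.
        return StubHuggingFace(model_name=model_name)
    # Assumption: unknown backends are rejected with a ValueError.
    raise ValueError(f"Unknown tokenizer_type: {tokenizer_type!r}")


print(make_tokenizer())
print(make_tokenizer("huggingface", model_name="some-org/some-model"))
```

Keeping the backend choice in a single string parameter lets callers switch backends through configuration without changing the code that consumes the tokenizer.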

Outputs

Name | Type | Description
encode return | List[int] | List of token IDs representing the input text
decode return | str | Text string decoded from the token IDs
count_tokens return | int | Number of tokens in the input text
get_tokenizer return | BaseTokenizer | A tokenizer instance of the requested type
get_default_tokenizer return | TiktokenWrapper | The default tiktoken tokenizer using o200k_base encoding

Key Classes and Functions

Name | Description
BaseTokenizer | Abstract base class defining the encode/decode/count_tokens interface
TiktokenWrapper | Wrapper for OpenAI's tiktoken encodings with support for model-based and encoding-based initialization
HuggingFaceTokenizer | Wrapper for HuggingFace tokenizers with lazy import of the transformers package
_LazyTokenizer | Internal class that defers tokenizer creation until first attribute access, delegating all operations to get_default_tokenizer
DEFAULT_TOKENIZER | Module-level lazy tokenizer instance for backwards compatibility
get_default_tokenizer | Returns (and lazily creates) the default o200k_base tokenizer singleton
get_tokenizer | Factory function to create a tokenizer by type, model, and encoding
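The combination of a lazily created singleton and a proxy that forwards attribute access can be sketched in a few lines. This is an illustration of the general pattern only. ExpensiveTokenizer, get_default, LazyProxy, and creation_log are hypothetical names, and the real _LazyTokenizer may differ in its details.

```python
import typing as t

creation_log: t.List[str] = []  # records each time the real object is built


class ExpensiveTokenizer:
    """Stand-in for a tokenizer whose construction is costly."""

    def __init__(self) -> None:
        creation_log.append("created")

    def count_tokens(self, text: str) -> int:
        return len(text.split())


_default: t.Optional[ExpensiveTokenizer] = None


def get_default() -> ExpensiveTokenizer:
    """Create the shared instance on first call and reuse it afterwards."""
    global _default
    if _default is None:
        _default = ExpensiveTokenizer()
    return _default


class LazyProxy:
    """Forwards every attribute lookup to the lazily created singleton."""

    def __getattr__(self, name: str) -> t.Any:
        # __getattr__ runs only when normal lookup fails, which is always the
        # case here because the proxy defines no attributes of its own.
        return getattr(get_default(), name)


DEFAULT = LazyProxy()
created_before_use = len(creation_log)  # nothing has been built yet

first = DEFAULT.count_tokens("a b c")
second = DEFAULT.count_tokens("d e")
print(created_before_use, first, second, creation_log)
```

Existing code that imports the module-level constant keeps working unchanged, while the cost of building the tokenizer is paid only by callers that actually use it.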

Usage Examples

Basic Usage

from ragas.tokenizers import get_tokenizer

# Get default tiktoken tokenizer
tokenizer = get_tokenizer()

# Encode text
tokens = tokenizer.encode("Hello, world!")
print(f"Token IDs: {tokens}")
print(f"Token count: {tokenizer.count_tokens('Hello, world!')}")

# Decode back
text = tokenizer.decode(tokens)
print(f"Decoded: {text}")

Model-Specific Tokenizer

from ragas.tokenizers import get_tokenizer

# Get tiktoken for GPT-4
tokenizer = get_tokenizer("tiktoken", model_name="gpt-4")
count = tokenizer.count_tokens("This is a test sentence.")
print(f"GPT-4 token count: {count}")

HuggingFace Tokenizer

from ragas.tokenizers import get_tokenizer

# Get a HuggingFace tokenizer (requires the transformers package)
tokenizer = get_tokenizer("huggingface", model_name="meta-llama/Llama-2-7b")
tokens = tokenizer.encode("Ragas evaluation toolkit")
print(f"Llama token count: {len(tokens)}")

Using the Default Tokenizer

from ragas.tokenizers import DEFAULT_TOKENIZER

# Lazy initialization - no network call until first use
count = DEFAULT_TOKENIZER.count_tokens("Some text to tokenize")
print(f"Token count: {count}")
