Implementation:Vibrantlabsai Ragas Tokenizers

Knowledge Sources	Vibrantlabsai_Ragas
Domains	Tokenization, NLP, Evaluation
Last Updated	2026-02-12 00:00 GMT

Overview

The tokenizers module provides a unified tokenizer abstraction supporting both tiktoken (OpenAI) and HuggingFace tokenizer backends, with lazy initialization and a factory function for easy instantiation.

Description

This module defines a BaseTokenizer abstract class with encode, decode, and count_tokens methods, plus two concrete implementations. TiktokenWrapper wraps OpenAI's tiktoken library and can be initialized from a pre-built encoding, a model name, or an encoding name (defaulting to o200k_base). HuggingFaceTokenizer wraps HuggingFace's AutoTokenizer and imports the transformers package lazily, raising a clear error if it is not installed.

A default tokenizer is provided through the _LazyTokenizer class, which defers tokenizer creation until the first attribute access and so avoids network calls at import time. The module-level DEFAULT_TOKENIZER constant uses this lazy pattern and is kept for backwards compatibility.

The get_tokenizer factory function creates the appropriate tokenizer based on a tokenizer_type parameter ("tiktoken" or "huggingface"), making it easy to switch backends.
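The abstract interface described above can be illustrated with a minimal sketch. This is not the Ragas source: SketchTokenizer and WhitespaceTokenizer are hypothetical names, and the default count_tokens body (the length of the encoded sequence) is an assumption about how such a method would typically be written.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SketchTokenizer(ABC):
    """Hypothetical stand-in for the encode/decode/count_tokens interface."""

    @abstractmethod
    def encode(self, text: str) -> List[int]:
        """Turn text into a list of integer token IDs."""

    @abstractmethod
    def decode(self, tokens: List[int]) -> str:
        """Turn token IDs back into text."""

    def count_tokens(self, text: str) -> int:
        # Assumption: counting is the length of the encoded sequence.
        return len(self.encode(text))


class WhitespaceTokenizer(SketchTokenizer):
    """Toy backend: one token per whitespace-separated word."""

    def __init__(self) -> None:
        self._word_to_id: Dict[str, int] = {}
        self._id_to_word: List[str] = []

    def encode(self, text: str) -> List[int]:
        ids = []
        for word in text.split():
            if word not in self._word_to_id:
                # Assign the next free ID to a word seen for the first time.
                self._word_to_id[word] = len(self._id_to_word)
                self._id_to_word.append(word)
            ids.append(self._word_to_id[word])
        return ids

    def decode(self, tokens: List[int]) -> str:
        return " ".join(self._id_to_word[i] for i in tokens)


toy = WhitespaceTokenizer()
ids = toy.encode("ragas counts tokens")
print(ids, toy.decode(ids), toy.count_tokens("ragas counts tokens"))
```

Because only encode and decode are abstract, a new backend in this sketch inherits count_tokens without extra code.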

Usage

Use this module whenever you need to count tokens, encode text to token IDs, or decode token IDs back to text within the Ragas framework. It is used internally by metrics that need token-level operations, such as text splitting or token-based scoring.
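As an illustration of the text-splitting use case mentioned above, the following sketch chunks text by a token budget using any object that offers encode and decode. The function split_by_token_budget and the CharTokenizer toy backend are hypothetical and are not part of Ragas; a real tokenizer from this module could be passed in the same way, assuming it exposes the same two methods.

```python
from typing import List, Protocol


class SupportsTokenizing(Protocol):
    """Structural type: anything with encode and decode methods."""

    def encode(self, text: str) -> List[int]: ...

    def decode(self, tokens: List[int]) -> str: ...


def split_by_token_budget(
    text: str, tokenizer: SupportsTokenizing, max_tokens: int
) -> List[str]:
    """Split text into chunks of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    ids = tokenizer.encode(text)
    # Slice the token IDs into fixed-size windows and decode each window.
    return [
        tokenizer.decode(ids[start:start + max_tokens])
        for start in range(0, len(ids), max_tokens)
    ]


class CharTokenizer:
    """Toy backend used only for this demo: one token per character."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def decode(self, tokens: List[int]) -> str:
        return "".join(chr(t) for t in tokens)


chunks = split_by_token_budget("abcdefg", CharTokenizer(), max_tokens=3)
print(chunks)
```

With byte-level BPE tokenizers, a slice boundary may fall inside a multi-byte character, so decoded chunks can contain replacement characters at their edges; a production splitter would likely need to handle that case.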

Code Reference

Source Location

Repository: Vibrantlabsai_Ragas
File: src/ragas/tokenizers.py

Signature

class BaseTokenizer(ABC):
    @abstractmethod
    def encode(self, text: str) -> t.List[int]: ...
    @abstractmethod
    def decode(self, tokens: t.List[int]) -> str: ...
    def count_tokens(self, text: str) -> int: ...

class TiktokenWrapper(BaseTokenizer):
    def __init__(
        self,
        encoding: t.Optional[tiktoken.Encoding] = None,
        model_name: t.Optional[str] = None,
        encoding_name: t.Optional[str] = None,
    ): ...

class HuggingFaceTokenizer(BaseTokenizer):
    def __init__(
        self,
        tokenizer: t.Optional[t.Any] = None,
        model_name: t.Optional[str] = None,
    ): ...

def get_tokenizer(
    tokenizer_type: str = "tiktoken",
    model_name: t.Optional[str] = None,
    encoding_name: t.Optional[str] = None,
) -> BaseTokenizer: ...

def get_default_tokenizer() -> TiktokenWrapper: ...

Import

from ragas.tokenizers import (
    BaseTokenizer,
    TiktokenWrapper,
    HuggingFaceTokenizer,
    get_tokenizer,
    get_default_tokenizer,
    DEFAULT_TOKENIZER,
)

I/O Contract

Inputs (TiktokenWrapper.__init__)

Name	Type	Required	Description
encoding	tiktoken.Encoding	No	A pre-initialized tiktoken encoding object
model_name	str	No	Model name to get encoding for (e.g., "gpt-4", "gpt-3.5-turbo")
encoding_name	str	No	Encoding name (e.g., "cl100k_base", "o200k_base"); if encoding, model_name, and encoding_name are all omitted, "o200k_base" is used

Inputs (HuggingFaceTokenizer.__init__)

Name	Type	Required	Description
tokenizer	Any	No	A pre-initialized HuggingFace tokenizer instance
model_name	str	No	Model name or path to load tokenizer from (e.g., "meta-llama/Llama-2-7b"); one of tokenizer or model_name must be provided
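The lazy import and the "one of tokenizer or model_name" rule can be sketched with the standard library alone. The helpers import_optional and pick_hf_source are hypothetical names, not Ragas functions. The error wording, the exception types, and the choice to prefer a pre-built tokenizer over model_name when both are given are all assumptions made for illustration.

```python
import importlib
from types import ModuleType
from typing import Any, Optional


def import_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import a module only when needed, with an actionable error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this tokenizer backend. "
            f"Install it with: {install_hint}"
        ) from exc


def pick_hf_source(
    tokenizer: Optional[Any] = None, model_name: Optional[str] = None
) -> str:
    """Report which argument would be used to build the tokenizer."""
    if tokenizer is not None:
        # Assumption: a pre-built instance takes precedence over model_name.
        return "instance"
    if model_name is not None:
        return "model_name"
    raise ValueError("Either tokenizer or model_name must be provided")


# A standard-library module stands in for an optional dependency here.
json_module = import_optional("json", "n/a (standard library)")
print(json_module.__name__, pick_hf_source(model_name="some/model"))
```

Deferring the import to the constructor means that users who only need the tiktoken backend do not have to install transformers.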

Inputs (get_tokenizer)

Name	Type	Required	Description
tokenizer_type	str	No	Type of tokenizer: "tiktoken" or "huggingface"; defaults to "tiktoken"
model_name	str	No	Model name for the tokenizer
encoding_name	str	No	Encoding name (only for tiktoken)
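A factory of this shape usually dispatches on the type string. The sketch below mirrors the get_tokenizer parameters but returns a plain description object instead of a real tokenizer, so it runs without tiktoken or transformers. BackendSpec and sketch_get_tokenizer are hypothetical names, and raising ValueError for an unknown type or a missing HuggingFace model name is an assumption rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackendSpec:
    """Hypothetical record of which backend would be built, and from what."""

    backend: str
    model_name: Optional[str] = None
    encoding_name: Optional[str] = None


def sketch_get_tokenizer(
    tokenizer_type: str = "tiktoken",
    model_name: Optional[str] = None,
    encoding_name: Optional[str] = None,
) -> BackendSpec:
    """Dispatch on tokenizer_type, forwarding only the relevant arguments."""
    if tokenizer_type == "tiktoken":
        return BackendSpec("tiktoken", model_name, encoding_name)
    if tokenizer_type == "huggingface":
        if model_name is None:
            # Assumption: without a pre-built tokenizer, a model name is needed.
            raise ValueError("model_name is required for the huggingface backend")
        # encoding_name applies only to tiktoken, so it is not forwarded.
        return BackendSpec("huggingface", model_name)
    raise ValueError(f"Unknown tokenizer_type: {tokenizer_type!r}")


print(sketch_get_tokenizer())
print(sketch_get_tokenizer("huggingface", model_name="some/model"))
```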

Outputs

Name	Type	Description
encode return	List[int]	List of token IDs representing the input text
decode return	str	Text string decoded from the token IDs
count_tokens return	int	Number of tokens in the input text
get_tokenizer return	BaseTokenizer	A tokenizer instance of the requested type
get_default_tokenizer return	TiktokenWrapper	The default tiktoken tokenizer using o200k_base encoding

Key Classes and Functions

Name	Description
BaseTokenizer	Abstract base class defining the encode/decode/count_tokens interface
TiktokenWrapper	Wrapper for OpenAI's tiktoken encodings with support for model-based and encoding-based initialization
HuggingFaceTokenizer	Wrapper for HuggingFace tokenizers with lazy import of the transformers package
_LazyTokenizer	Internal class that defers tokenizer creation until first attribute access, delegating all operations to the tokenizer returned by get_default_tokenizer
DEFAULT_TOKENIZER	Module-level lazy tokenizer instance for backwards compatibility
get_default_tokenizer	Returns (and lazily creates) the default o200k_base tokenizer singleton
get_tokenizer	Factory function to create a tokenizer by type, model, and encoding
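The relationship between the lazy class, the module-level constant, and the singleton getter can be sketched as follows. This is a generic lazy-proxy pattern under stated assumptions, not the Ragas implementation: ExpensiveTokenizer, get_default, LazyProxy, and DEFAULT are hypothetical stand-ins, and the use of __getattr__ for delegation is an assumption based on the phrase "first attribute access".

```python
from typing import Any, List, Optional


class ExpensiveTokenizer:
    """Stand-in for a tokenizer whose construction is costly."""

    created = 0  # counts how many instances have been built

    def __init__(self) -> None:
        ExpensiveTokenizer.created += 1

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]

    def count_tokens(self, text: str) -> int:
        return len(self.encode(text))


_default: Optional[ExpensiveTokenizer] = None


def get_default() -> ExpensiveTokenizer:
    """Create the singleton on first call, then keep returning it."""
    global _default
    if _default is None:
        _default = ExpensiveTokenizer()
    return _default


class LazyProxy:
    """Forwards attribute lookups to the singleton, building it on demand."""

    def __getattr__(self, name: str) -> Any:
        # __getattr__ runs only when normal lookup fails, i.e. for every
        # tokenizer method, since the proxy defines none of its own.
        return getattr(get_default(), name)


DEFAULT = LazyProxy()  # cheap: nothing has been constructed yet
print("instances before first use:", ExpensiveTokenizer.created)
print("token count:", DEFAULT.count_tokens("abc"))
print("instances after first use:", ExpensiveTokenizer.created)
```

Creating the proxy at import time costs nothing; the expensive object is built only when a method is first looked up, and later calls reuse it.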

Usage Examples

Basic Usage

from ragas.tokenizers import get_tokenizer

# Get default tiktoken tokenizer
tokenizer = get_tokenizer()

# Encode text
tokens = tokenizer.encode("Hello, world!")
print(f"Token IDs: {tokens}")
print(f"Token count: {tokenizer.count_tokens('Hello, world!')}")

# Decode back
text = tokenizer.decode(tokens)
print(f"Decoded: {text}")

Model-Specific Tokenizer

from ragas.tokenizers import get_tokenizer

# Get tiktoken for GPT-4
tokenizer = get_tokenizer("tiktoken", model_name="gpt-4")
count = tokenizer.count_tokens("This is a test sentence.")
print(f"GPT-4 token count: {count}")

HuggingFace Tokenizer

from ragas.tokenizers import get_tokenizer

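# Get HuggingFace tokenizer (requires the transformers package).
# Note: Llama 2 repositories on the Hugging Face Hub are gated, so loading needs an approved access token.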
tokenizer = get_tokenizer("huggingface", model_name="meta-llama/Llama-2-7b")
tokens = tokenizer.encode("Ragas evaluation toolkit")
print(f"Llama token count: {len(tokens)}")

Using the Default Tokenizer

from ragas.tokenizers import DEFAULT_TOKENIZER

# Lazy initialization - no network call until first use
count = DEFAULT_TOKENIZER.count_tokens("Some text to tokenize")
print(f"Token count: {count}")

Related Pages

Environment:Vibrantlabsai_Ragas_Python_3_9_Core_Environment
