Overview
The tokenizers module provides a unified tokenizer abstraction supporting both tiktoken (OpenAI) and HuggingFace tokenizer backends, with lazy initialization and a factory function for easy instantiation.
Description
This module defines a BaseTokenizer abstract class with encode, decode, and count_tokens methods, and two concrete implementations. TiktokenWrapper wraps OpenAI's tiktoken library and can be initialized from a pre-built encoding, a model name, or an encoding name (defaulting to o200k_base). HuggingFaceTokenizer wraps HuggingFace's AutoTokenizer with lazy import of the transformers package, raising a clear error if it is not installed.
The module provides a lazy default tokenizer via the _LazyTokenizer class, which defers actual tokenizer creation until the first attribute access, avoiding network calls at import time. The module-level DEFAULT_TOKENIZER constant uses this lazy pattern for backwards compatibility.
The get_tokenizer factory function creates the appropriate tokenizer based on a tokenizer_type parameter ("tiktoken" or "huggingface"), making it easy to switch backends.
Usage
Use this module whenever you need to count tokens, encode text to token IDs, or decode token IDs back to text within the Ragas framework. It is used internally by metrics that need token-level operations, such as text splitting or token-based scoring.
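As one illustration of the token-level operations mentioned above, the sketch below splits text into chunks that fit a token budget. It assumes only an object exposing `encode` and `decode`; `WhitespaceTokenizer` and `split_by_tokens` are hypothetical names used for the example and are not part of Ragas.

```python
import typing as t


class WhitespaceTokenizer:
    """Toy tokenizer (hypothetical): one token per whitespace-separated word."""

    def __init__(self) -> None:
        self._vocab: t.List[str] = []
        self._ids: t.Dict[str, int] = {}

    def encode(self, text: str) -> t.List[int]:
        ids = []
        for word in text.split():
            if word not in self._ids:
                # Assign the next free ID to a word seen for the first time.
                self._ids[word] = len(self._vocab)
                self._vocab.append(word)
            ids.append(self._ids[word])
        return ids

    def decode(self, tokens: t.List[int]) -> str:
        return " ".join(self._vocab[i] for i in tokens)


def split_by_tokens(text: str, tokenizer: t.Any, max_tokens: int) -> t.List[str]:
    """Split text into chunks of at most max_tokens tokens each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    ids = tokenizer.encode(text)
    return [
        tokenizer.decode(ids[start:start + max_tokens])
        for start in range(0, len(ids), max_tokens)
    ]
```

With a real subword tokenizer, a chunk boundary can fall inside a word, so decoded chunks may begin or end with a word fragment.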
Code Reference
Source Location
Signature
class BaseTokenizer(ABC):
    @abstractmethod
    def encode(self, text: str) -> t.List[int]: ...
    @abstractmethod
    def decode(self, tokens: t.List[int]) -> str: ...
    def count_tokens(self, text: str) -> int: ...
class TiktokenWrapper(BaseTokenizer):
    def __init__(
        self,
        encoding: t.Optional[tiktoken.Encoding] = None,
        model_name: t.Optional[str] = None,
        encoding_name: t.Optional[str] = None,
    ): ...
class HuggingFaceTokenizer(BaseTokenizer):
    def __init__(
        self,
        tokenizer: t.Optional[t.Any] = None,
        model_name: t.Optional[str] = None,
    ): ...
def get_tokenizer(
    tokenizer_type: str = "tiktoken",
    model_name: t.Optional[str] = None,
    encoding_name: t.Optional[str] = None,
) -> BaseTokenizer: ...
def get_default_tokenizer() -> TiktokenWrapper: ...
Import
from ragas.tokenizers import (
    BaseTokenizer,
    TiktokenWrapper,
    HuggingFaceTokenizer,
    get_tokenizer,
    get_default_tokenizer,
    DEFAULT_TOKENIZER,
)
I/O Contract
Inputs (TiktokenWrapper.__init__)
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| encoding | tiktoken.Encoding | No | A pre-initialized tiktoken encoding object |
| model_name | str | No | Model name to get encoding for (e.g., "gpt-4", "gpt-3.5-turbo") |
| encoding_name | str | No | Encoding name (e.g., "cl100k_base", "o200k_base"); if none provided, defaults to "o200k_base" |
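Because all three arguments are optional, the constructor has to decide which one to honor. The sketch below shows one plausible resolution order (encoding, then model_name, then encoding_name, then the default). The precedence is an assumption for illustration; the source only states that o200k_base is used when nothing is provided. `resolve_encoding_source` is a hypothetical helper, not a Ragas function.

```python
import typing as t

DEFAULT_ENCODING_NAME = "o200k_base"


def resolve_encoding_source(
    encoding: t.Optional[object] = None,
    model_name: t.Optional[str] = None,
    encoding_name: t.Optional[str] = None,
) -> t.Tuple[str, t.Optional[object]]:
    """Return (source kind, value) for the first argument that is set.

    The precedence order is an assumption, not taken from the source.
    """
    if encoding is not None:
        return ("encoding", encoding)
    if model_name is not None:
        return ("model_name", model_name)
    if encoding_name is not None:
        return ("encoding_name", encoding_name)
    # Nothing was provided: fall back to the documented default.
    return ("encoding_name", DEFAULT_ENCODING_NAME)
```

In a real wrapper, the model_name branch would presumably call `tiktoken.encoding_for_model` and the encoding_name branch `tiktoken.get_encoding`; the helper above only reports which branch would be taken.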
Inputs (HuggingFaceTokenizer.__init__)
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| tokenizer | Any | No | A pre-initialized HuggingFace tokenizer instance |
| model_name | str | No | Model name or path to load tokenizer from (e.g., "meta-llama/Llama-2-7b"); one of tokenizer or model_name must be provided |
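The lazy import of an optional dependency, with a readable error when it is missing, can be sketched as below. `require_module` is a hypothetical helper written for this example, and the error wording is illustrative rather than the message Ragas actually raises.

```python
import importlib
import types


def require_module(module_name: str, install_hint: str) -> types.ModuleType:
    """Import module_name on demand, raising a readable error if missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Re-raise with an actionable message while keeping the original cause.
        raise ImportError(
            f"'{module_name}' is required for this tokenizer. "
            f"Install it with: {install_hint}"
        ) from exc
```

A wrapper could call `require_module("transformers", "pip install transformers")` inside its constructor, so that importing the tokenizers module itself never requires the optional package.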
Inputs (get_tokenizer)
| Name | Type | Required | Description |
| --- | --- | --- | --- |
| tokenizer_type | str | No | Type of tokenizer: "tiktoken" or "huggingface"; defaults to "tiktoken" |
| model_name | str | No | Model name for the tokenizer |
| encoding_name | str | No | Encoding name (only for tiktoken) |
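The dispatch performed by a factory of this shape can be sketched as follows. The stand-in classes `FakeTiktoken` and `FakeHuggingFace` and the function `make_tokenizer` are hypothetical, and the error handling (a ValueError for an unknown type or a missing model_name) is an assumption rather than documented Ragas behavior.

```python
import typing as t
from dataclasses import dataclass


@dataclass
class FakeTiktoken:
    """Stand-in for a tiktoken-backed wrapper (hypothetical)."""
    model_name: t.Optional[str] = None
    encoding_name: t.Optional[str] = None


@dataclass
class FakeHuggingFace:
    """Stand-in for a HuggingFace-backed wrapper (hypothetical)."""
    model_name: str = ""


def make_tokenizer(
    tokenizer_type: str = "tiktoken",
    model_name: t.Optional[str] = None,
    encoding_name: t.Optional[str] = None,
) -> t.Union[FakeTiktoken, FakeHuggingFace]:
    """Pick a backend by name and pass through the relevant arguments."""
    if tokenizer_type == "tiktoken":
        return FakeTiktoken(model_name=model_name, encoding_name=encoding_name)
    if tokenizer_type == "huggingface":
        # encoding_name is ignored here: it only applies to tiktoken.
        if model_name is None:
            raise ValueError("model_name is required for the huggingface backend")
        return FakeHuggingFace(model_name=model_name)
    raise ValueError(f"Unknown tokenizer_type: {tokenizer_type!r}")
```

Keeping the backend choice behind a single string parameter means calling code can switch backends through configuration without importing either wrapper class directly.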
Outputs
| Name | Type | Description |
| --- | --- | --- |
| encode return | List[int] | List of token IDs representing the input text |
| decode return | str | Text string decoded from the token IDs |
| count_tokens return | int | Number of tokens in the input text |
| get_tokenizer return | BaseTokenizer | A tokenizer instance of the requested type |
| get_default_tokenizer return | TiktokenWrapper | The default tiktoken tokenizer using o200k_base encoding |
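To make the encode/decode/count_tokens contract concrete, the sketch below implements a custom backend against a local stand-in for the interface. The `BaseTokenizer` defined here only mirrors the documented signatures so the example runs without Ragas installed; the default `count_tokens` body (length of the encoded sequence) is an assumption, and `ByteTokenizer` is a hypothetical backend.

```python
import typing as t
from abc import ABC, abstractmethod


class BaseTokenizer(ABC):
    """Local stand-in mirroring the documented interface."""

    @abstractmethod
    def encode(self, text: str) -> t.List[int]:
        ...

    @abstractmethod
    def decode(self, tokens: t.List[int]) -> str:
        ...

    def count_tokens(self, text: str) -> int:
        # Assumed default: the count is the length of the encoded sequence.
        return len(self.encode(text))


class ByteTokenizer(BaseTokenizer):
    """Hypothetical backend: one token per UTF-8 byte."""

    def encode(self, text: str) -> t.List[int]:
        return list(text.encode("utf-8"))

    def decode(self, tokens: t.List[int]) -> str:
        return bytes(tokens).decode("utf-8")


tok = ByteTokenizer()
ids = tok.encode("hi")          # [104, 105]
round_trip = tok.decode(ids)    # "hi"
n = tok.count_tokens("héllo")   # 6, because "é" occupies two bytes
```

A subclass only has to supply `encode` and `decode`; under the assumed default, `count_tokens` follows from `encode`.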
Key Classes and Functions
| Name | Description |
| --- | --- |
| BaseTokenizer | Abstract base class defining the encode/decode/count_tokens interface |
| TiktokenWrapper | Wrapper for OpenAI's tiktoken encodings with support for model-based and encoding-based initialization |
| HuggingFaceTokenizer | Wrapper for HuggingFace tokenizers with lazy import of the transformers package |
| _LazyTokenizer | Internal class that defers tokenizer creation until first attribute access, delegating all operations to get_default_tokenizer |
| DEFAULT_TOKENIZER | Module-level lazy tokenizer instance for backwards compatibility |
| get_default_tokenizer | Returns (and lazily creates) the default o200k_base tokenizer singleton |
| get_tokenizer | Factory function to create a tokenizer by type, model, and encoding |
Usage Examples
Basic Usage
from ragas.tokenizers import get_tokenizer
# Get default tiktoken tokenizer
tokenizer = get_tokenizer()
# Encode text
tokens = tokenizer.encode("Hello, world!")
print(f"Token IDs: {tokens}")
print(f"Token count: {tokenizer.count_tokens('Hello, world!')}")
# Decode back
text = tokenizer.decode(tokens)
print(f"Decoded: {text}")
Model-Specific Tokenizer
from ragas.tokenizers import get_tokenizer
# Get tiktoken for GPT-4
tokenizer = get_tokenizer("tiktoken", model_name="gpt-4")
count = tokenizer.count_tokens("This is a test sentence.")
print(f"GPT-4 token count: {count}")
HuggingFace Tokenizer
from ragas.tokenizers import get_tokenizer
# Get HuggingFace tokenizer
tokenizer = get_tokenizer("huggingface", model_name="meta-llama/Llama-2-7b")
tokens = tokenizer.encode("Ragas evaluation toolkit")
print(f"Llama token count: {len(tokens)}")
Using the Default Tokenizer
from ragas.tokenizers import DEFAULT_TOKENIZER
# Lazy initialization - no network call until first use
count = DEFAULT_TOKENIZER.count_tokens("Some text to tokenize")
print(f"Token count: {count}")
Related Pages