Overview
Tokenizer is a lightweight wrapper around SentencePiece for Llama2 text tokenization. It provides encode() and decode() methods with optional BOS/EOS token injection, and exposes vocabulary metadata as instance attributes (n_words for the vocabulary size, bos_id, eos_id, pad_id). This class is adapted from Meta's official Llama repository and is used in the TorchServe tensor-parallel Llama serving pipeline.
Description
The Tokenizer class wraps a SentencePiece model to provide encoding and decoding for Llama2 text generation. It is a minimal, focused utility (44 lines) that loads a .model file at construction time and validates that vocab_size() equals get_piece_size(). The encode method optionally prepends BOS and appends EOS tokens to the integer ID sequence.
Key Responsibilities
- Model Loading: Initializes SentencePieceProcessor from a model file path
- Vocabulary Metadata: Exposes n_words (vocab size), bos_id, eos_id, and pad_id as instance attributes
- Encoding: Converts string to list of integer token IDs with optional BOS/EOS wrapping
- Decoding: Converts list of integer token IDs back to string
Usage
from llama2_tokenizer import Tokenizer
tokenizer = Tokenizer(model_path="/path/to/tokenizer.model")
# Encode with BOS and EOS
token_ids = tokenizer.encode("Hello, world!", bos=True, eos=True)
# Result: [1, ..., 2] (1 = BOS, 2 = EOS)
# Decode back to string
text = tokenizer.decode(token_ids)
# Access vocabulary metadata
print(tokenizer.n_words) # e.g. 32000
print(tokenizer.bos_id) # 1
print(tokenizer.eos_id) # 2
print(tokenizer.pad_id) # -1
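The usage above shows pad_id as -1, which is the value SentencePiece reports when the model defines no padding piece. A minimal sketch of how encoded sequences might be batched under that convention follows. The helper name pad_batch and the mask layout are assumptions for illustration and are not part of llama2_tokenizer.py.

```python
from typing import List, Tuple


def pad_batch(
    seqs: List[List[int]], pad_id: int = -1
) -> Tuple[List[List[int]], List[List[bool]]]:
    """Right-pad token ID lists to a common length (hypothetical helper).

    Returns the padded batch and a mask that is True for real tokens and
    False for padding, so that pad_id (-1 here) is never used as an index.
    """
    max_len = max((len(s) for s in seqs), default=0)
    padded = [s + [pad_id] * (max_len - len(s)) for s in seqs]
    mask = [[True] * len(s) + [False] * (max_len - len(s)) for s in seqs]
    return padded, mask


# Token IDs as they might come back from tokenizer.encode(..., bos=True, eos=True)
batch, mask = pad_batch([[1, 15043, 2], [1, 2]])
```

Because -1 is not a valid vocabulary index, downstream code would typically consult the mask (or replace the padding value) before any embedding lookup.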
Code Reference
Source Location
| File | Lines | Description |
|---|---|---|
| examples/large_models/tp_llama/llama2_tokenizer.py | L1-44 | Full module (44 lines) |
| examples/large_models/tp_llama/llama2_tokenizer.py | L14-44 | Tokenizer class definition |
| examples/large_models/tp_llama/llama2_tokenizer.py | L15-33 | __init__(model_path) -- load SentencePiece model and extract metadata |
| examples/large_models/tp_llama/llama2_tokenizer.py | L35-41 | encode(s, bos, eos) -- string to token ID list |
| examples/large_models/tp_llama/llama2_tokenizer.py | L43-44 | decode(t) -- token ID list to string |
Signature
class Tokenizer:
    def __init__(self, model_path: str):
        """
        Load SentencePiece model and initialize vocabulary metadata.

        Loads the .model file via SentencePieceProcessor, then
        extracts vocab_size, bos_id, eos_id, and pad_id. Asserts
        that vocab_size() == get_piece_size().

        Args:
            model_path (str): Path to the SentencePiece .model file.
        """
        ...

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        """
        Encode a string into a list of token IDs.

        Optionally prepends BOS token ID and appends EOS token ID.

        Args:
            s (str): Input string to encode.
            bos (bool): Whether to prepend the BOS token.
            eos (bool): Whether to append the EOS token.

        Returns:
            List[int]: List of integer token IDs.
        """
        ...

    def decode(self, t: List[int]) -> str:
        """
        Decode a list of token IDs back to a string.

        Args:
            t (List[int]): List of integer token IDs.

        Returns:
            str: Decoded text string.
        """
        ...
Import
# Module imports
from logging import getLogger
from typing import List
# Runtime import inside __init__:
from sentencepiece import SentencePieceProcessor
I/O Contract
| Method | Input | Output | Notes |
|---|---|---|---|
| __init__(model_path) | str -- path to .model file | None (sets self.sp_model, self.n_words, self.bos_id, self.eos_id, self.pad_id) | Asserts vocab_size() == get_piece_size() |
| encode(s, bos, eos) | str, bool, bool | List[int] -- token IDs | BOS prepended if bos=True; EOS appended if eos=True |
| decode(t) | List[int] -- token IDs | str -- decoded text | Delegates to sp_model.decode() |
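The contract in the table can be exercised without a real .model file by substituting a stand-in for the SentencePiece processor. The sketch below is illustrative only: FakeSentencePiece and SketchTokenizer are hypothetical names, the five-piece vocabulary is invented, and SketchTokenizer takes the processor as an argument, whereas the real Tokenizer takes a model path.

```python
from typing import List


class FakeSentencePiece:
    """Hypothetical stand-in for SentencePieceProcessor (tests only)."""

    def __init__(self) -> None:
        # Invented toy vocabulary: IDs 0-2 play the role of special pieces.
        self._vocab = ["<unk>", "<s>", "</s>", "Hello", "world"]

    def vocab_size(self) -> int:
        return len(self._vocab)

    def get_piece_size(self) -> int:
        return len(self._vocab)

    def bos_id(self) -> int:
        return 1

    def eos_id(self) -> int:
        return 2

    def pad_id(self) -> int:
        return -1

    def encode(self, s: str) -> List[int]:
        # Whitespace split; unknown words map to ID 0.
        return [self._vocab.index(w) if w in self._vocab else 0 for w in s.split()]

    def decode(self, t: List[int]) -> str:
        # This fake simply drops IDs 0-2 when rebuilding the text.
        return " ".join(self._vocab[i] for i in t if i > 2)


class SketchTokenizer:
    """Follows the documented I/O contract around an injected processor."""

    def __init__(self, sp_model) -> None:
        self.sp_model = sp_model
        self.n_words: int = sp_model.vocab_size()
        self.bos_id: int = sp_model.bos_id()
        self.eos_id: int = sp_model.eos_id()
        self.pad_id: int = sp_model.pad_id()
        assert sp_model.vocab_size() == sp_model.get_piece_size()

    def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
        t = self.sp_model.encode(s)
        if bos:
            t = [self.bos_id] + t
        if eos:
            t = t + [self.eos_id]
        return t

    def decode(self, t: List[int]) -> str:
        return self.sp_model.decode(t)


tok = SketchTokenizer(FakeSentencePiece())
wrapped = tok.encode("Hello world", bos=True, eos=True)
plain = tok.encode("Hello world", bos=False, eos=False)
```

Injecting the processor keeps the BOS/EOS wrapping logic testable in isolation; the token IDs produced here reflect the toy vocabulary, not a real Llama2 model.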
Usage Examples
Example 1: Initialization and Metadata
# From llama2_tokenizer.py L15-33
class Tokenizer:
    def __init__(self, model_path: str):
        from sentencepiece import SentencePieceProcessor

        self.sp_model = SentencePieceProcessor(model_file=model_path)

        # BOS / EOS token IDs
        self.n_words: int = self.sp_model.vocab_size()
        self.bos_id: int = self.sp_model.bos_id()
        self.eos_id: int = self.sp_model.eos_id()
        self.pad_id: int = self.sp_model.pad_id()
        assert self.sp_model.vocab_size() == self.sp_model.get_piece_size()
Example 2: Encoding with BOS/EOS
# From llama2_tokenizer.py L35-41
def encode(self, s: str, bos: bool, eos: bool) -> List[int]:
    t = self.sp_model.encode(s)
    if bos:
        t = [self.bos_id] + t
    if eos:
        t = t + [self.eos_id]
    return t

# Usage:
tokenizer = Tokenizer("tokenizer.model")
ids_with_special = tokenizer.encode("Hello", bos=True, eos=True)
# [1, 15043, 2]
ids_without_special = tokenizer.encode("Hello", bos=False, eos=False)
# [15043]
Example 3: Decoding
# From llama2_tokenizer.py L43-44
def decode(self, t: List[int]) -> str:
    return self.sp_model.decode(t)

# Usage:
text = tokenizer.decode([15043])
# "Hello"
Related Pages