Principle:Romsto Speculative Decoding Input Tokenization
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing, Text_Processing |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
Input tokenization converts raw text into a sequence of integer token IDs suitable for model input. For instruction-tuned models, it also applies the model's chat template first.
Description
Input tokenization converts human-readable text into the numerical representation that transformer models consume. For instruction-tuned models (such as Llama-3.2-Instruct), this involves two stages:
- Chat template application: Wrapping the user's raw text in the model's expected conversation format (system/user/assistant role markers, special tokens). This is handled by the tokenizer's apply_chat_template method.
- Subword tokenization: Converting the formatted text string into a sequence of integer token IDs using the model's vocabulary (typically learned with a subword algorithm such as BPE, or packaged as a SentencePiece model).
Correct tokenization is essential because:
- The chat template ensures the model understands the input as a conversation turn
- The add_generation_prompt=True flag appends the assistant turn prefix so the model knows to generate a response
- The resulting token IDs must come from the same tokenizer and vocabulary the model was trained with; otherwise the IDs index the wrong embeddings
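To make the role of add_generation_prompt concrete, the following is a minimal, self-contained sketch of what a chat template does. The `<|system|>`, `<|user|>`, `<|assistant|>`, and `<|end|>` markers and the `render_chat` helper are hypothetical stand-ins chosen for illustration; real models define their own templates and special tokens, which the tokenizer's apply_chat_template method applies.

```python
# Toy chat-template renderer. The markers below are hypothetical and do not
# correspond to any real model's conversation format.
ROLE_MARKERS = {
    "system": "<|system|>",
    "user": "<|user|>",
    "assistant": "<|assistant|>",
}
END_OF_TURN = "<|end|>"


def render_chat(conversation, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts into one prompt string."""
    parts = []
    for message in conversation:
        marker = ROLE_MARKERS[message["role"]]
        parts.append(f"{marker}\n{message['content']}{END_OF_TURN}\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant
        # rather than extending the user's message.
        parts.append(f"{ROLE_MARKERS['assistant']}\n")
    return "".join(parts)


conversation = [{"role": "user", "content": "What is speculative decoding?"}]
without_prompt = render_chat(conversation)
with_prompt = render_chat(conversation, add_generation_prompt=True)
print(with_prompt)
```

In this sketch the only difference between the two renderings is the trailing assistant marker, which is the cue an instruction-tuned model relies on to start its reply.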
Usage
Apply this principle before calling any generation function (speculative_generate, ngram_assisted_speculative_generate, autoregressive_generate). Apply the chat template when using instruction-tuned models. Skip the chat template (set chat=False in the CLI) when using base models or when the prompt is already formatted.
Theoretical Basis
The tokenization pipeline:
# Abstract tokenization pipeline
def prepare_input(raw_prompt, tokenizer, use_chat_template=True):
    if use_chat_template:
        # Wrap in conversation format with role markers
        conversation = [{"role": "user", "content": raw_prompt}]
        text = tokenizer.apply_chat_template(
            conversation,
            add_generation_prompt=True,  # append assistant prefix
            tokenize=False,              # return string, not IDs
        )
    else:
        text = raw_prompt
    # Convert to token IDs (a flat Python list of ints).
    # Note: some chat templates already emit a BOS token; if so, passing
    # add_special_tokens=False to the call below avoids adding a second one.
    token_ids = tokenizer(text, return_tensors="pt").input_ids[0].tolist()
    return token_ids
The reverse operation (decoding) converts token IDs back to human-readable text:
# Abstract decoding
output_text = tokenizer.decode(token_ids, skip_special_tokens=True)
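The encode/decode round trip can be illustrated without downloading a model. The sketch below uses a hypothetical nine-entry vocabulary and greedy longest-match segmentation, which is a simplification: real BPE tokenizers apply learned merge rules rather than a longest-match scan. The `encode` and `decode` helpers and the `<bos>`/`<eos>` tokens are illustrative names, not the Hugging Face API.

```python
# Toy subword tokenizer. The vocabulary is hypothetical; real tokenizers
# learn theirs from data and use learned merges instead of longest-match.
SPECIAL_TOKENS = ["<bos>", "<eos>"]
SUBWORDS = ["token", "ization", "iz", "e", "r", "s", " "]
VOCAB = {piece: idx for idx, piece in enumerate(SPECIAL_TOKENS + SUBWORDS)}
ID_TO_PIECE = {idx: piece for piece, idx in VOCAB.items()}
SPECIAL_IDS = {VOCAB[tok] for tok in SPECIAL_TOKENS}


def encode(text):
    """Segment text greedily by longest match and prepend <bos>."""
    ids = [VOCAB["<bos>"]]
    pos = 0
    while pos < len(text):
        # Try the longest remaining candidate first, then shrink.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in SUBWORDS:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos:]!r}")
    return ids


def decode(ids, skip_special_tokens=True):
    """Map IDs back to text, optionally dropping special tokens."""
    pieces = []
    for idx in ids:
        if skip_special_tokens and idx in SPECIAL_IDS:
            continue
        pieces.append(ID_TO_PIECE[idx])
    return "".join(pieces)


ids = encode("tokenizers tokenization")
print(ids)                                     # [0, 2, 4, 5, 6, 7, 8, 2, 3]
print(decode(ids))                             # tokenizers tokenization
print(decode(ids, skip_special_tokens=False))  # <bos>tokenizers tokenization
```

Dropping special tokens on decode is what keeps markers such as the beginning-of-sequence token out of the text shown to the user, while the IDs fed to the model still contain them.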