Implementation:Romsto Speculative Decoding Tokenizer Apply Chat Template
| Knowledge Sources | |
|---|---|
| Domains | NLP, Preprocessing |
| Last Updated | 2026-02-14 04:30 GMT |
Overview
Documents how this repository uses the HuggingFace tokenizer's apply_chat_template, __call__, and decode methods to prepare inputs for generation and to turn generated token IDs back into text.
Description
This repository uses a two-step tokenization pipeline from HuggingFace Transformers:
- tokenizer.apply_chat_template: Renders a conversation with the model's chat template (e.g., Llama 3.2's special tokens and role markers). With tokenize=False it returns the formatted string rather than token IDs. The add_generation_prompt=True flag appends the assistant turn prefix so that the model starts a reply.
- tokenizer(): Converts the formatted string into token IDs (return_tensors="pt" produces a PyTorch tensor of shape (1, sequence_length)).
The first (and only) row of the resulting token ID tensor is converted to a Python list with .input_ids[0].tolist() and passed directly to the generation functions (speculative_generate, ngram_assisted_speculative_generate, autoregressive_generate).
The reverse operation, tokenizer.decode, converts generated token IDs back to readable text, with skip_special_tokens=True to omit EOS/PAD markers.
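The two steps and the reverse operation described above can be wrapped in a pair of small helpers. This is a minimal sketch, not the repository's code: the names build_input_ids and decode_output are hypothetical, and the tokenizer argument is assumed to be any object exposing apply_chat_template, __call__, and decode with the HuggingFace signatures. The sketch also skips return_tensors="pt", relying on the fact that encoding a single string without it yields a plain list of integers.

```python
from typing import Any, Dict, List


def build_input_ids(tokenizer: Any, prompt: str, use_chat_template: bool = True) -> List[int]:
    """Turn a user prompt into a flat list of token IDs (hypothetical helper)."""
    if use_chat_template:
        conversation: List[Dict[str, str]] = [{"role": "user", "content": prompt}]
        # Step 1: render the conversation to a string that ends with the
        # assistant turn prefix.
        text = tokenizer.apply_chat_template(
            conversation,
            add_generation_prompt=True,
            tokenize=False,
        )
    else:
        # Base models have no chat template, so the raw prompt is used as-is.
        text = prompt
    # Step 2: encode the string. Without return_tensors, a single string
    # produces a plain Python list under "input_ids".
    return list(tokenizer(text)["input_ids"])


def decode_output(tokenizer: Any, output_ids: List[int]) -> str:
    """Reverse step: token IDs back to text, dropping special tokens."""
    return tokenizer.decode(output_ids, skip_special_tokens=True)
```

With a real tokenizer loaded through AutoTokenizer.from_pretrained, build_input_ids(tokenizer, "What is speculative decoding?") would return a list that can be handed to the generation functions, assuming they accept a plain list of IDs as stated above.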
Usage
Use this pipeline at the start of any generation workflow to convert user prompts into token IDs. Apply the chat template for instruction-tuned models (Llama-3.2-Instruct, etc.); for base models, tokenize the raw prompt directly. After generation, call tokenizer.decode to convert the output IDs back to text.
Code Reference
Source Location
- Repository: Speculative-Decoding
- File: infer.py (usage pattern)
- Lines: L268-271 (tokenization), L296/L323 (decoding)
Signature
# HuggingFace API (external)
tokenizer.apply_chat_template(
    conversation: List[Dict[str, str]],
    add_generation_prompt: bool = False,
    tokenize: bool = True,
) -> Union[str, List[int]]

tokenizer(
    text: str,
    return_tensors: Optional[str] = None,
) -> BatchEncoding  # .input_ids gives token IDs

tokenizer.decode(
    token_ids: List[int],
    skip_special_tokens: bool = False,
) -> str
Import
from transformers import AutoTokenizer
I/O Contract
Inputs (apply_chat_template)
| Name | Type | Required | Description |
|---|---|---|---|
| conversation | List[Dict] | Yes | List of {"role": "user"/"assistant"/"system", "content": str} dicts |
| add_generation_prompt | bool | No | True to append assistant turn prefix (default: False) |
| tokenize | bool | No | False to return string instead of token IDs (default: True) |
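The conversation argument is an ordinary list of role/content dictionaries, so it can be assembled before the tokenizer is involved at all. The sketch below is illustrative only: make_conversation and ALLOWED_ROLES are hypothetical names, and the role check simply mirrors the three roles listed in the table above (individual chat templates may accept other roles or reject some of these).

```python
from typing import Dict, List, Optional

# Roles taken from the table above; a given chat template may differ.
ALLOWED_ROLES = {"system", "user", "assistant"}


def make_conversation(
    user_prompt: str,
    system_prompt: Optional[str] = None,
    history: Optional[List[Dict[str, str]]] = None,
) -> List[Dict[str, str]]:
    """Build a conversation list ending with the new user turn (hypothetical helper)."""
    messages: List[Dict[str, str]] = []
    if system_prompt is not None:
        # An optional system message goes first.
        messages.append({"role": "system", "content": system_prompt})
    for message in history or []:
        # Earlier turns are copied in order after a basic role check.
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        messages.append({"role": message["role"], "content": message["content"]})
    # The new user prompt is always the final turn.
    messages.append({"role": "user", "content": user_prompt})
    return messages
```

The returned list can be passed as the first argument to tokenizer.apply_chat_template; with add_generation_prompt=True the template then appends the assistant prefix after the final user turn.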
Inputs (tokenizer __call__)
| Name | Type | Required | Description |
|---|---|---|---|
| text | str | Yes | Text to tokenize |
| return_tensors | str | No | "pt" for PyTorch tensors (default: None, which returns Python lists) |
Outputs
| Name | Type | Description |
|---|---|---|
| apply_chat_template returns | str or List[int] | Formatted chat string (tokenize=False) or token IDs (tokenize=True) |
| tokenizer() returns | BatchEncoding | Contains .input_ids (token IDs) and .attention_mask; with return_tensors="pt" each is a tensor of shape (1, sequence_length) |
| decode returns | str | Human-readable text from token IDs |
Usage Examples
Full Tokenization Pipeline
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Step 1: Apply chat template (tokenize=False returns the formatted string)
prompt = "What is speculative decoding?"
conversation = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)

# Step 2: Tokenize to IDs; the tensor has shape (1, seq_len), so take row 0 as a list
inputs = tokenizer(formatted, return_tensors="pt").input_ids[0].tolist()

# ... run generation; output_ids is the list of token IDs it returns ...

# Step 3: Decode output
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
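One thing worth checking with this pipeline: many chat templates already write the beginning-of-sequence (BOS) token into the formatted string, so encoding that string with default settings may prepend a second BOS. Whether that happens depends on the model's template and tokenizer configuration, and the source does not say how infer.py handles it. The helper below is a hedged sketch for detecting and collapsing a doubled leading BOS; drop_duplicate_bos is a hypothetical name, not part of the repository or of Transformers.

```python
from typing import List


def drop_duplicate_bos(input_ids: List[int], bos_token_id: int) -> List[int]:
    """Collapse a doubled leading BOS token into one (hypothetical helper)."""
    if (
        len(input_ids) >= 2
        and input_ids[0] == bos_token_id
        and input_ids[1] == bos_token_id
    ):
        # Keep a single BOS and everything after it.
        return input_ids[1:]
    # No duplication detected: return the IDs unchanged.
    return input_ids
```

It could be called as drop_duplicate_bos(inputs, tokenizer.bos_token_id). An alternative that avoids the duplicate at the source is to pass add_special_tokens=False when tokenizing the already-templated string, which is the approach the Transformers chat templating guide suggests.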
Without Chat Template (Base Model)
# For non-instruction-tuned models, skip chat template
raw_prompt = "Once upon a time"
inputs = tokenizer(raw_prompt, return_tensors="pt").input_ids[0].tolist()