Implementation:Romsto Speculative Decoding Tokenizer Apply Chat Template

Knowledge Sources	HuggingFace Chat Templating Speculative Decoding
Domains	NLP, Preprocessing
Last Updated	2026-02-14 04:30 GMT

Overview

Documents how this repository uses the HuggingFace tokenizer's apply_chat_template, __call__, and decode methods to prepare inputs for generation and to turn generated token IDs back into text.

Description

This repository uses a two-step tokenization pipeline from HuggingFace Transformers:

1. tokenizer.apply_chat_template: With tokenize=False, formats a conversation as a string using the model's chat template (e.g., Llama 3.2's special tokens and role markers). Setting add_generation_prompt=True appends the assistant turn prefix so the model answers as the assistant.
2. tokenizer(): Converts the formatted string into token IDs. With return_tensors="pt", the returned BatchEncoding holds input_ids as a PyTorch tensor of shape [1, sequence_length].
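
To make the effect of add_generation_prompt concrete, the pure-Python sketch below mimics a Llama-3-style layout. The function name render_llama3_style is hypothetical, and the marker layout is a simplification: the real template is a Jinja string stored with the tokenizer and may add further content (for example a default system block), so treat this as an illustration rather than the tokenizer's actual output.

```python
from typing import Dict, List

# Simplified stand-in for a Llama-3-style chat template (an assumption for
# illustration; the real template is a Jinja string stored with the tokenizer).
BOS = "<|begin_of_text|>"


def render_llama3_style(
    conversation: List[Dict[str, str]],
    add_generation_prompt: bool = False,
) -> str:
    """Format role/content messages with header and end-of-turn markers."""
    parts = [BOS]
    for message in conversation:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues as the assistant.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


conversation = [{"role": "user", "content": "What is speculative decoding?"}]
without_prefix = render_llama3_style(conversation)
with_prefix = render_llama3_style(conversation, add_generation_prompt=True)
```

The only difference between the two strings is the trailing assistant header, which is what tells an instruction-tuned model that its turn has started.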

The resulting tensor has a batch dimension of 1, so the first row is extracted and converted to a Python list (.input_ids[0].tolist()) before being passed directly to the generation functions (speculative_generate, ngram_assisted_speculative_generate, autoregressive_generate).

The reverse operation, tokenizer.decode, converts generated token IDs back to readable text, with skip_special_tokens=True to omit EOS/PAD markers.
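
The effect of skip_special_tokens can be pictured without loading a model. The sketch below uses a made-up vocabulary and a hypothetical toy_decode function; it is not how tokenizer.decode is implemented, only an illustration of the filtering it performs.

```python
from typing import Dict, List, Set

# Toy vocabulary (made up for illustration; real IDs depend on the model).
ID_TO_TOKEN: Dict[int, str] = {0: "<pad>", 1: "<eos>", 2: "Hello", 3: " world"}
SPECIAL_IDS: Set[int] = {0, 1}


def toy_decode(token_ids: List[int], skip_special_tokens: bool = False) -> str:
    """Map IDs back to text, optionally dropping special tokens."""
    kept = [
        i for i in token_ids
        if not (skip_special_tokens and i in SPECIAL_IDS)
    ]
    return "".join(ID_TO_TOKEN[i] for i in kept)


generated = [2, 3, 1, 0, 0]  # text followed by EOS and padding
raw_text = toy_decode(generated)
clean_text = toy_decode(generated, skip_special_tokens=True)
```

Here raw_text still carries the EOS and PAD markers, while clean_text contains only the readable words.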

External Reference

HuggingFace Chat Templating Guide

Usage

Use at the start of any generation workflow to convert user prompts into token IDs. Use apply_chat_template for instruction-tuned models (Llama-3.2-Instruct, etc.). Use tokenizer.decode after generation to convert output IDs back to text.

Code Reference

Source Location

Repository: Speculative-Decoding
File: infer.py (usage pattern)
Lines: L268-271 (tokenization), L296/L323 (decoding)

Signature

# HuggingFace API (external); only the parameters used in this repository are shown
tokenizer.apply_chat_template(
    conversation: List[Dict[str, str]],
    add_generation_prompt: bool = False,
    tokenize: bool = True,
) -> Union[str, List[int]]

tokenizer(
    text: str,
    return_tensors: Optional[str] = None,
) -> BatchEncoding  # .input_ids gives token IDs

tokenizer.decode(
    token_ids: List[int],
    skip_special_tokens: bool = False,
) -> str

Import

from transformers import AutoTokenizer

I/O Contract

Inputs (apply_chat_template)

Name	Type	Required	Description
conversation	List[Dict]	Yes	List of {"role": "user"/"assistant"/"system", "content": str} dicts
add_generation_prompt	bool	No	True to append assistant turn prefix (default: False)
tokenize	bool	No	False to return string instead of token IDs (default: True)

Inputs (tokenizer call)

Name	Type	Required	Description
text	str	Yes	Text to tokenize
return_tensors	str	No	"pt" for PyTorch tensors (default: None, which returns plain Python lists)

Outputs

Name	Type	Description
apply_chat_template returns	str or List[int]	Formatted chat string (tokenize=False) or token IDs (tokenize=True)
tokenizer() returns	BatchEncoding	Contains .input_ids (token ID tensor) and .attention_mask
decode returns	str	Human-readable text from token IDs

Usage Examples

Full Tokenization Pipeline

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Step 1: Apply chat template
prompt = "What is speculative decoding?"
conversation = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)

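# Step 2: Tokenize to IDs. input_ids has shape [1, seq_len], so take row 0 as a list.
# Note: the templated string may already contain BOS; HuggingFace's chat templating
# guide suggests add_special_tokens=False here to avoid a duplicate special token.
inputs = tokenizer(formatted, return_tensors="pt").input_ids[0].tolist()

# ... run generation; output_ids holds the generated token IDs ...

# Step 3: Decode output
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)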

Without Chat Template (Base Model)

# For non-instruction-tuned models, skip chat template
raw_prompt = "Once upon a time"
inputs = tokenizer(raw_prompt, return_tensors="pt").input_ids[0].tolist()
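
The instruct and base-model paths can be folded into a single helper. This is a sketch, not code from the repository: prepare_input_ids is a hypothetical name, and the check assumes that a tokenizer without a chat template leaves its chat_template attribute as None.

```python
from typing import Any, List


def prepare_input_ids(tokenizer: Any, prompt: str) -> List[int]:
    """Return prompt token IDs as a plain Python list.

    Hypothetical helper: uses the chat template when the tokenizer has one,
    otherwise tokenizes the raw prompt (base-model path).
    """
    text = prompt
    if getattr(tokenizer, "chat_template", None) is not None:
        conversation = [{"role": "user", "content": prompt}]
        text = tokenizer.apply_chat_template(
            conversation,
            add_generation_prompt=True,
            tokenize=False,
        )
    # input_ids has shape [1, seq_len]; take the single row as a list.
    return tokenizer(text, return_tensors="pt").input_ids[0].tolist()
```

With a tokenizer loaded via AutoTokenizer.from_pretrained, a call such as prepare_input_ids(tokenizer, "What is speculative decoding?") would yield the list that the generation functions expect.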

Related Pages

Implements Principle

Principle:Romsto_Speculative_Decoding_Input_Tokenization
