Implementation:Romsto Speculative Decoding Tokenizer Apply Chat Template

From Leeroopedia
Revision as of 13:49, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Romsto_Speculative_Decoding_Tokenizer_Apply_Chat_Template.md)
Knowledge Sources
Domains NLP, Preprocessing
Last Updated 2026-02-14 04:30 GMT

Overview

Documents how this repository uses the HuggingFace tokenizer's apply_chat_template, __call__, and decode methods to prepare inputs for generation and to turn generated token IDs back into text.

Description

This repository uses a two-step tokenization pipeline from HuggingFace Transformers:

  1. tokenizer.apply_chat_template: Renders a conversation through the model's chat template (e.g., Llama 3.2's special tokens and role markers). With tokenize=False it returns the formatted string instead of token IDs. The add_generation_prompt=True flag appends the assistant turn prefix so that the model continues as the assistant.
  2. tokenizer(): Converts the formatted string into token IDs. With return_tensors="pt", input_ids is a PyTorch tensor with a leading batch dimension.
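
To make step 1 concrete, the following is a simplified, hand-written stand-in for what a chat template renders. The function name render_chat and the marker strings are illustrative assumptions modeled on the Llama 3 family's format. The real template ships with the model and may add content of its own (for example a default system message), so treat this as a sketch rather than the model's actual template.

```python
from typing import Dict, List

# Marker strings modeled on the Llama 3 family; assumed for illustration only.
BOS = "<|begin_of_text|>"
HEADER_START = "<|start_header_id|>"
HEADER_END = "<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def render_chat(conversation: List[Dict[str, str]],
                add_generation_prompt: bool = False) -> str:
    """Hypothetical stand-in for apply_chat_template(..., tokenize=False)."""
    parts = [BOS]
    for message in conversation:
        # Each turn is a role header followed by the content and an end-of-turn marker.
        parts.append(f"{HEADER_START}{message['role']}{HEADER_END}\n\n")
        parts.append(f"{message['content']}{END_OF_TURN}")
    if add_generation_prompt:
        # Open an assistant turn so the model continues with its reply.
        parts.append(f"{HEADER_START}assistant{HEADER_END}\n\n")
    return "".join(parts)


conversation = [{"role": "user", "content": "What is speculative decoding?"}]
formatted = render_chat(conversation, add_generation_prompt=True)
print(formatted)
```

Without add_generation_prompt=True the rendered string ends at the user's end-of-turn marker, and the model has no cue that an assistant reply should follow.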

The first (and only) row of input_ids is converted to a Python list with input_ids[0].tolist() and passed directly to the generation functions (speculative_generate, ngram_assisted_speculative_generate, autoregressive_generate).

The reverse operation, tokenizer.decode, converts generated token IDs back to readable text. Passing skip_special_tokens=True omits special tokens such as EOS and PAD markers from the result.
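
The effect of skip_special_tokens can be illustrated with a toy decoder. The vocabulary, the set of special IDs, and the function toy_decode below are hypothetical and only mimic the behaviour. The real tokenizer.decode also merges subword pieces and handles whitespace.

```python
from typing import Dict, List, Set

# Hypothetical five-entry vocabulary; real vocabularies are far larger.
ID_TO_TOKEN: Dict[int, str] = {0: "<bos>", 1: "<eos>", 2: "Hello", 3: " world", 4: "!"}
SPECIAL_IDS: Set[int] = {0, 1}


def toy_decode(token_ids: List[int], skip_special_tokens: bool = False) -> str:
    """Map IDs back to text, optionally dropping special tokens."""
    kept = [i for i in token_ids if not (skip_special_tokens and i in SPECIAL_IDS)]
    return "".join(ID_TO_TOKEN[i] for i in kept)


output_ids = [0, 2, 3, 4, 1]
print(toy_decode(output_ids))                            # <bos>Hello world!<eos>
print(toy_decode(output_ids, skip_special_tokens=True))  # Hello world!
```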

Usage

Use this pipeline at the start of any generation workflow to convert user prompts into token IDs. Use apply_chat_template for instruction-tuned models (Llama-3.2-Instruct, etc.); for base models, tokenize the raw prompt directly. Use tokenizer.decode after generation to convert output IDs back to text.
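
The choice between the two paths can be wrapped in a small helper. This is a sketch, not code from the repository: build_prompt_ids is a hypothetical name, and it assumes the tokenizer exposes a chat_template attribute that is empty or None when the model has no template. It omits return_tensors, so a single string is expected to yield a flat list of IDs rather than a batched tensor.

```python
from typing import Any, List


def build_prompt_ids(tokenizer: Any, prompt: str) -> List[int]:
    """Hypothetical helper: return the prompt's token IDs as a plain Python list."""
    if getattr(tokenizer, "chat_template", None):
        # Instruction-tuned model: wrap the prompt in a single user turn.
        text = tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            add_generation_prompt=True,
            tokenize=False,
        )
    else:
        # Base model without a chat template: tokenize the raw prompt as-is.
        text = prompt
    # Without return_tensors, a single string gives a flat list of ints.
    return list(tokenizer(text)["input_ids"])
```

The helper relies only on apply_chat_template and the tokenizer's __call__, so it should work with any object that provides those two methods.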

Code Reference

Source Location

  • Repository: Speculative-Decoding
  • File: infer.py (usage pattern)
  • Lines: L268-271 (tokenization), L296/L323 (decoding)

Signature

# HuggingFace API (external), abridged to the parameters used in this repository
tokenizer.apply_chat_template(
    conversation: List[Dict[str, str]],
    add_generation_prompt: bool = False,
    tokenize: bool = True,
) -> Union[str, List[int]]

tokenizer(
    text: str,
    return_tensors: Optional[str] = None,
) -> BatchEncoding  # .input_ids gives token IDs

tokenizer.decode(
    token_ids: List[int],
    skip_special_tokens: bool = False,
) -> str

Import

from transformers import AutoTokenizer

I/O Contract

Inputs (apply_chat_template)

Name | Type | Required | Description
conversation | List[Dict] | Yes | List of {"role": "user"/"assistant"/"system", "content": str} dicts
add_generation_prompt | bool | No | True to append assistant turn prefix (default: False)
tokenize | bool | No | False to return string instead of token IDs (default: True)
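
A quick structural check of the conversation argument can catch malformed input before it reaches the template. The function validate_conversation below is a hypothetical helper, not part of Transformers, and it assumes only the three roles listed in the table; some chat templates accept additional roles.

```python
from typing import Any, List

# Roles taken from the table above; assumed to be the only ones in use here.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_conversation(conversation: Any) -> List[str]:
    """Return a list of problems; an empty list means the structure looks valid."""
    if not isinstance(conversation, list):
        return ["conversation must be a list"]
    problems = []
    for index, message in enumerate(conversation):
        if not isinstance(message, dict):
            problems.append(f"message {index} is not a dict")
            continue
        if message.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {index} has an unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            problems.append(f"message {index} content is not a str")
    return problems


print(validate_conversation([{"role": "user", "content": "Hi"}]))  # []
```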

Inputs (tokenizer __call__)

Name | Type | Required | Description
text | str | Yes | Text to tokenize
return_tensors | str | No | "pt" for PyTorch tensors (default: None, which returns Python lists)

Outputs

Name | Type | Description
apply_chat_template returns | str or List[int] | Formatted chat string (tokenize=False) or token IDs (tokenize=True)
tokenizer() returns | BatchEncoding | Contains .input_ids and .attention_mask (tensors when return_tensors="pt", Python lists otherwise)
decode returns | str | Human-readable text from token IDs

Usage Examples

Full Tokenization Pipeline

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")

# Step 1: Apply chat template
prompt = "What is speculative decoding?"
conversation = [{"role": "user", "content": prompt}]
formatted = tokenizer.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=False,
)

# Step 2: Tokenize to IDs
inputs = tokenizer(formatted, return_tensors="pt").input_ids[0].tolist()

# ... run generation; output_ids below stands for the list of token IDs it returns ...

# Step 3: Decode output
output_text = tokenizer.decode(output_ids, skip_special_tokens=True)
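
One detail worth checking in the two-step pipeline: a chat template may already emit the beginning-of-text token, and tokenizer() adds special tokens by default, so the resulting IDs can start with two BOS tokens. The helper below is a hypothetical, library-free sketch for detecting and collapsing that case. The name drop_duplicate_bos and the IDs in the demo are assumptions. With a real tokenizer the BOS ID is available as tokenizer.bos_token_id, and passing add_special_tokens=False in step 2 is an alternative way to avoid the duplication.

```python
from typing import List


def drop_duplicate_bos(input_ids: List[int], bos_token_id: int) -> List[int]:
    """Collapse a run of leading BOS tokens into a single one."""
    start = 0
    # Advance while the current token and the next one are both BOS.
    while (
        start + 1 < len(input_ids)
        and input_ids[start] == bos_token_id
        and input_ids[start + 1] == bos_token_id
    ):
        start += 1
    return input_ids[start:]


BOS_ID = 128000  # assumed value for the demo; read tokenizer.bos_token_id in practice
print(drop_duplicate_bos([BOS_ID, BOS_ID, 9906], BOS_ID))  # [128000, 9906]
```

Whether the duplication occurs depends on the model's template and tokenizer configuration, so inspecting the first few IDs of a real prompt is the reliable way to find out.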

Without Chat Template (Base Model)

# For non-instruction-tuned models, skip chat template
raw_prompt = "Once upon a time"
inputs = tokenizer(raw_prompt, return_tensors="pt").input_ids[0].tolist()
