
Principle: ggml-org/llama.cpp Chat Template Application

From Leeroopedia
Principle Name: Chat Template Application
Category: Conversation Formatting
Workflow: Interactive_Chat
Applies To: llama.cpp
Status: Active

Overview

Description

Chat Template Application is the principle of formatting multi-turn conversations into the specific text format that a language model was trained to understand. Different model families use different chat template formats -- ChatML, Llama, Mistral, DeepSeek, Phi, and many others -- each with distinct special tokens and structural conventions for marking role boundaries (system, user, assistant) within a conversation. Applying the correct template is essential for the model to properly distinguish between user messages and assistant responses, and to know when it should begin generating a new response.

This is a key distinguishing feature of interactive chat versus raw text generation. Without proper template application, the model receives unstructured text and cannot maintain proper turn-taking behavior.

Usage

Chat template application occurs at two points in each turn of the conversation loop:

  1. Before generation: The full conversation history (including the new user message) is formatted with add_ass = true, which appends the assistant turn prefix. The incremental portion (everything after the previously formatted text) becomes the prompt for the generation call.
  2. After generation: The full conversation history (now including the assistant response) is formatted with add_ass = false to record the total formatted length, establishing the baseline for the next incremental extraction.

This incremental approach avoids re-tokenizing the entire conversation history on every turn, since only the new portion needs to be processed.

Theoretical Basis

Template formats: Each model family defines a chat template that wraps messages with special tokens. For example:

  • ChatML (used by many models): <|im_start|>user\nmessage<|im_end|>\n<|im_start|>assistant\n
  • Llama 3: <|start_header_id|>user<|end_header_id|>\n\nmessage<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
  • Mistral v3: [INST] message [/INST]

These templates serve multiple purposes:

  • Role identification: Special tokens mark where each participant's turn begins and ends, allowing the model to distinguish between user input and its own prior output.
  • Generation triggering: The add_ass parameter appends the beginning of an assistant turn, signaling the model to start generating a response.
  • End-of-generation detection: The model produces an end-of-generation (EOG) token when it has finished its response, which corresponds to the closing token of the template format (e.g., <|im_end|> for ChatML).

Template detection: llama.cpp supports two methods for obtaining the template:

  • Model-embedded template: Retrieved via llama_model_chat_template(model, name), which reads the template string from the model's GGUF metadata.
  • Explicit override: The caller can supply a template name string directly.

The library maintains a registry of over 40 known template formats in src/llama-chat.cpp. When a template string is provided, the library first checks the registry for an exact name match, then falls back to heuristic detection based on the presence of characteristic tokens in the template string.

Incremental formatting: The conversation is stored as a growing vector of llama_chat_message structs. Each time the template is applied, the entire conversation is formatted, but only the difference between the new formatted length and the previous formatted length is extracted as the prompt. This delta represents exactly the new content that needs to be tokenized and fed to the model, avoiding redundant processing of earlier turns.

The llama_chat_message struct:

The conversation history is represented as an array of simple structs, each containing a role string (e.g., "user", "assistant", "system") and a content string. This flat structure is template-agnostic and can be formatted by any supported template.
