Implementation: llama_chat_apply_template (ggml-org/llama.cpp)
| Aspect | Detail |
|---|---|
| Implementation Name | Llama Chat Apply Template |
| Doc Type | API Doc |
| Category | Conversation Formatting |
| Workflow | Interactive_Chat |
| Applies To | llama.cpp |
| Status | Active |
Overview
Description
The llama_chat_apply_template function formats an array of chat messages into a text string according to a model-specific chat template. The llama_model_chat_template function retrieves the template string embedded in a model's GGUF metadata. Together, these functions enable llama.cpp applications to correctly format multi-turn conversations for any supported model family. The llama_chat_message struct defines the simple role/content pair used to represent each message in the conversation.
Usage
These functions are called on every turn of the chat loop. First, llama_model_chat_template retrieves the template from the model. Then, llama_chat_apply_template is called to format the complete conversation history into a buffer. The incremental delta between the current and previous formatted lengths is extracted as the prompt for the next generation step.
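The incremental-delta bookkeeping described above can be sketched without linking against llama.cpp. The snippet below uses a hypothetical mock_format function (a stand-in for llama_chat_apply_template; the "role: content" layout is invented for illustration) to show how tracking the previous formatted length yields only the newly added text as the next prompt:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for llama_chat_apply_template: formats every message
// as "role: content\n". The real function applies a model-specific template.
struct Msg { std::string role, content; };

static std::string mock_format(const std::vector<Msg> & msgs) {
    std::string out;
    for (const auto & m : msgs) out += m.role + ": " + m.content + "\n";
    return out;
}

// Extract only the portion appended since the last turn, then remember
// the new total length for the next call.
static std::string next_prompt(const std::vector<Msg> & msgs, size_t & prev_len) {
    std::string full  = mock_format(msgs);
    std::string delta = full.substr(prev_len);
    prev_len = full.size();
    return delta;
}
```

The full example later in this document shows the same pattern with the real API, where prev_len is refreshed with add_ass = false after each assistant reply.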
Code Reference
| Attribute | Value |
|---|---|
| Source Location (apply_template) | include/llama.h:1153-1159 |
| Source Location (chat_template) | include/llama.h:589 |
| Source Location (chat_message struct) | include/llama.h:408-411 |
| Implementation | src/llama-chat.cpp |
| Import | #include "llama.h" |
Signatures:
// The chat message struct
typedef struct llama_chat_message {
    const char * role;
    const char * content;
} llama_chat_message;

// Get the model's embedded chat template
// Returns nullptr if not available
// If name is NULL, returns the default chat template
const char * llama_model_chat_template(const struct llama_model * model, const char * name);

// Apply a chat template to format messages
// Returns the total number of bytes of the formatted prompt
// If larger than length, re-alloc buf and re-apply
int32_t llama_chat_apply_template(
        const char * tmpl,
        const struct llama_chat_message * chat,
        size_t n_msg,
        bool add_ass,
        char * buf,
        int32_t length);

// Get the list of built-in chat templates
int32_t llama_chat_builtin_templates(const char ** output, size_t len);
I/O Contract
llama_model_chat_template:
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | model | const struct llama_model * | The loaded model |
| Input | name | const char * | Template name, or NULL for the default |
| Output | return | const char * | Template string, or nullptr if not available |
llama_chat_apply_template:
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | tmpl | const char * | Template string (from the model or a user override) |
| Input | chat | const struct llama_chat_message * | Array of chat messages |
| Input | n_msg | size_t | Number of messages in the array |
| Input | add_ass | bool | If true, append the assistant turn prefix to the output |
| Input/Output | buf | char * | Buffer for the formatted output; may be NULL for a length query |
| Input | length | int32_t | Size of the buffer |
| Output | return | int32_t | Total bytes of formatted output; negative on error |
Preconditions:
- The tmpl string must be a recognized template name or a template string containing recognizable markers
- Messages must have valid role and content pointers
- If buf is NULL or length is 0, the function returns the required buffer size without writing
Postconditions:
- If the return value exceeds length, the buffer was too small; resize and re-call
- If the return value is negative, template application failed
- The formatted output is not null-terminated in the general case; use the returned length
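The resize-and-retry contract in these postconditions can be sketched in isolation. Here mock_apply is a hypothetical function that obeys the same I/O contract (returns the total required size, writes at most length bytes, accepts a NULL buffer as a pure size query); it is not the real llama.cpp entry point:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in obeying llama_chat_apply_template's contract:
// returns the total required size and writes at most `length` bytes.
static int32_t mock_apply(const std::string & formatted, char * buf, int32_t length) {
    if (buf != nullptr && length > 0) {
        std::memcpy(buf, formatted.data(),
                    std::min<int32_t>(length, (int32_t) formatted.size()));
    }
    return (int32_t) formatted.size();
}

// Wrapper implementing the postcondition handling: grow and re-apply on
// overflow, treat a negative return as failure, use the returned length
// (the output is not null-terminated).
static std::string apply_with_resize(const std::string & formatted) {
    std::vector<char> buf(8); // deliberately small initial buffer
    int32_t n = mock_apply(formatted, buf.data(), (int32_t) buf.size());
    if (n > (int32_t) buf.size()) {
        buf.resize(n);
        n = mock_apply(formatted, buf.data(), (int32_t) buf.size());
    }
    if (n < 0) return "";
    return std::string(buf.data(), n);
}
```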
Supported templates (partial list):
chatml, llama2, llama3, llama4, mistral-v1, mistral-v3, mistral-v7, phi3, phi4, deepseek, deepseek2, deepseek3, gemma, command-r, falcon3, zephyr, vicuna, openchat, chatglm3, chatglm4, granite, exaone3, and many more (see src/llama-chat.cpp for the full registry).
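As a concrete illustration of one entry in this registry, the ChatML layout ("chatml") wraps each message in `<|im_start|>`/`<|im_end|>` markers, and add_ass appends the assistant prefix so the model continues from there. The sketch below reproduces that layout directly; the real formatting is done inside llama_chat_apply_template:

```cpp
#include <string>
#include <vector>

struct ChatMsg { std::string role, content; };

// Sketch of the ChatML layout: "<|im_start|>ROLE\nCONTENT<|im_end|>\n"
// per message, plus the assistant prefix when add_ass is true.
static std::string chatml_format(const std::vector<ChatMsg> & msgs, bool add_ass) {
    std::string out;
    for (const auto & m : msgs) {
        out += "<|im_start|>" + m.role + "\n" + m.content + "<|im_end|>\n";
    }
    if (add_ass) out += "<|im_start|>assistant\n";
    return out;
}
```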
Usage Examples
Complete incremental chat template pattern (from simple-chat):
std::vector<llama_chat_message> messages;
std::vector<char> formatted(llama_n_ctx(ctx));
int prev_len = 0;
while (true) {
    // Get user input
    std::string user;
    std::getline(std::cin, user);
    if (user.empty()) break;

    // Get the template from the model
    const char * tmpl = llama_model_chat_template(model, /* name */ nullptr);

    // Add the user message and format the complete conversation
    messages.push_back({"user", strdup(user.c_str())});
    int new_len = llama_chat_apply_template(
        tmpl, messages.data(), messages.size(), true, formatted.data(), formatted.size());

    // Resize the buffer and re-apply if it was too small
    if (new_len > (int) formatted.size()) {
        formatted.resize(new_len);
        new_len = llama_chat_apply_template(
            tmpl, messages.data(), messages.size(), true, formatted.data(), formatted.size());
    }
    if (new_len < 0) {
        fprintf(stderr, "failed to apply the chat template\n");
        return 1;
    }

    // Extract only the new portion as the prompt
    std::string prompt(formatted.begin() + prev_len, formatted.begin() + new_len);

    // Generate the response ...
    std::string response = generate(prompt);

    // Record the assistant response and update prev_len (add_ass = false,
    // so the assistant prefix is not counted in the stored length)
    messages.push_back({"assistant", strdup(response.c_str())});
    prev_len = llama_chat_apply_template(
        tmpl, messages.data(), messages.size(), false, nullptr, 0);
}
Querying required buffer size:
// Pass a NULL buffer to get the required size
int required = llama_chat_apply_template(tmpl, messages.data(), messages.size(), true, nullptr, 0);
if (required < 0) { /* handle template error */ }
std::vector<char> buf(required);
llama_chat_apply_template(tmpl, messages.data(), messages.size(), true, buf.data(), buf.size());