Implementation: ggml-org/llama.cpp Llama Chat Apply Template

From Leeroopedia
Aspect Detail
Implementation Name Llama Chat Apply Template
Doc Type API Doc
Category Conversation Formatting
Workflow Interactive_Chat
Applies To llama.cpp
Status Active

Overview

Description

The llama_chat_apply_template function formats an array of chat messages into a text string according to a model-specific chat template. The llama_model_chat_template function retrieves the template string embedded in a model's GGUF metadata. Together, these functions enable llama.cpp applications to correctly format multi-turn conversations for any supported model family. The llama_chat_message struct defines the simple role/content pair used to represent each message in the conversation.
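
As a minimal sketch, a conversation is just an array of these role/content pairs (the strings here are illustrative):

llama_chat_message msgs[] = {
    {"system",    "You are a helpful assistant."},
    {"user",      "What is the capital of France?"},
    {"assistant", "Paris."},
    {"user",      "And of Italy?"},
};
// msgs and n_msg = 4 can be passed directly to llama_chat_apply_template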

Usage

These functions are called on every turn of the chat loop. First, llama_model_chat_template retrieves the template from the model. Then, llama_chat_apply_template formats the complete conversation history into a buffer. Only the newly appended portion, the bytes between the previous and the current formatted lengths, is extracted as the prompt for the next generation step (see the complete pattern under Usage Examples below).

Code Reference

Attribute Value
Source Location (apply_template) include/llama.h:1153-1159
Source Location (chat_template) include/llama.h:589
Source Location (chat_message struct) include/llama.h:408-411
Implementation src/llama-chat.cpp
Import #include "llama.h"

Signatures:

// The chat message struct
typedef struct llama_chat_message {
    const char * role;
    const char * content;
} llama_chat_message;

// Get the model's embedded chat template
// Returns nullptr if not available
// If name is NULL, returns the default chat template
const char * llama_model_chat_template(const struct llama_model * model, const char * name);

// Apply a chat template to format messages
// Returns the total number of bytes of the formatted prompt
// If larger than length, re-alloc buf and re-apply
int32_t llama_chat_apply_template(
    const char * tmpl,
    const struct llama_chat_message * chat,
    size_t n_msg,
    bool add_ass,
    char * buf,
    int32_t length);

// Get list of built-in chat templates
int32_t llama_chat_builtin_templates(const char ** output, size_t len);
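
A minimal sketch of enumerating the built-in templates, assuming the usual two-call pattern (the first call, with no output array, returns the total count):

// First call with a zero-length output reports how many templates exist
int32_t n_tmpl = llama_chat_builtin_templates(nullptr, 0);

// Second call fills the array with pointers to the template names
std::vector<const char *> names(n_tmpl);
llama_chat_builtin_templates(names.data(), names.size());

for (const char * name : names) {
    printf("%s\n", name);
}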

I/O Contract

llama_model_chat_template:

Direction Name Type Description
Input model const struct llama_model * The loaded model
Input name const char * Template name, or NULL for the default
Output return const char * Template string, or nullptr if not available

llama_chat_apply_template:

Direction Name Type Description
Input tmpl const char * Template string (from model or user override)
Input chat const struct llama_chat_message * Array of chat messages
Input n_msg size_t Number of messages in the array
Input add_ass bool If true, append assistant turn prefix to the output
Input/Output buf char * Buffer to write the formatted output; can be NULL for length query
Input length int32_t Size of the buffer
Output return int32_t Total bytes of formatted output; negative on error

Preconditions:

  • The tmpl string must be a recognized template name or a template string containing recognizable markers
  • Messages must have valid role and content pointers
  • If buf is NULL or length is 0, the function returns the required buffer size without writing

Postconditions:

  • If the return value exceeds length, the buffer was too small; resize and re-call
  • If the return value is negative, template application failed
  • The formatted output is not null-terminated in the general case; use the returned length

Supported templates (partial list): chatml, llama2, llama3, llama4, mistral-v1, mistral-v3, mistral-v7, phi3, phi4, deepseek, deepseek2, deepseek3, gemma, command-r, falcon3, zephyr, vicuna, openchat, chatglm3, chatglm4, granite, exaone3, and many more (see src/llama-chat.cpp for the full registry).
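
When llama_model_chat_template returns nullptr (no template embedded in the GGUF metadata), a common fallback is to pass one of these recognized names directly as tmpl; a minimal sketch:

const char * tmpl = llama_model_chat_template(model, /* name */ nullptr);
if (tmpl == nullptr) {
    // No embedded template; fall back to a built-in template name
    tmpl = "chatml";
}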

Usage Examples

Complete incremental chat template pattern (from simple-chat):

// Assumes model and ctx are already loaded; generate() stands in for
// the application's own decode loop (see the full simple-chat example)
std::vector<llama_chat_message> messages;
std::vector<char> formatted(llama_n_ctx(ctx));
int prev_len = 0;

while (true) {
    // Get user input
    std::string user;
    std::getline(std::cin, user);
    if (user.empty()) break;

    // Get the template from the model
    const char * tmpl = llama_model_chat_template(model, /* name */ nullptr);

    // Add user message and format
    messages.push_back({"user", strdup(user.c_str())});
    int new_len = llama_chat_apply_template(
        tmpl, messages.data(), messages.size(), true, formatted.data(), formatted.size());

    // Resize buffer if needed
    if (new_len > (int)formatted.size()) {
        formatted.resize(new_len);
        new_len = llama_chat_apply_template(
            tmpl, messages.data(), messages.size(), true, formatted.data(), formatted.size());
    }
    if (new_len < 0) {
        fprintf(stderr, "failed to apply the chat template\n");
        return 1;
    }

    // Extract only the new portion as the prompt
    std::string prompt(formatted.begin() + prev_len, formatted.begin() + new_len);

    // Generate response ...
    std::string response = generate(prompt);

    // Record the assistant response and update prev_len
    messages.push_back({"assistant", strdup(response.c_str())});
    prev_len = llama_chat_apply_template(
        tmpl, messages.data(), messages.size(), false, nullptr, 0);
}
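
Because each message's content is duplicated with strdup, the contents must be freed once the conversation ends; the full simple-chat example releases them after the loop:

// Free the strdup'd message contents
for (auto & msg : messages) {
    free(const_cast<char *>(msg.content));
}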

Querying required buffer size:

// Pass a NULL buffer to get the required size
int required = llama_chat_apply_template(tmpl, messages.data(), messages.size(), true, nullptr, 0);
if (required < 0) {
    fprintf(stderr, "failed to apply the chat template\n");
    return 1;
}
std::vector<char> buf(required);
llama_chat_apply_template(tmpl, messages.data(), messages.size(), true, buf.data(), buf.size());
