Implementation: llama_chat_apply_template (ggml-org/llama.cpp)
| Aspect | Detail |
|---|---|
| Implementation Name | Llama Chat Apply Template |
| Doc Type | API Doc |
| Category | Conversation Formatting |
| Workflow | Interactive_Chat |
| Applies To | llama.cpp |
| Status | Active |
Overview
Description
The llama_chat_apply_template function formats an array of chat messages into a text string according to a model-specific chat template. The llama_model_chat_template function retrieves the template string embedded in a model's GGUF metadata. Together, these functions enable llama.cpp applications to correctly format multi-turn conversations for any supported model family. The llama_chat_message struct defines the simple role/content pair used to represent each message in the conversation.
Usage
These functions are called on every turn of the chat loop. First, llama_model_chat_template retrieves the template from the model. Then, llama_chat_apply_template is called to format the complete conversation history into a buffer. The incremental delta between the current and previous formatted lengths is extracted as the prompt for the next generation step.
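The incremental-delta bookkeeping described above can be sketched without linking against llama.cpp. The snippet below uses a hypothetical mock_format function (a stand-in for llama_chat_apply_template; the "role: content" layout is invented for illustration) to show how tracking the previous formatted length yields only the newly added text as the next prompt:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for llama_chat_apply_template: formats every message
// as "role: content\n". The real function applies a model-specific template.
struct Msg { std::string role, content; };

static std::string mock_format(const std::vector<Msg> & msgs) {
    std::string out;
    for (const auto & m : msgs) out += m.role + ": " + m.content + "\n";
    return out;
}

// Extract only the portion appended since the last turn, then remember
// the new total length for the next call.
static std::string next_prompt(const std::vector<Msg> & msgs, size_t & prev_len) {
    std::string full  = mock_format(msgs);
    std::string delta = full.substr(prev_len);
    prev_len = full.size();
    return delta;
}
```

The full example later in this document shows the same pattern with the real API, where prev_len is refreshed with add_ass = false after each assistant reply.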
Code Reference
| Attribute | Value |
|---|---|
| Source Location (apply_template) | include/llama.h:1153-1159 |
| Source Location (chat_template) | include/llama.h:589 |
| Source Location (chat_message struct) | include/llama.h:408-411 |
| Implementation | src/llama-chat.cpp |
| Import | #include "llama.h" |
Signatures:
// The chat message struct
typedef struct llama_chat_message {
    const char * role;
    const char * content;
} llama_chat_message;

// Get the model's embedded chat template
// Returns nullptr if not available
// If name is NULL, returns the default chat template
const char * llama_model_chat_template(const struct llama_model * model, const char * name);

// Apply a chat template to format messages
// Returns the total number of bytes of the formatted prompt
// If larger than length, re-alloc buf and re-apply
int32_t llama_chat_apply_template(
        const char * tmpl,
        const struct llama_chat_message * chat,
        size_t n_msg,
        bool add_ass,
        char * buf,
        int32_t length);

// Get the list of built-in chat templates
int32_t llama_chat_builtin_templates(const char ** output, size_t len);
I/O Contract
llama_model_chat_template:
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | model | const struct llama_model * | The loaded model |
| Input | name | const char * | Template name, or NULL for the default |
| Output | return | const char * | Template string, or nullptr if not available |
llama_chat_apply_template:
| Direction | Name | Type | Description |
|---|---|---|---|
| Input | tmpl | const char * | Template string (from the model or a user override) |
| Input | chat | const struct llama_chat_message * | Array of chat messages |
| Input | n_msg | size_t | Number of messages in the array |
| Input | add_ass | bool | If true, append the assistant turn prefix to the output |
| Input/Output | buf | char * | Buffer for the formatted output; may be NULL for a length query |
| Input | length | int32_t | Size of the buffer |
| Output | return | int32_t | Total bytes of formatted output; negative on error |
Preconditions:
- The tmpl string must be a recognized template name or a template string containing recognizable markers
- Messages must have valid role and content pointers
- If buf is NULL or length is 0, the function returns the required buffer size without writing
Postconditions:
- If the return value exceeds length, the buffer was too small; resize and re-call
- If the return value is negative, template application failed
- The formatted output is not null-terminated in the general case; use the returned length
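The resize-and-retry contract in these postconditions can be sketched in isolation. Here mock_apply is a hypothetical function that obeys the same I/O contract (returns the total required size, writes at most length bytes, accepts a NULL buffer as a pure size query); it is not the real llama.cpp entry point:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in obeying llama_chat_apply_template's contract:
// returns the total required size and writes at most `length` bytes.
static int32_t mock_apply(const std::string & formatted, char * buf, int32_t length) {
    if (buf != nullptr && length > 0) {
        std::memcpy(buf, formatted.data(),
                    std::min<int32_t>(length, (int32_t) formatted.size()));
    }
    return (int32_t) formatted.size();
}

// Wrapper implementing the postcondition handling: grow and re-apply on
// overflow, treat a negative return as failure, use the returned length
// (the output is not null-terminated).
static std::string apply_with_resize(const std::string & formatted) {
    std::vector<char> buf(8); // deliberately small initial buffer
    int32_t n = mock_apply(formatted, buf.data(), (int32_t) buf.size());
    if (n > (int32_t) buf.size()) {
        buf.resize(n);
        n = mock_apply(formatted, buf.data(), (int32_t) buf.size());
    }
    if (n < 0) return "";
    return std::string(buf.data(), n);
}
```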
Supported templates (partial list):
chatml, llama2, llama3, llama4, mistral-v1, mistral-v3, mistral-v7, phi3, phi4, deepseek, deepseek2, deepseek3, gemma, command-r, falcon3, zephyr, vicuna, openchat, chatglm3, chatglm4, granite, exaone3, and many more (see src/llama-chat.cpp for the full registry).
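As a concrete illustration of one entry in this registry, the ChatML layout ("chatml") wraps each message in `<|im_start|>`/`<|im_end|>` markers, and add_ass appends the assistant prefix so the model continues from there. The sketch below reproduces that layout directly; the real formatting is done inside llama_chat_apply_template:

```cpp
#include <string>
#include <vector>

struct ChatMsg { std::string role, content; };

// Sketch of the ChatML layout: "<|im_start|>ROLE\nCONTENT<|im_end|>\n"
// per message, plus the assistant prefix when add_ass is true.
static std::string chatml_format(const std::vector<ChatMsg> & msgs, bool add_ass) {
    std::string out;
    for (const auto & m : msgs) {
        out += "<|im_start|>" + m.role + "\n" + m.content + "<|im_end|>\n";
    }
    if (add_ass) out += "<|im_start|>assistant\n";
    return out;
}
```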
Usage Examples
Complete incremental chat template pattern (from simple-chat):
std::vector<llama_chat_message> messages;
std::vector<char> formatted(llama_n_ctx(ctx));
int prev_len = 0;
while (true) {
    // Get user input
    std::string user;
    std::getline(std::cin, user);
    if (user.empty()) break;

    // Get the template from the model
    const char * tmpl = llama_model_chat_template(model, /* name */ nullptr);

    // Add the user message and format the complete conversation
    messages.push_back({"user", strdup(user.c_str())});
    int new_len = llama_chat_apply_template(
        tmpl, messages.data(), messages.size(), true, formatted.data(), formatted.size());

    // Resize the buffer and re-apply if it was too small
    if (new_len > (int) formatted.size()) {
        formatted.resize(new_len);
        new_len = llama_chat_apply_template(
            tmpl, messages.data(), messages.size(), true, formatted.data(), formatted.size());
    }
    if (new_len < 0) {
        fprintf(stderr, "failed to apply the chat template\n");
        return 1;
    }

    // Extract only the new portion as the prompt
    std::string prompt(formatted.begin() + prev_len, formatted.begin() + new_len);

    // Generate the response ...
    std::string response = generate(prompt);

    // Record the assistant response and update prev_len (add_ass = false,
    // so the assistant prefix is not counted in the stored length)
    messages.push_back({"assistant", strdup(response.c_str())});
    prev_len = llama_chat_apply_template(
        tmpl, messages.data(), messages.size(), false, nullptr, 0);
}
Querying required buffer size:
// Pass a NULL buffer to get the required size
int required = llama_chat_apply_template(tmpl, messages.data(), messages.size(), true, nullptr, 0);
if (required < 0) { /* handle template error */ }
std::vector<char> buf(required);
llama_chat_apply_template(tmpl, messages.data(), messages.size(), true, buf.data(), buf.size());