Principle: vLLM Prompt Preparation
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Prompt Engineering |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
Prompt preparation is the process of converting structured conversation messages into the specific text format expected by a language model's chat template before generation.
Description
Modern instruction-tuned and chat-oriented language models are trained with specific formatting conventions that delineate system instructions, user messages, and assistant responses. These conventions vary significantly between model families (e.g., Llama 2 uses [INST]...[/INST] markers, ChatML uses <|im_start|>...<|im_end|> tags, and Mistral uses its own token-based format).
Prompt preparation bridges the gap between the application-level representation of a conversation (a list of role/content message dictionaries, following the OpenAI convention) and the raw text string that the model's tokenizer will process. The key components are:
- Chat template: A Jinja2 template stored in the tokenizer's configuration that defines how messages are formatted. Each model family ships its own template.
- Message structure: Each message is a dictionary with at minimum a role field ("system", "user", "assistant") and a content field containing the text.
- Generation prompt: An optional suffix appended after the final message to signal the model that it should begin generating a response. For example, Llama 3 templates append <|start_header_id|>assistant<|end_header_id|>.
- Multi-modal content: For vision-language models, the content field may include image or video references alongside text.
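The components above can be sketched in plain Python. The function below is an illustrative approximation of a Llama-3-style template, not any model's actual shipped template (real templates are Jinja2 programs stored in the tokenizer configuration):

```python
def apply_llama3_style_template(messages, add_generation_prompt=True):
    """Illustrative Llama-3-style formatter; a stand-in for a real chat template."""
    parts = ["<|begin_of_text|>"]
    for msg in messages:
        # Each message needs at minimum a role field and a content field.
        parts.append(
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open the assistant turn so the model knows to start generating.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = apply_llama3_style_template([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
])
```

Note how the generation prompt leaves the final assistant turn open: the model's completion fills in the content after the header.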
Usage
Use prompt preparation whenever working with instruction-tuned or chat models. If your application manages conversations as structured message lists (the common pattern for chatbots and agents), apply the chat template before passing the formatted prompt to the generation engine. vLLM's LLM.chat() method handles this automatically; use explicit template application when you need more control over the formatting.
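As a sketch, the two usage patterns might look as follows. The model name is an illustrative assumption, and actually running either function requires vLLM and suitable hardware:

```python
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "Summarize what a chat template does."},
]

def generate_via_chat(model="Qwen/Qwen2.5-7B-Instruct"):
    # Path 1: LLM.chat() applies the tokenizer's chat template automatically.
    from vllm import LLM, SamplingParams  # imported lazily; heavyweight dependency
    llm = LLM(model=model)
    outputs = llm.chat(messages, SamplingParams(temperature=0.0, max_tokens=128))
    return outputs[0].outputs[0].text

def generate_via_template(model="Qwen/Qwen2.5-7B-Instruct"):
    # Path 2: apply the template explicitly for full control, then generate
    # from the already-formatted prompt string.
    from vllm import LLM, SamplingParams
    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model)
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    llm = LLM(model=model)
    outputs = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=128))
    return outputs[0].outputs[0].text
```

The explicit path is useful when you need to inspect, log, or modify the formatted prompt before generation.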
Theoretical Basis
Chat templates implement a structured-to-flat transformation. Formally, given a conversation C = [(r_1, m_1), (r_2, m_2), ..., (r_n, m_n)] where each r_i is a role and m_i is message content, the template function T produces:
prompt = T(C, add_generation_prompt=True)
The template T is a Jinja2 program that iterates over messages and wraps each one in model-specific delimiters. The add_generation_prompt flag appends the assistant's opening delimiter so the model knows to start generating.
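The shape of T can be made concrete with a toy Jinja2 program. This is an illustrative ChatML-style template written for this example, not any particular model's shipped template:

```python
from jinja2 import Template

# Toy chat template: iterate over messages, wrap each in delimiters,
# and optionally open the assistant turn at the end.
CHAT_TEMPLATE = Template(
    "{% for m in messages %}"
    "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

conversation = [{"role": "user", "content": "Hello!"}]
prompt = CHAT_TEMPLATE.render(messages=conversation, add_generation_prompt=True)
```

Real templates follow the same loop-and-wrap structure but add model-specific handling for system messages, special tokens, and tool calls.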
Key properties of correct prompt preparation:
- Fidelity: The formatted prompt must exactly match the format used during model fine-tuning. Mismatched formatting degrades model quality.
- Special token handling: Some delimiters are special tokens (not regular text) and must be injected as token IDs rather than text strings. The tokenizer handles this distinction.
- Idempotency: Applying the template twice should not produce a doubly-wrapped prompt. The preparation step should be applied exactly once.
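The idempotency property can be enforced with a simple type-based guard. The sketch below is a hypothetical pattern (with a toy ChatML-style formatter), not a vLLM API:

```python
MARKER = "<|im_start|>"

def format_chatml(messages):
    # Toy ChatML-style formatter standing in for a real chat template.
    body = "".join(f"{MARKER}{m['role']}\n{m['content']}<|im_end|>\n" for m in messages)
    return body + f"{MARKER}assistant\n"

def prepare_prompt(conversation):
    if isinstance(conversation, str):
        # Already a formatted prompt: applying the template again would
        # double-wrap it, so pass it through unchanged.
        return conversation
    return format_chatml(conversation)

p1 = prepare_prompt([{"role": "user", "content": "Hi"}])
p2 = prepare_prompt(p1)  # second application is a no-op
```

Keeping conversations as structured message lists until the last moment, and formatting exactly once at the generation boundary, avoids double-wrapping bugs entirely.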
For multi-turn conversations, the template preserves the full conversation history, allowing the model to attend to all prior context during generation.