
Implementation:Vllm project Vllm Chat Template Application

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Natural Language Processing, Prompt Engineering
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM, for converting structured chat messages into a model-specific prompt format. It wraps the Hugging Face transformers chat template system.

Description

vLLM provides two paths for applying chat templates:

  1. LLM.chat() (recommended): A high-level method that accepts a list of message dictionaries, internally applies the chat template via the renderer subsystem, and then calls generate(). This is the simplest path for chat-style inference.
  2. tokenizer.apply_chat_template() (manual): The underlying Hugging Face transformers method that converts messages into a formatted string. Users can call this directly for more control, then pass the resulting string to LLM.generate().

The LLM.chat() method handles the full pipeline: it resolves the chat template (from the model's tokenizer config or a user-supplied override), renders the messages using the Jinja2 template engine, appends the generation prompt if requested, and passes the result to the generation engine.
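To illustrate what the rendering step produces, the following pure-Python sketch mimics a Llama-3-style template. The delimiters are an assumption shown only for illustration; the real template is model-specific Jinja2 resolved from the tokenizer config.

```python
# Illustrative sketch of chat-template rendering. The special tokens below
# mimic the Llama 3 style but are NOT taken from any real template file.
def render_chat(messages, add_generation_prompt=True):
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model generates the reply next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
```

The appended assistant header at the end is exactly what `add_generation_prompt=True` controls: it cues the model to produce an assistant turn rather than continue the user's text.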

Usage

Use LLM.chat() when your input is already structured as message dictionaries. Use tokenizer.apply_chat_template() directly when you need to inspect or modify the formatted prompt before generation, or when integrating with a custom pipeline.

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py
  • Lines: 887-981 (LLM.chat method)

Signature

# High-level: LLM.chat()
def chat(
    self,
    messages: list[ChatCompletionMessageParam]
    | Sequence[list[ChatCompletionMessageParam]],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: LoRARequest | None = None,
    chat_template: str | None = None,
    chat_template_content_format: ChatTemplateContentFormatOption = "auto",
    add_generation_prompt: bool = True,
    continue_final_message: bool = False,
    tools: list[dict[str, Any]] | None = None,
    chat_template_kwargs: dict[str, Any] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
    mm_processor_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]

# Low-level: Hugging Face tokenizer
tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
)

Import

from vllm import LLM

I/O Contract

Inputs

Name | Type | Required | Description
messages | list[dict] or list[list[dict]] | Yes | Chat messages with "role" and "content" keys; a single conversation or a batch of conversations
sampling_params | SamplingParams or None | No (default: None) | Sampling configuration for generation; defaults to model defaults
chat_template | str or None | No (default: None) | Custom Jinja2 chat template; defaults to the model's built-in template
add_generation_prompt | bool | No (default: True) | Whether to append the assistant's opening delimiter to prompt generation
continue_final_message | bool | No (default: False) | If True, continues the last message instead of starting a new assistant turn
tools | list[dict] or None | No (default: None) | Tool definitions for function-calling models
chat_template_content_format | str | No (default: "auto") | Content format: "string" or "openai"
use_tqdm | bool | No (default: True) | Whether to display a progress bar
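The interplay between add_generation_prompt and continue_final_message can be sketched in plain Python. This is a simplified model of the behavior with made-up delimiters, not vLLM's actual rendering code:

```python
# Simplified model of how the two flags shape the rendered prompt.
# Delimiters are illustrative, not any specific model's template.
def render(messages, add_generation_prompt=True, continue_final_message=False):
    if add_generation_prompt and continue_final_message:
        raise ValueError("the two options are mutually exclusive")
    rendered = "".join(
        f"<{m['role']}>{m['content']}</{m['role']}>" for m in messages
    )
    if continue_final_message:
        # Strip the closing delimiter so generation continues the last turn.
        rendered = rendered[: -len(f"</{messages[-1]['role']}>")]
    elif add_generation_prompt:
        # Open a fresh assistant turn for the model to fill in.
        rendered += "<assistant>"
    return rendered

msgs = [
    {"role": "user", "content": "Write a haiku."},
    {"role": "assistant", "content": "Autumn moonlight"},
]
new_turn = render(msgs)
continued = render(msgs, add_generation_prompt=False, continue_final_message=True)
```

With the defaults, the prompt ends in an open assistant delimiter (a new turn); with continue_final_message=True, the final assistant message is left unterminated so the model extends it in place.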

Outputs

Name | Type | Description
results | list[RequestOutput] | Generated responses in the same order as the input conversations

Usage Examples

Single Conversation with LLM.chat()

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)

Batch Conversation

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

conversations = [
    [{"role": "user", "content": "Explain quantum computing."}],
    [{"role": "user", "content": "What is machine learning?"}],
    [{"role": "user", "content": "Describe the water cycle."}],
]

outputs = llm.chat(conversations, sampling_params=params)
for output in outputs:
    print(output.outputs[0].text)

Manual Template Application

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
tokenizer = llm.get_tokenizer()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Manually apply the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Pass the formatted prompt to generate()
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(prompt, sampling_params=params)
print(outputs[0].outputs[0].text)
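Tool Definitions with LLM.chat()

The tools parameter accepts OpenAI-style function definitions, which the chat template renders into the prompt for function-calling models. The schema below is a hypothetical example for illustration; the get_weather function and its fields are assumptions, and the llm.chat() call is commented out because it requires a loaded model.

```python
# An OpenAI-style tool definition, the shape accepted by the `tools`
# parameter of llm.chat(). The function name and fields are illustrative.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# With a model loaded, the tool list is forwarded to the chat template:
# outputs = llm.chat(messages, tools=[get_weather_tool])
```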

Related Pages

Implements Principle

Requires Environment
