Implementation: vLLM Chat Template Application
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Natural Language Processing, Prompt Engineering |
| Last Updated | 2026-02-08 13:00 GMT |
Overview
vLLM's concrete mechanism for converting structured chat messages into a model-specific prompt format, wrapping the Hugging Face transformers chat template system.
Description
vLLM provides two paths for applying chat templates:
- LLM.chat() (recommended): A high-level method that accepts a list of message dictionaries, internally applies the chat template via the renderer subsystem, and then calls generate(). This is the simplest path for chat-style inference.
- tokenizer.apply_chat_template() (manual): The underlying Hugging Face transformers method that converts messages into a formatted string. Users can call this directly for more control, then pass the resulting string to LLM.generate().
The LLM.chat() method handles the full pipeline: it resolves the chat template (from the model's tokenizer config or a user-supplied override), renders the messages using the Jinja2 template engine, appends the generation prompt if requested, and passes the result to the generation engine.
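The rendering step above can be sketched in isolation. The template below is a deliberately minimal, made-up example (not any model's real chat template, which ships in the tokenizer config), but it shows the same mechanics: a Jinja2 loop over the messages, role-tagged delimiters, and an opening assistant turn appended when `add_generation_prompt` is true.

```python
from jinja2 import Template

# Illustrative chat template (hypothetical delimiters, not a real model's):
# each message becomes "<|role|>\n<content>\n", and an empty assistant
# header is appended when add_generation_prompt is set.
TEMPLATE = Template(
    "{% for m in messages %}"
    "<|{{ m.role }}|>\n{{ m.content }}\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|assistant|>\n{% endif %}"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

prompt = TEMPLATE.render(messages=messages, add_generation_prompt=True)
print(prompt)
# <|system|>
# You are a helpful assistant.
# <|user|>
# Hello!
# <|assistant|>
```

Real templates differ in their delimiters and in extras such as BOS tokens and tool-call formatting, but the control flow is the same Jinja2 render that LLM.chat() performs internally.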
Usage
Use LLM.chat() when your input is already structured as message dictionaries. Use tokenizer.apply_chat_template() directly when you need to inspect or modify the formatted prompt before generation, or when integrating with a custom pipeline.
Code Reference
Source Location
- Repository: vllm
- File: vllm/entrypoints/llm.py
- Lines: 887-981 (LLM.chat method)
Signature
# High-level: LLM.chat()
def chat(
self,
messages: list[ChatCompletionMessageParam]
| Sequence[list[ChatCompletionMessageParam]],
sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
use_tqdm: bool | Callable[..., tqdm] = True,
lora_request: LoRARequest | None = None,
chat_template: str | None = None,
chat_template_content_format: ChatTemplateContentFormatOption = "auto",
add_generation_prompt: bool = True,
continue_final_message: bool = False,
tools: list[dict[str, Any]] | None = None,
chat_template_kwargs: dict[str, Any] | None = None,
tokenization_kwargs: dict[str, Any] | None = None,
mm_processor_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]
# Low-level: Hugging Face tokenizer
tokenizer.apply_chat_template(
conversation,
tokenize=False,
add_generation_prompt=True,
)
Import
from vllm import LLM
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| messages | list[dict] or list[list[dict]] | Yes | Chat messages with "role" and "content" keys. A single conversation or a batch of conversations |
| sampling_params | SamplingParams, Sequence[SamplingParams], or None | No (default: None) | Sampling configuration, shared across the batch or given per conversation. If None, the model's default sampling parameters are used |
| chat_template | str or None | No (default: None) | Custom Jinja2 chat template. Defaults to the model's built-in template |
| add_generation_prompt | bool | No (default: True) | Whether to append the assistant's opening delimiter to prompt generation |
| continue_final_message | bool | No (default: False) | If True, continues the last message instead of starting a new assistant turn |
| tools | list[dict] or None | No (default: None) | Tool definitions for function-calling models |
| chat_template_content_format | str | No (default: "auto") | Content format passed to the template: "string", "openai", or "auto" to detect |
| use_tqdm | bool or Callable | No (default: True) | Whether to display a progress bar (or a custom tqdm constructor) |
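The interaction between add_generation_prompt and continue_final_message is easiest to see in the rendered text. The function below is an illustrative stand-in for a chat template (hypothetical `<|role|>`/`<|end|>` delimiters, not any real model's format): with add_generation_prompt, a fresh assistant turn is opened; with continue_final_message, the final message is left unclosed so the model continues it rather than starting a new turn.

```python
def render_chat(messages, add_generation_prompt=True, continue_final_message=False):
    """Illustrative rendering (not a real template): each message becomes
    "<|role|>\\ncontent<|end|>\\n". continue_final_message leaves the last
    message open-ended; add_generation_prompt opens an empty assistant turn."""
    parts = []
    for i, m in enumerate(messages):
        if continue_final_message and i == len(messages) - 1:
            # No closing delimiter: generation will continue this message.
            parts.append(f"<|{m['role']}|>\n{m['content']}")
        else:
            parts.append(f"<|{m['role']}|>\n{m['content']}<|end|>\n")
    if add_generation_prompt and not continue_final_message:
        # Open a new, empty assistant turn for the model to fill in.
        parts.append("<|assistant|>\n")
    return "".join(parts)

msgs = [
    {"role": "user", "content": "Write a haiku."},
    {"role": "assistant", "content": "Cherry blossoms"},
]

# New assistant turn: prompt ends with an empty assistant header.
print(render_chat(msgs[:1]))
# Prefill continuation: prompt ends mid-assistant-message.
print(render_chat(msgs, add_generation_prompt=False, continue_final_message=True))
```

The two flags are mutually exclusive in practice: one starts a reply, the other extends a partially written one (useful for prefilling the start of the assistant's answer).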
Outputs
| Name | Type | Description |
|---|---|---|
| results | list[RequestOutput] | Generated responses in the same order as input conversations |
Usage Examples
Single Conversation with LLM.chat()
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"},
]
outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)
Batch Conversation
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)
conversations = [
[{"role": "user", "content": "Explain quantum computing."}],
[{"role": "user", "content": "What is machine learning?"}],
[{"role": "user", "content": "Describe the water cycle."}],
]
outputs = llm.chat(conversations, sampling_params=params)
for output in outputs:
print(output.outputs[0].text)
Manual Template Application
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
tokenizer = llm.get_tokenizer()
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"},
]
# Manually apply the chat template
prompt = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
# Pass the formatted prompt to generate()
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(prompt, sampling_params=params)
print(outputs[0].outputs[0].text)