
Implementation:Vllm project Vllm Chat Template Application

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Natural Language Processing, Prompt Engineering
Last Updated 2026-02-08 13:00 GMT

Overview

A concrete tool, provided by vLLM, for converting structured chat messages into a model-specific prompt format. It wraps the Hugging Face transformers chat template system.

Description

vLLM provides two paths for applying chat templates:

  1. LLM.chat() (recommended): A high-level method that accepts a list of message dictionaries, internally applies the chat template via the renderer subsystem, and then calls generate(). This is the simplest path for chat-style inference.
  2. tokenizer.apply_chat_template() (manual): The underlying Hugging Face transformers method that converts messages into a formatted string. Users can call this directly for more control, then pass the resulting string to LLM.generate().

The LLM.chat() method handles the full pipeline: it resolves the chat template (from the model's tokenizer config or a user-supplied override), renders the messages using the Jinja2 template engine, appends the generation prompt if requested, and passes the result to the generation engine.
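To illustrate what the rendering step produces, the following pure-Python sketch mimics a Llama-3-style template. The delimiters are an assumption shown only for illustration; the real template is model-specific Jinja2 resolved from the tokenizer config.

```python
# Illustrative sketch of chat-template rendering. The special tokens below
# mimic the Llama 3 style but are NOT taken from any real template file.
def render_chat(messages, add_generation_prompt=True):
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(
            f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
            f"{m['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model generates the reply next.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

prompt = render_chat([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
])
```

The appended assistant header at the end is exactly what `add_generation_prompt=True` controls: it cues the model to produce an assistant turn rather than continue the user's text.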

Usage

Use LLM.chat() when your input is already structured as message dictionaries. Use tokenizer.apply_chat_template() directly when you need to inspect or modify the formatted prompt before generation, or when integrating with a custom pipeline.

Code Reference

Source Location

  • Repository: vllm
  • File: vllm/entrypoints/llm.py
  • Lines: 887-981 (LLM.chat method)

Signature

# High-level: LLM.chat()
def chat(
    self,
    messages: list[ChatCompletionMessageParam]
    | Sequence[list[ChatCompletionMessageParam]],
    sampling_params: SamplingParams | Sequence[SamplingParams] | None = None,
    use_tqdm: bool | Callable[..., tqdm] = True,
    lora_request: LoRARequest | None = None,
    chat_template: str | None = None,
    chat_template_content_format: ChatTemplateContentFormatOption = "auto",
    add_generation_prompt: bool = True,
    continue_final_message: bool = False,
    tools: list[dict[str, Any]] | None = None,
    chat_template_kwargs: dict[str, Any] | None = None,
    tokenization_kwargs: dict[str, Any] | None = None,
    mm_processor_kwargs: dict[str, Any] | None = None,
) -> list[RequestOutput]

# Low-level: Hugging Face tokenizer
tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True,
)

Import

from vllm import LLM

I/O Contract

Inputs

Name | Type | Required | Description
messages | list[dict] or list[list[dict]] | Yes | Chat messages with "role" and "content" keys; a single conversation or a batch of conversations
sampling_params | SamplingParams or None | No (default: None) | Sampling configuration for generation; defaults to model defaults
chat_template | str or None | No (default: None) | Custom Jinja2 chat template; defaults to the model's built-in template
add_generation_prompt | bool | No (default: True) | Whether to append the assistant's opening delimiter to prompt generation
continue_final_message | bool | No (default: False) | If True, continues the last message instead of starting a new assistant turn
tools | list[dict] or None | No (default: None) | Tool definitions for function-calling models
chat_template_content_format | str | No (default: "auto") | Content format: "string" or "openai"
use_tqdm | bool | No (default: True) | Whether to display a progress bar
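The interplay between add_generation_prompt and continue_final_message can be sketched in plain Python. This is a simplified model of the behavior with made-up delimiters, not vLLM's actual rendering code:

```python
# Simplified model of how the two flags shape the rendered prompt.
# Delimiters are illustrative, not any specific model's template.
def render(messages, add_generation_prompt=True, continue_final_message=False):
    if add_generation_prompt and continue_final_message:
        raise ValueError("the two options are mutually exclusive")
    rendered = "".join(
        f"<{m['role']}>{m['content']}</{m['role']}>" for m in messages
    )
    if continue_final_message:
        # Strip the closing delimiter so generation continues the last turn.
        rendered = rendered[: -len(f"</{messages[-1]['role']}>")]
    elif add_generation_prompt:
        # Open a fresh assistant turn for the model to fill in.
        rendered += "<assistant>"
    return rendered

msgs = [
    {"role": "user", "content": "Write a haiku."},
    {"role": "assistant", "content": "Autumn moonlight"},
]
new_turn = render(msgs)
continued = render(msgs, add_generation_prompt=False, continue_final_message=True)
```

With the defaults, the prompt ends in an open assistant delimiter (a new turn); with continue_final_message=True, the final assistant message is left unterminated so the model extends it in place.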

Outputs

Name | Type | Description
results | list[RequestOutput] | Generated responses in the same order as the input conversations

Usage Examples

Single Conversation with LLM.chat()

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)

Batch Conversation

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

conversations = [
    [{"role": "user", "content": "Explain quantum computing."}],
    [{"role": "user", "content": "What is machine learning?"}],
    [{"role": "user", "content": "Describe the water cycle."}],
]

outputs = llm.chat(conversations, sampling_params=params)
for output in outputs:
    print(output.outputs[0].text)

Manual Template Application

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
tokenizer = llm.get_tokenizer()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

# Manually apply the chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Pass the formatted prompt to generate()
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(prompt, sampling_params=params)
print(outputs[0].outputs[0].text)
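Tool Definitions with LLM.chat()

The tools parameter accepts OpenAI-style function definitions, which the chat template renders into the prompt for function-calling models. The schema below is a hypothetical example for illustration; the get_weather function and its fields are assumptions, and the llm.chat() call is commented out because it requires a loaded model.

```python
# An OpenAI-style tool definition, the shape accepted by the `tools`
# parameter of llm.chat(). The function name and fields are illustrative.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# With a model loaded, the tool list is forwarded to the chat template:
# outputs = llm.chat(messages, tools=[get_weather_tool])
```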

Related Pages

Implements Principle

Requires Environment
