Principle:Mit han lab Llm awq Prompt Template Configuration
Overview
A system that maps model architectures to the prompt formats they expect, so that instruction-following behaves correctly during interactive chat.
Description
Different LLM families (LLaMA-2, LLaMA-3, Vicuna, Falcon, MPT, Qwen) expect different prompt formats with specific system messages, delimiters, and role tags. Using the wrong template causes degraded responses or nonsensical output. A prompt template factory detects the model variant from the model type and path, then returns the appropriate formatter that wraps user inputs in the correct template structure.
For example, LLaMA-2 chat models expect prompts wrapped in [INST] and [/INST] delimiters with a <<SYS>> block for the system message, while Vicuna models use a simple USER: / ASSISTANT: format. LLaMA-3 introduces <|start_header_id|> and <|end_header_id|> tags. Each family also defines different stop tokens that signal the end of a generated response.
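To make the contrast concrete, the following is a minimal sketch of single-turn formatting for the three families named above. The function names are hypothetical and are not taken from the llm-awq repository. The delimiters follow the publicly documented formats for each family, but the exact whitespace and default system messages used by the repository may differ.

```python
# Illustrative single-turn formatters. Function names are hypothetical.

def format_llama2(system: str, user: str) -> str:
    # LLaMA-2 chat: the system message sits in a <<SYS>> block inside [INST].
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user} [/INST]"


def format_vicuna(system: str, user: str) -> str:
    # Vicuna: plain-text role labels, no special delimiter tokens.
    return f"{system} USER: {user} ASSISTANT:"


def format_llama3(system: str, user: str) -> str:
    # LLaMA-3: each message is introduced by a header naming the role
    # and terminated by <|eot_id|>.
    return (
        "<|begin_of_text|>"
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


system_msg = "You are a helpful assistant."
question = "What is quantization?"
llama2_prompt = format_llama2(system_msg, question)
vicuna_prompt = format_vicuna(system_msg, question)
llama3_prompt = format_llama3(system_msg, question)
```

The same user input yields three structurally different strings, which is why a model fed another family's format tends to respond poorly.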
The prompt template factory pattern centralizes this logic so that the rest of the chat pipeline (tokenization, generation, streaming) remains model-agnostic. The factory inspects the model_type and model_path strings to determine which prompter subclass to instantiate.
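The dispatch step can be sketched as follows. This is an illustration of the pattern rather than the repository's code: the class names, the `get_prompter` function, the `stop_strings` attribute, and the substring checks are all assumptions. The path check is needed because fine-tuned variants such as Vicuna report the same model_type as their LLaMA base, so the type alone cannot distinguish them.

```python
# Hypothetical sketch of a prompt template factory. All names are assumed.

class BasePrompter:
    template = "{prompt}"   # overridden by each subclass
    stop_strings = ()       # strings that mark the end of a response

    def insert_prompt(self, user_input: str) -> str:
        # Wrap the raw user input in the family-specific template.
        return self.template.format(prompt=user_input)


class Llama2Prompter(BasePrompter):
    template = "[INST] {prompt} [/INST]"
    stop_strings = ("</s>",)


class Llama3Prompter(BasePrompter):
    template = (
        "<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    stop_strings = ("<|eot_id|>",)


class VicunaPrompter(BasePrompter):
    template = "USER: {prompt} ASSISTANT:"
    stop_strings = ("</s>",)


def get_prompter(model_type: str, model_path: str) -> BasePrompter:
    """Pick a prompter from the model type and path (assumed heuristics)."""
    kind = model_type.lower()
    path = model_path.lower()
    if kind == "llama":
        # Same architecture, different chat formats: disambiguate by path.
        if "vicuna" in path:
            return VicunaPrompter()
        if "llama-3" in path or "llama3" in path:
            return Llama3Prompter()
        return Llama2Prompter()
    raise ValueError(f"No prompt template registered for {model_type!r}")


prompter = get_prompter("llama", "/models/vicuna-7b-v1.5-awq")
formatted = prompter.insert_prompt("Hello")
```

Because callers only see the `BasePrompter` interface, supporting a new family means adding one subclass and one branch in the factory.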
Usage
Apply this when deploying any LLM for interactive chat. The prompt template must be configured before the generation loop starts. It is selected once during initialization and then used for every turn in the conversation:
- Detect the model variant from model_type and model_path
- Instantiate the corresponding prompter subclass
- Use insert_prompt() to format each user input before tokenization
- Use update_template() for multi-turn conversation state management
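The steps above can be sketched as a two-turn loop. Only the method names insert_prompt() and update_template() come from this page. The class, its internal state, the Vicuna-style separators, and the `fake_generate` stand-in are assumptions made for illustration, and the repository's prompters may track history differently.

```python
# Hypothetical multi-turn prompter. Internal state layout is an assumption.

class VicunaStylePrompter:
    def __init__(self, system: str = "A chat between a user and an assistant."):
        self.system = system
        self.history = ""   # text of all completed turns
        self.pending = ""   # history plus the current, unanswered turn

    def insert_prompt(self, user_input: str) -> str:
        # Append the new user turn to the history and open the assistant turn.
        self.pending = f"{self.history}USER: {user_input} ASSISTANT:"
        return f"{self.system} {self.pending}"

    def update_template(self, model_output: str) -> None:
        # Fold the model's reply into the history so the next turn sees it.
        self.history = f"{self.pending} {model_output}</s>"


def fake_generate(prompt: str) -> str:
    # Stand-in for tokenization, model generation, and decoding.
    return "ok"


prompter = VicunaStylePrompter()   # selected once, before the loop
prompts = []
for user_input in ["Hello", "What is AWQ?"]:
    full_prompt = prompter.insert_prompt(user_input)  # format before tokenizing
    prompts.append(full_prompt)
    reply = fake_generate(full_prompt)
    prompter.update_template(reply)                   # carry state to next turn
```

The second prompt contains the first exchange, so the model sees the whole conversation while the loop itself stays unaware of the template format.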
Related Pages
Knowledge Sources
- Repo|llm-awq|https://github.com/mit-han-lab/llm-awq
Domains
- NLP
- Deployment