Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Ggml org Llama cpp Interactive Chat

From Leeroopedia
Knowledge Sources
Domains LLMs, Inference, Chat
Last Updated 2026-02-14 22:00 GMT

Overview

End-to-end process for running multi-turn conversational inference with chat template support using a GGUF instruction-tuned model.

Description

This workflow extends basic text generation with conversation management, enabling multi-turn dialogue with an instruction-tuned language model. It maintains a message history (user and assistant turns), applies the model's chat template to format messages with the correct special tokens and role markers, and generates responses within the structured conversation format. The chat template is either embedded in the GGUF model metadata or can be supplied externally. The system handles context window management, detecting when the conversation approaches the context limit.

Usage

Execute this workflow when you need interactive multi-turn conversation with an instruction-tuned or chat-fine-tuned model. This is appropriate for chatbot applications, question-answering systems, or any scenario requiring back-and-forth dialogue where conversation history must be preserved across turns.

Execution Steps

Step 1: Load Model and Initialize Context

Load the GGUF model and create an inference context, similar to basic text generation. Additionally, retrieve the model's built-in chat template from its metadata, which defines how conversation messages are formatted with role tokens and delimiters.

Key considerations:

  • The model should be instruction-tuned or chat-fine-tuned for best results
  • Chat templates vary by model family (ChatML, Llama, Mistral, etc.)
  • A custom template can override the model's embedded template
  • Context size should be large enough for the expected conversation length

Step 2: Initialize Sampler

Create a sampler chain with appropriate parameters for conversational generation. Chat typically uses temperature-based sampling with top-p and min-p filtering for diverse but coherent responses, unlike the greedy sampling used in basic completion.

Key considerations:

  • Temperature controls randomness (0.0 = deterministic, 1.0+ = creative)
  • Top-p (nucleus sampling) and min-p filter low-probability tokens
  • Repetition penalty helps avoid loops in longer conversations
  • The sampler chain order matters: temperature is applied before top-p

Step 3: Get User Input

Read the user's message from the input source (terminal, pipe, or application). The message is stored as a structured chat message with the "user" role in the conversation history.

Key considerations:

  • Handle multi-line input for complex prompts
  • Detect special commands (exit, quit) for graceful termination
  • System messages can be added to the history for personality or instruction injection

Step 4: Apply Chat Template

Format the entire conversation history (all user and assistant messages) through the chat template engine. This produces a single formatted string with the correct special tokens, role markers, and delimiters that the model expects. Only the new (unprocessed) portion of the formatted string is extracted for tokenization.

Key considerations:

  • The template engine supports Jinja2-style templates
  • Each model family expects different formatting (e.g., ChatML uses im_start/im_end tokens)
  • The template adds a generation prompt after the last user message to trigger assistant response
  • Only the delta (new content since last turn) is tokenized, not the entire conversation

Step 5: Generate Assistant Response

Tokenize the new formatted content, add it to a batch, decode through the model, and sample tokens until an end-of-generation token is produced. The generated response is accumulated, displayed in streaming fashion, and then stored in the conversation history as an assistant message.

Key considerations:

  • End-of-generation detection uses model-specific EOS tokens
  • Response length can be bounded by a maximum token count
  • The response is streamed token-by-token for responsive user experience
  • Context window usage is checked after each response

Step 6: Manage Context Window

Check whether the conversation has consumed most of the available context window. If the context is nearly full, handle the situation by either truncating early conversation history, resetting the context with a summary, or informing the user that the conversation limit is reached.

Key considerations:

  • KV cache position tracking determines remaining capacity
  • Context overflow can cause degraded output quality
  • Advanced strategies include sliding window attention or context shifting
  • The conversation loop returns to Step 3 for the next user turn

Execution Diagram

GitHub URL

Workflow Repository