Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Workflow:Turboderp org Exllamav2 Interactive Chat

From Leeroopedia
Revision as of 11:02, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Turboderp_org_Exllamav2_Interactive_Chat.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Knowledge Sources
Domains LLMs, Inference, Chatbot
Last Updated 2026-02-15 00:00 GMT

Overview

End-to-end process for running an interactive multi-turn chat session with a quantized language model using ExLlamaV2's streaming generator and prompt format templates.

Description

This workflow implements a full-featured interactive chatbot that supports multi-turn conversations with context management, streaming token output, configurable prompt formats for different model families, optional speculative decoding (draft model or n-gram), and code block syntax highlighting. It uses the Streaming Generator for token-by-token output with real-time display, handles context overflow by truncating oldest conversation turns, and supports multiple sampling strategies including dynamic temperature and various repetition penalties.

Usage

Execute this workflow when you need an interactive terminal-based chat interface with a quantized language model. This is appropriate for testing model behavior, conversational AI experimentation, or as a reference implementation for building chat applications. The workflow supports all major instruct-tuned model families (Llama, Mistral, ChatML, Vicuna, etc.) through configurable prompt format templates.

Execution Steps

Step 1: Argument_Configuration

Parse command-line arguments for model path, GPU split, prompt format mode, sampling parameters (temperature, top-k, top-p, repetition penalties, DRY, XTC), cache quantization mode, speculative decoding options, and chat display preferences. The model initialization helper provides standard arguments for model directory, GPU allocation, context length, and RoPE scaling.

Key considerations:

  • The mode argument selects the prompt format template (llama, chatml, vicuna, raw, etc.)
  • Sampling parameters have sensible defaults but are fully configurable
  • Cache quantization (Q4, Q6, Q8, FP8) can reduce VRAM usage
  • Draft model path enables speculative decoding for faster generation

Step 2: Model_And_Tokenizer_Init

Initialize the model configuration, load the model with auto-split or manual GPU placement, and create the tokenizer. If a draft model is specified for speculative decoding, it is loaded first on the primary device. The main model is then loaded, and a KV-cache is created with the selected precision mode. Tensor-parallel mode uses a special TP cache wrapper.

Key considerations:

  • Draft model is loaded before the main model to reserve VRAM
  • Auto-split distributes layers across available GPUs by probing VRAM
  • Cache type selection (FP16, FP8, Q4, Q6, Q8) affects memory and quality
  • Model architecture compatibility overrides are applied automatically

Step 3: Prompt_Format_Selection

Select and configure the prompt format template based on the model family. Each format defines the conversation structure including system prompt placement, user/assistant turn delimiters, stop conditions, and encoding options (BOS token, special token encoding). A custom system prompt can override the format default, or system prompts can be disabled entirely.

Key considerations:

  • Over 20 prompt formats are supported (llama, chatml, gemma, mistral, phi3, etc.)
  • Each format specifies its own stop conditions (EOS tokens, delimiter strings)
  • The format controls whether BOS/EOS tokens are added during encoding
  • Custom system prompts can be injected into any format

Step 4: Generator_And_Sampler_Setup

Create the Streaming Generator with the model, cache, tokenizer, and optional draft model for speculative decoding. Configure the sampler settings object with all specified parameters: temperature, top-k, top-p, top-a, typical sampling, skew, repetition penalty, frequency/presence penalties, DRY anti-repetition, and XTC sampling. Set stop conditions from the prompt format.

Key considerations:

  • The Streaming Generator supports token-by-token output via stream_ex()
  • Speculative decoding can use either a draft model or n-gram prediction
  • Dynamic temperature adjusts temperature based on entropy of the output distribution
  • DRY (Dont Repeat Yourself) penalizes repeated n-grams with configurable parameters

Step 5: Conversation_Loop

Enter the main chat loop: accept user input, format it with the prompt template, tokenize the full conversation context, and stream the model response token by token. The context manager maintains conversation history as a list of user prompts and response token sequences. When the context exceeds the model's maximum sequence length minus a reserved response space, the oldest conversation turns are dropped to make room.

Key considerations:

  • Context is rebuilt from full conversation history each turn
  • Oldest turns are pruned when context exceeds limits
  • The system prompt is relocated to the beginning of the truncated context
  • Code blocks in responses receive syntax highlighting via a formatter
  • An amnesia mode forgets context after every response

Step 6: Streaming_Output

Stream the model's response to the terminal character by character as tokens are generated. Each iteration of the streaming loop produces a result containing the decoded text chunk, EOS status, and generated token IDs. Handle code block detection for syntax highlighting, context overflow mid-generation (rebuild context and restart stream), and maximum response length enforcement.

Key considerations:

  • Token-by-token streaming provides real-time output feedback
  • If the KV-cache fills during generation, context is rebuilt and streaming restarts
  • Response length is capped at a configurable maximum (default 1000 tokens)
  • Timing statistics and speculative decoding efficiency can be displayed after each turn

Execution Diagram

GitHub URL

Workflow Repository