Workflow:Turboderp org Exllamav2 Text Generation

From Leeroopedia
Revision as of 11:01, 16 February 2026 by Admin (talk | contribs) (Auto-imported from workflows/Turboderp_org_Exllamav2_Text_Generation.md)
Knowledge Sources
Domains: LLMs, Inference, Text_Generation
Last Updated: 2026-02-15 00:00 GMT

Overview

End-to-end process for loading a quantized language model and generating text completions using the ExLlamaV2 Dynamic Generator with paged KV-cache and dynamic batching.

Description

This workflow demonstrates the standard procedure for performing text generation inference with ExLlamaV2. It covers loading a quantized model (EXL2 or GPTQ format), initializing the paged KV-cache, creating the Dynamic Generator, configuring sampling parameters, and generating single or batched completions. The Dynamic Generator is the recommended API that supports paged attention, cache deduplication, automatic prompt caching, and concurrent batched generation through a simplified interface.

Usage

Execute this workflow when you have a quantized language model (EXL2 or GPTQ format) and need to generate text completions programmatically. This applies to both single prompts and batch processing of multiple prompts. The Dynamic Generator automatically manages batching, cache allocation, and scheduling, making it suitable for both simple one-off generations and throughput-oriented workloads.

Execution Steps

Step 1: Model_Configuration

Initialize the model configuration by pointing it at the model directory. The configuration parser reads the HuggingFace config.json to auto-detect the model architecture (Llama, Mistral, Qwen2, Gemma, etc.) and sets architecture-specific parameters including attention head counts, hidden dimensions, vocabulary size, and RoPE settings. Optionally apply architecture compatibility overrides for models that require them.

Key considerations:

  • The model directory must contain config.json and weight files
  • Architecture is auto-detected from the config
  • Max sequence length can be overridden for extended context
  • RoPE alpha scaling can be configured for context extension
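
The configuration step can be sketched as follows. This is a minimal illustration, not the workflow's canonical code: the helper names `load_config` and `apply_overrides` are hypothetical, and the sketch assumes the `exllamav2` package exposes `ExLlamaV2Config` with an `arch_compat_overrides()` method and `max_seq_len` / `scale_alpha_value` attributes. The import is deferred so the helpers can be defined without the library installed.

```python
def load_config(model_dir):
    """Parse config.json in model_dir and return the model configuration."""
    # Deferred import: only needed when a real model directory is loaded.
    from exllamav2 import ExLlamaV2Config
    return ExLlamaV2Config(model_dir)


def apply_overrides(config, max_seq_len=None, rope_alpha=None):
    """Apply the optional overrides listed above and return the config."""
    # Architecture compatibility overrides for models that require them.
    config.arch_compat_overrides()
    if max_seq_len is not None:
        # Override the maximum sequence length for extended context.
        config.max_seq_len = max_seq_len
    if rope_alpha is not None:
        # RoPE alpha scaling (attribute name is an assumption, see lead-in).
        config.scale_alpha_value = rope_alpha
    return config


# Usage with a placeholder path:
# config = apply_overrides(load_config("/path/to/model"), max_seq_len=16384, rope_alpha=2.0)
```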

Step 2: Cache_Allocation

Create a KV-cache instance with a specified maximum sequence length. The cache can be allocated lazily (deferred until model loading) to enable auto-split across GPUs. Multiple cache precision modes are available: FP16 (default), Q4 (recommended for memory savings), Q6, and Q8. The cache size, in tokens, is the total shared by all active sequences, so it directly limits batching capacity.

Key considerations:

  • Lazy allocation is required for auto-split GPU placement
  • Cache size in tokens determines concurrent capacity (e.g., 32768 tokens = 4 sequences of 8192)
  • Q4 cache mode offers the best memory-to-quality tradeoff
  • Page size is fixed at 256 tokens for paged attention
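
The capacity arithmetic above can be checked with a short sketch. The helper names are hypothetical, and `make_lazy_cache` assumes the `exllamav2` package exposes `ExLlamaV2Cache` and `ExLlamaV2Cache_Q4` classes that accept `max_seq_len` and `lazy` arguments; the page-counting helpers are plain arithmetic based on the 256-token page size stated above.

```python
PAGE_SIZE = 256  # tokens per page, as stated above


def pages_needed(num_tokens):
    """Number of whole pages required to hold num_tokens (ceiling division)."""
    return -(-num_tokens // PAGE_SIZE)


def max_full_sequences(cache_tokens, seq_tokens):
    """How many sequences of seq_tokens fit in a cache of cache_tokens."""
    return (cache_tokens // PAGE_SIZE) // pages_needed(seq_tokens)


def make_lazy_cache(model, cache_tokens=32768, quantized=False):
    """Create a lazily allocated cache; memory is reserved at model load."""
    # Deferred import: only needed when a real model is available.
    from exllamav2 import ExLlamaV2Cache, ExLlamaV2Cache_Q4
    cache_cls = ExLlamaV2Cache_Q4 if quantized else ExLlamaV2Cache
    return cache_cls(model, max_seq_len=cache_tokens, lazy=True)


# 32768 tokens is 128 pages; an 8192-token sequence needs 32 pages.
print(max_full_sequences(32768, 8192))  # 4, matching the example above
```

Choosing a cache size that is a multiple of the page size avoids a partially usable final page.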

Step 3: Model_Loading

Load the model weights onto GPU(s). Three loading strategies are available: manual GPU split with explicit per-device VRAM allocation, tensor-parallel split across multiple GPUs using NCCL, or auto-split which probes available VRAM and distributes layers accordingly. The auto-split method requires a lazy cache to be passed for accurate VRAM estimation.

Key considerations:

  • Auto-split is the simplest option for most users
  • Tensor-parallel mode splits the weights of each layer across GPUs, so all devices work on every layer in parallel, for lower latency
  • Multi-threaded safetensors loading is used for fast weight deserialization
  • A progress callback can be used to track loading status
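
A sketch of the auto-split and manual-split strategies with a progress callback is shown below. The wrapper names are hypothetical. It assumes the model object provides `load_autosplit(cache, callback=...)` and `load(gpu_split=..., callback=...)`, that the callback receives the number of modules loaded so far and the total, and that `gpu_split` is a list of per-device VRAM budgets in GB. Tensor-parallel loading is omitted here.

```python
def format_progress(done, total):
    """Human-readable loading status."""
    return f"loaded {done}/{total} modules"


def report_progress(done, total):
    """Progress callback (assumed to be called as callback(done, total))."""
    print(format_progress(done, total))


def load_autosplit(model, lazy_cache):
    """Auto-split: layers are distributed across GPUs based on free VRAM.

    The cache must have been created lazily so its memory can be
    accounted for while the layers are being placed.
    """
    model.load_autosplit(lazy_cache, callback=report_progress)
    return model


def load_manual(model, gpu_split_gb):
    """Manual split: explicit VRAM budget per device, e.g. [20, 24]."""
    model.load(gpu_split=gpu_split_gb, callback=report_progress)
    return model
```

With the manual strategy the cache would typically be created after loading, without lazy allocation.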

Step 4: Tokenizer_Initialization

Initialize the tokenizer from the model configuration. ExLlamaV2 supports both SentencePiece and HuggingFace tokenizer formats, auto-detecting the appropriate backend from files in the model directory. The tokenizer handles encoding (text to token IDs), decoding (token IDs to text), special token management, and chat template application.

Key considerations:

  • Tokenizer format is auto-detected from the model directory
  • Special tokens (BOS, EOS, etc.) are handled automatically
  • The tokenizer is needed by the generator for token healing and text decoding
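
Encoding and decoding can be illustrated with a small round trip. The helper names are hypothetical; the sketch assumes `ExLlamaV2Tokenizer` is constructed from the configuration object and that its `encode` method accepts an `add_bos` flag.

```python
def make_tokenizer(config):
    """Build the tokenizer; the backend is detected from the model directory."""
    # Deferred import: only needed when a real configuration is available.
    from exllamav2 import ExLlamaV2Tokenizer
    return ExLlamaV2Tokenizer(config)


def round_trip(tokenizer, text, add_bos=True):
    """Encode text to token IDs, then decode the IDs back to text."""
    # Text -> token IDs (assumed to be a batch-shaped tensor of IDs).
    ids = tokenizer.encode(text, add_bos=add_bos)
    # Token IDs -> text; for a batch of IDs this may be a list of strings.
    decoded = tokenizer.decode(ids)
    return ids, decoded


# Usage, given a config from Step 1:
# tokenizer = make_tokenizer(config)
# ids, decoded = round_trip(tokenizer, "Hello, world")
```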

Step 5: Generator_Setup

Create the Dynamic Generator instance by passing the model, cache, and tokenizer. The generator manages the paged KV-cache, job scheduling, prompt caching, and cache deduplication. Optionally run a warmup pass, which executes a small completion so that CUDA kernels can fully initialize and autotune before timing-sensitive workloads.

Key considerations:

  • The generator expects a cache created with batch size 1 and manages batching internally
  • Warmup is recommended before benchmarking but not required for correctness
  • Maximum batch size and queue size can be configured
  • Speculative decoding can be enabled with a draft model
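
Generator construction might look like the sketch below. The helper names are hypothetical; it assumes `ExLlamaV2DynamicGenerator` in `exllamav2.generator` accepts `model`, `cache`, `tokenizer`, `max_batch_size`, `draft_model`, and `draft_cache` keyword arguments and provides a `warmup()` method. Optional arguments are only passed when set, so the library defaults apply otherwise.

```python
def generator_kwargs(model, cache, tokenizer, max_batch_size=None,
                     draft_model=None, draft_cache=None):
    """Collect constructor arguments, omitting options that were not set."""
    kwargs = {"model": model, "cache": cache, "tokenizer": tokenizer}
    if max_batch_size is not None:
        kwargs["max_batch_size"] = max_batch_size
    if draft_model is not None:
        # Speculative decoding: the draft model needs its own cache.
        kwargs["draft_model"] = draft_model
        kwargs["draft_cache"] = draft_cache
    return kwargs


def make_generator(warmup=True, **kwargs):
    """Create the Dynamic Generator and optionally run a warmup pass."""
    # Deferred import: only needed when a real model is loaded.
    from exllamav2.generator import ExLlamaV2DynamicGenerator
    generator = ExLlamaV2DynamicGenerator(**kwargs)
    if warmup:
        # Recommended before benchmarking; not required for correctness.
        generator.warmup()
    return generator


# Usage, given the objects from Steps 2 to 4:
# generator = make_generator(**generator_kwargs(model, cache, tokenizer))
```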

Step 6: Text_Completion

Generate text by calling the generator with one or more prompts, sampling settings, stop conditions, and a maximum token count. For single completions, pass a string prompt. For batched generation, pass a list of strings. The generator automatically handles prompt encoding, cache allocation, scheduling, and output collection. Sampling settings control temperature, top-k, top-p, repetition penalty, and other parameters.

Key considerations:

  • Single string prompt for one completion; list of strings for batch
  • Stop conditions can be token IDs or strings
  • Sampling settings can be shared or per-prompt in batched mode
  • add_bos should be True for base models that expect a BOS token
  • Token healing corrects tokenization artifacts at prompt boundaries
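
The completion call can be sketched as follows. The helper names and the sampling values are illustrative only. The sketch assumes `ExLlamaV2Sampler.Settings` exposes `temperature`, `top_k`, `top_p`, and `token_repetition_penalty` attributes, and that the generator's `generate` method accepts the keyword arguments shown, including `completion_only` to return the completion without the prompt.

```python
def make_settings(temperature=0.8, top_k=50, top_p=0.9, repetition_penalty=1.05):
    """Build sampling settings (values here are arbitrary examples)."""
    # Deferred import: only needed when the library is installed.
    from exllamav2.generator import ExLlamaV2Sampler
    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = temperature
    settings.top_k = top_k
    settings.top_p = top_p
    settings.token_repetition_penalty = repetition_penalty
    return settings


def complete(generator, prompts, settings, stop_conditions,
             max_new_tokens=200, add_bos=True, token_healing=False):
    """Generate completions for one prompt (str) or a batch (list of str)."""
    return generator.generate(
        prompt=prompts,                   # str for single, list for batch
        max_new_tokens=max_new_tokens,
        gen_settings=settings,
        stop_conditions=stop_conditions,  # token IDs and/or strings
        add_bos=add_bos,
        token_healing=token_healing,
        completion_only=True,
    )


# Usage, given the generator and tokenizer from the earlier steps:
# outputs = complete(generator, ["Once upon a time", "The capital of France is"],
#                    make_settings(), [tokenizer.eos_token_id])
```

Passing a list of prompts lets the generator schedule them as one batch rather than running them one after another.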
