Workflow:Turboderp org Exllamav2 LoRA Adapter Inference

From Leeroopedia
Knowledge Sources
Domains: LLMs, Inference, LoRA, Fine_Tuning
Last Updated: 2026-02-15 00:00 GMT

Overview

End-to-end process for loading a quantized base model alongside one or more LoRA adapters and generating text that reflects the adapter's fine-tuned behavior.

Description

This workflow demonstrates how to apply LoRA (Low-Rank Adaptation) adapters to a quantized ExLlamaV2 model at inference time. LoRA adapters are small weight matrices trained on top of a frozen base model to specialize its behavior for particular tasks or formats. ExLlamaV2 supports loading LoRA adapters from standard HuggingFace format directories, applying them to the generator, and keeping several adapters loaded so that one or more can be activated for a given generation.
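To make the idea of a low-rank adapter concrete, the sketch below applies the usual LoRA update, y = W x + (alpha / r) * B (A x), to a toy layer in plain Python. The matrices, sizes, and the helper names matvec and lora_forward are illustrative assumptions; this is not ExLlamaV2 code, which performs the equivalent computation on the GPU.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w, a, b, alpha, x):
    """Return W x + (alpha / r) * B (A x) without materializing W + B A."""
    r = len(a)                      # LoRA rank = number of rows of A
    base = matvec(w, x)             # frozen base projection
    down = matvec(a, x)             # project into the rank-r space
    up = matvec(b, down)            # project back to the output space
    scale = alpha / r
    return [y + scale * d for y, d in zip(base, up)]


# Toy 2x3 layer with a rank-1 adapter (all values are made up).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 1.0, 1.0]]               # r x d_in  = 1 x 3
B = [[0.5],
     [-0.5]]                        # d_out x r = 2 x 1

y = lora_forward(W, A, B, alpha=2.0, x=[1.0, 2.0, 3.0])
print(y)                            # [7.0, -4.0]
```

Because the adapter is applied as two small matrix products, the base weights stay untouched, which is what allows an adapter to be switched on or off without reloading the model.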

Usage

Use this workflow when you have a quantized base model and one or more LoRA adapter directories (each containing adapter_config.json and adapter_model.safetensors or adapter_model.bin) and need to generate text using the adapter's fine-tuned behavior. This is useful for comparing base model vs. adapted model outputs, serving multiple fine-tuned variants from a single base model, or testing LoRA adapters during development.

Execution Steps

Step 1: Base_Model_Loading

Initialize the model configuration, create a KV-cache with lazy allocation, and load the quantized base model using auto-split across available GPUs. Initialize the tokenizer from the model configuration. This is the same loading procedure used in the Text Generation workflow.

Key considerations:

  • The base model must be loaded before any LoRA adapter, because the loader maps adapter weights onto the model's layers
  • Auto-split with lazy cache is recommended for optimal VRAM distribution
  • The base model can be any supported quantized format (EXL2, GPTQ)
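The loading sequence described in this step can be sketched as follows. This assumes a recent exllamav2 release in which ExLlamaV2Config accepts the model directory directly; the function name load_base_model and the default max_seq_len are placeholders chosen for illustration.

```python
def load_base_model(model_dir, max_seq_len=4096):
    """Load a quantized model with a lazily allocated cache and auto-split."""
    # Imported lazily so this function can be defined without the library present.
    from exllamav2 import (
        ExLlamaV2,
        ExLlamaV2Cache,
        ExLlamaV2Config,
        ExLlamaV2Tokenizer,
    )

    config = ExLlamaV2Config(model_dir)   # assumption: constructor takes the directory
    model = ExLlamaV2(config)
    # lazy=True defers cache allocation until load_autosplit places each layer.
    cache = ExLlamaV2Cache(model, max_seq_len=max_seq_len, lazy=True)
    model.load_autosplit(cache, progress=True)
    tokenizer = ExLlamaV2Tokenizer(config)
    return model, cache, tokenizer
```

A call such as load_base_model("/path/to/model") would return the three objects that the later steps pass to the LoRA loader and the generator; the path is a placeholder.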

Step 2: LoRA_Adapter_Loading

Load one or more LoRA adapters from their directories using the LoRA loader. Each adapter directory must contain an adapter_config.json (specifying rank, alpha, and target modules) and a weight file (adapter_model.safetensors or adapter_model.bin). The loader maps the LoRA weight matrices to the corresponding layers in the base model and prepares them for on-the-fly application during forward passes.
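A minimal sketch of this step is shown below. The helper check_adapter_dir is a hypothetical pre-flight check, not part of ExLlamaV2; it only verifies that the files named above exist and reads the PEFT config keys r, lora_alpha, and target_modules. The actual loading is delegated to ExLlamaV2Lora.from_directory.

```python
import json
import os


def check_adapter_dir(adapter_dir):
    """Return (config, weights_path) for a PEFT-style adapter directory."""
    config_path = os.path.join(adapter_dir, "adapter_config.json")
    if not os.path.isfile(config_path):
        raise FileNotFoundError(f"missing adapter_config.json in {adapter_dir}")

    for name in ("adapter_model.safetensors", "adapter_model.bin"):
        weights_path = os.path.join(adapter_dir, name)
        if os.path.isfile(weights_path):
            break
    else:
        raise FileNotFoundError(f"no adapter weights found in {adapter_dir}")

    with open(config_path, encoding="utf-8") as f:
        config = json.load(f)
    return config, weights_path


def load_adapter(model, adapter_dir):
    """Validate the directory, then hand it to ExLlamaV2's LoRA loader."""
    from exllamav2 import ExLlamaV2Lora  # lazy import: needs the library installed

    config, _ = check_adapter_dir(adapter_dir)
    print(f"rank={config.get('r')} alpha={config.get('lora_alpha')} "
          f"targets={config.get('target_modules')}")
    return ExLlamaV2Lora.from_directory(model, adapter_dir)
```

Checking the directory first gives a clearer error message than a failure deep inside the loader, and printing the rank and target modules makes mismatches with the base architecture easier to spot.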

Key considerations:

  • LoRA adapters are loaded from standard HuggingFace PEFT format
  • Multiple adapters can be loaded simultaneously
  • Adapters add little VRAM overhead (typically <1% of the base model); the overhead grows with the adapter rank and the number of targeted modules
  • The adapter's target modules must match the base model architecture
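The VRAM overhead mentioned above can be estimated from the adapter shape: each adapted linear layer of size d_in x d_out adds r * (d_in + d_out) parameters. The sketch below works through one example; the hidden size, layer count, rank, and the assumption of 2 bytes per weight are illustrative values, not measurements from ExLlamaV2.

```python
def lora_param_count(rank, shapes):
    """Sum r * (d_in + d_out) over the adapted linear layers."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)


hidden = 4096
num_layers = 32
# Four attention projections (Q, K, V, O) per layer, each hidden x hidden.
shapes = [(hidden, hidden)] * 4 * num_layers

params = lora_param_count(rank=16, shapes=shapes)
size_mib = params * 2 / 2**20       # assuming 2 bytes (FP16) per weight
print(params, size_mib)             # 16777216 32.0
```

For a base model that occupies several GiB of VRAM, 32 MiB is well under 1%. A higher rank or additional target modules (for example the MLP projections) raise the figure accordingly.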

Step 3: Generator_Configuration

Create the Dynamic Generator with the base model, cache, and tokenizer. Then activate the desired LoRA adapter(s) on the generator using the set_loras method. When LoRA is active, all forward passes through the model will apply the adapter's low-rank weight modifications to the targeted layers (typically attention Q/K/V/O and MLP projections).

Key considerations:

  • LoRA can be set, changed, or removed between generations
  • Passing a list to set_loras activates several adapters at once; the active set applies to every job in the generator, so change it only while the job queue is empty
  • LoRA application is transparent to the sampling and generation logic
  • The generator handles LoRA weight injection during the forward pass
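Creating the generator and toggling adapters might look like the sketch below. The wrapper names build_generator and clear_loras are hypothetical; ExLlamaV2DynamicGenerator and its set_loras method come from the library, and the note about an empty job queue reflects the author's understanding of the method's precondition rather than something stated in this workflow.

```python
def build_generator(model, cache, tokenizer, loras=None):
    """Create a dynamic generator and optionally activate LoRA adapters."""
    from exllamav2.generator import ExLlamaV2DynamicGenerator  # lazy import

    generator = ExLlamaV2DynamicGenerator(
        model=model,
        cache=cache,
        tokenizer=tokenizer,
    )
    if loras is not None:
        # Accepts one adapter or a list; call only while no jobs are queued.
        generator.set_loras(loras)
    return generator


def clear_loras(generator):
    """Return the generator to base-model behavior."""
    generator.set_loras(None)
```

Keeping adapter activation separate from generator construction makes it straightforward to reuse one generator for several adapters in sequence.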

Step 4: Adapted_Generation

Generate text using the same API as standard text generation, but with the LoRA adapter modifying the model's behavior. The output will reflect the fine-tuned characteristics of the adapter (e.g., different response format, domain knowledge, or style). Optionally compare outputs with and without the adapter by toggling it on/off between generations.

Key considerations:

  • Prompt format should match the adapter's training format for best results
  • Greedy decoding removes sampling randomness, which makes it easier to isolate the adapter's effect
  • Stop conditions should be set to match the adapter's expected output format
  • The same base model can serve different adapters for different requests
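The comparison described in this step can be sketched as below. The Alpaca-style template is only an example of matching a prompt to an adapter's training format and should be replaced with whatever format the adapter was actually trained on; alpaca_prompt and compare_outputs are hypothetical helper names. Greedy decoding is approximated here by setting top_k to 1 on the sampler settings.

```python
def alpaca_prompt(instruction):
    """Format an instruction in an Alpaca-style template (illustrative)."""
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n"
        "### Response:\n"
    )


def compare_outputs(generator, tokenizer, lora, instruction, max_new_tokens=200):
    """Generate once without and once with the adapter, using greedy decoding."""
    from exllamav2.generator import ExLlamaV2Sampler  # lazy import

    settings = ExLlamaV2Sampler.Settings()
    settings.top_k = 1                  # keep only the most likely token (greedy)

    prompt = alpaca_prompt(instruction)
    results = {}
    for label, active in (("base", None), ("lora", lora)):
        generator.set_loras(active)     # None disables all adapters
        results[label] = generator.generate(
            prompt=prompt,
            max_new_tokens=max_new_tokens,
            gen_settings=settings,
            stop_conditions=[tokenizer.eos_token_id],
            add_bos=True,
        )
    return results
```

Because both runs share the same prompt and deterministic settings, any difference between results["base"] and results["lora"] can be attributed to the adapter rather than to sampling.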
