Workflow:Turboderp org Exllamav2 LoRA Adapter Inference

Knowledge Sources	ExLlamaV2 Dynamic Generator Guide
Domains	LLMs, Inference, LoRA, Fine_Tuning
Last Updated	2026-02-15 00:00 GMT

Overview

End-to-end process for loading a quantized base model alongside one or more LoRA adapters and generating text that reflects the adapter's fine-tuned behavior.

Description

This workflow demonstrates how to apply LoRA (Low-Rank Adaptation) adapters to a quantized ExLlamaV2 model at inference time. LoRA adapters are small weight matrices trained on top of a frozen base model to specialize its behavior for particular tasks or formats. ExLlamaV2 supports loading LoRA adapters from standard HuggingFace format directories, applying them to the generator, and even using multiple LoRA adapters concurrently in a batched setting where different requests use different adapters.
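To make the idea of "small weight matrices on top of a frozen base model" concrete, the following plain-Python sketch shows the standard LoRA arithmetic for one linear layer: the output is the frozen projection plus a scaled low-rank correction, y = W x + (alpha / r) B (A x). The helper names and toy matrices are illustrative assumptions and do not reflect how ExLlamaV2 implements this internally.

```python
# Illustrative only: plain-Python LoRA arithmetic, not ExLlamaV2 internals.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def lora_forward(W, A, B, alpha, x):
    """Return W x + (alpha / r) * B (A x), where r is the adapter rank."""
    r = len(A)                           # A has shape (r, d_in)
    base = matvec(W, x)                  # frozen base projection, shape (d_out,)
    low_rank = matvec(B, matvec(A, x))   # B has shape (d_out, r)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, low_rank)]

# Toy sizes: d_in = 3, d_out = 2, rank r = 1.
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[0.0, 0.0, 1.0]]
B = [[0.5],
     [-0.5]]
x = [1.0, 2.0, 4.0]

y = lora_forward(W, A, B, alpha=2.0, x=x)
print(y)  # [5.0, -2.0]
```

Setting B to all zeros recovers the base model's output exactly, which is why an adapter can be toggled on and off without touching the base weights.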

Usage

Execute this workflow when you have a quantized base model and one or more LoRA adapter directories (containing adapter_config.json and adapter_model.safetensors) and need to generate text using the adapter's fine-tuned behavior. This is useful for comparing base model vs. adapted model outputs, serving multiple fine-tuned variants from a single base model, or testing LoRA adapters during development.

Execution Steps

Step 1: Base_Model_Loading

Initialize the model configuration, create a KV cache with lazy allocation, and load the quantized base model with auto-split across the available GPUs. Then initialize the tokenizer from the model configuration. This is the same loading procedure used in the Text Generation workflow.

Key considerations:

The base model must be loaded before any LoRA adapter, since the adapter weights are mapped onto the loaded model's layers
Auto-split with lazy cache is recommended for optimal VRAM distribution
The base model can be any supported quantized format (EXL2, GPTQ)
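The loading step above might look roughly like the sketch below. It assumes a recent ExLlamaV2 release in which ExLlamaV2Config accepts the model directory directly. The function name load_base_model, the max_seq_len default, and the lazy import (used so the helper can be defined on a machine without the library) are choices made for this example.

```python
def load_base_model(model_dir, max_seq_len=4096):
    """Load a quantized model with a lazily allocated cache, auto-split across GPUs.

    Sketch only: assumes the exllamav2 package is installed and model_dir
    points at a supported quantized model directory.
    """
    # Imported lazily so this helper can be defined without a GPU stack present.
    from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

    config = ExLlamaV2Config(model_dir)
    model = ExLlamaV2(config)

    # lazy=True defers cache allocation until load_autosplit places the layers.
    cache = ExLlamaV2Cache(model, max_seq_len=max_seq_len, lazy=True)
    model.load_autosplit(cache, progress=True)

    tokenizer = ExLlamaV2Tokenizer(config)
    return model, cache, tokenizer
```

The returned model, cache, and tokenizer are the three objects the later steps need.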

Step 2: LoRA_Adapter_Loading

Load one or more LoRA adapters from their directories using the LoRA loader. Each adapter directory must contain an adapter_config.json (specifying rank, alpha, target modules) and weight files (adapter_model.safetensors or .bin). The loader maps LoRA weight matrices to the corresponding layers in the base model and prepares them for on-the-fly application during forward passes.
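A minimal sketch of this step follows. inspect_adapter_dir is a hypothetical pre-flight helper that only checks for the files named above and reads the standard PEFT config keys (r, lora_alpha, target_modules). load_adapter then hands the directory to ExLlamaV2Lora.from_directory, which is the loader call as used in the ExLlamaV2 examples to the best of our knowledge. Verify it against your installed version.

```python
import json
from pathlib import Path

WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")

def inspect_adapter_dir(path):
    """Return rank, alpha, target modules and weight file of a PEFT-style adapter directory."""
    root = Path(path)
    config_path = root / "adapter_config.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"missing {config_path}")
    weights = [name for name in WEIGHT_FILES if (root / name).is_file()]
    if not weights:
        raise FileNotFoundError(f"no adapter weights found in {root}")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return {
        "rank": config["r"],
        "alpha": config["lora_alpha"],
        "target_modules": sorted(config.get("target_modules", [])),
        "weights": weights[0],
    }

def load_adapter(model, path):
    """Validate the directory, then load it against an already loaded base model."""
    from exllamav2 import ExLlamaV2Lora  # lazy import: requires exllamav2

    inspect_adapter_dir(path)  # fail early with a clear message
    return ExLlamaV2Lora.from_directory(model, path)
```

Calling load_adapter once per directory yields one adapter object per fine-tuned variant.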

Key considerations:

LoRA adapters are loaded from standard HuggingFace PEFT format
Multiple adapters can be loaded simultaneously
Adapters add minimal VRAM overhead (typically <1% of base model)
The adapter's target modules must match the base model architecture
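The overhead figure can be sanity-checked with simple arithmetic: each adapted projection of shape (d_in, d_out) adds r * (d_in + d_out) parameters. The numbers below (32 layers, hidden size 4096, rank 8, two adapted projections per layer, FP16 adapter storage, about 4 GiB of quantized base weights) are assumptions chosen for illustration, not measurements.

```python
def lora_param_count(rank, projection_shapes):
    """Count LoRA parameters: each adapted (d_in, d_out) projection adds
    a (rank x d_in) matrix A and a (d_out x rank) matrix B."""
    return sum(rank * (d_in + d_out) for d_in, d_out in projection_shapes)

# Hypothetical setup: 32 layers, hidden size 4096, adapting q_proj and v_proj at rank 8.
layers, hidden, rank = 32, 4096, 8
shapes = [(hidden, hidden)] * (2 * layers)

params = lora_param_count(rank, shapes)
adapter_bytes = params * 2       # assuming FP16 storage, 2 bytes per parameter
base_bytes = 4 * 1024**3         # assuming roughly 4 GiB of quantized base weights

print(params)                               # 4194304
print(f"{adapter_bytes / base_bytes:.4%}")  # 0.1953%
```

Higher ranks or more target modules scale the count linearly, so an adapter covering all attention and MLP projections at a large rank takes a correspondingly larger share.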

Step 3: Generator_Configuration

Create the Dynamic Generator with the base model, cache, and tokenizer. Then activate the desired LoRA adapter(s) on the generator using the set_loras method. When LoRA is active, all forward passes through the model will apply the adapter's low-rank weight modifications to the targeted layers (typically attention Q/K/V/O and MLP projections).

Key considerations:

LoRA can be set, changed, or removed between generations, that is, while no generation jobs are queued in the generator
Multiple LoRAs can be applied simultaneously for multi-adapter batching
LoRA application is transparent to the sampling and generation logic
The generator handles LoRA weight injection during the forward pass
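A sketch of the configuration step, assuming the model, cache, and tokenizer from Step 1 and one or more adapter objects from Step 2. The helper name make_generator is an assumption of this example. ExLlamaV2DynamicGenerator and set_loras are the names referred to in the text above.

```python
def make_generator(model, cache, tokenizer, loras=None):
    """Create a dynamic generator and optionally activate LoRA adapters on it.

    `loras` may be a single adapter object, a list of adapters, or None
    to leave the base model unmodified.
    """
    from exllamav2.generator import ExLlamaV2DynamicGenerator  # lazy import

    generator = ExLlamaV2DynamicGenerator(
        model=model,
        cache=cache,
        tokenizer=tokenizer,
    )
    if loras is not None:
        generator.set_loras(loras)  # adapters apply to subsequent forward passes
    return generator
```

Calling generator.set_loras(None) later should return the generator to base-model behavior.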

Step 4: Adapted_Generation

Generate text using the same API as standard text generation, but with the LoRA adapter modifying the model's behavior. The output will reflect the fine-tuned characteristics of the adapter (e.g., different response format, domain knowledge, or style). Optionally compare outputs with and without the adapter by toggling it on/off between generations.

Key considerations:

Prompt format should match the adapter's training format for best results
Greedy sampling can help isolate the adapter's effect vs. sampling randomness
Stop conditions should be set to match the adapter's expected output format
The same base model can serve different adapters for different requests
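The on/off comparison described in this step can be sketched as below. compare_with_and_without_lora is a hypothetical helper that relies only on the generator's set_loras and generate methods. The Alpaca-style template is just one example of a training format and only applies if the adapter was trained on it.

```python
# Example prompt format: only appropriate for adapters trained on Alpaca-style data.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def compare_with_and_without_lora(generator, lora, prompt, max_new_tokens=200, **generate_kwargs):
    """Generate once with no adapter and once with `lora` active; return both outputs.

    Extra keyword arguments are forwarded unchanged to generator.generate().
    The adapter is left active when the function returns.
    """
    generator.set_loras(None)   # disable all adapters
    base_output = generator.generate(
        prompt=prompt, max_new_tokens=max_new_tokens, **generate_kwargs
    )
    generator.set_loras(lora)   # activate the adapter
    lora_output = generator.generate(
        prompt=prompt, max_new_tokens=max_new_tokens, **generate_kwargs
    )
    return base_output, lora_output
```

To isolate the adapter's effect, one option is to forward greedy sampling settings and an end-of-sequence stop condition, for example gen_settings=ExLlamaV2Sampler.Settings.greedy() and stop_conditions=[tokenizer.eos_token_id]. These keyword names follow the ExLlamaV2 examples and should be checked against your installed version.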
