Principle:Liu00222 Open Prompt Injection QLoRA Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Loading, Quantization |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
A technique for loading large language models with 4-bit quantization and LoRA adapter overlays to enable fine-tuned model inference on consumer-grade GPUs.
Description
QLoRA (Quantized Low-Rank Adaptation) Model Loading combines two efficiency techniques: (1) 4-bit NormalFloat quantization of the base model weights using bitsandbytes, which reduces memory by ~4x compared to FP16, and (2) LoRA adapter loading using PEFT (Parameter-Efficient Fine-Tuning), which overlays a small set of trainable parameters on top of the frozen quantized base. In this repository, QLoRA is used to load fine-tuned Mistral models for the DataSentinel injection detection system.
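The ~4x figure follows from simple byte counting. A weights-only sketch (activation and adapter memory are ignored, and 7B is the nominal Mistral parameter count):

```python
# Weights-only memory estimate for a 7B-parameter base model.
params = 7e9

fp16_bytes_per_weight = 2.0   # FP16: 16 bits per weight
nf4_bytes_per_weight = 0.5    # NF4: 4 bits per weight

fp16_gib = params * fp16_bytes_per_weight / 1024**3   # ~13.0 GiB
nf4_gib = params * nf4_bytes_per_weight / 1024**3     # ~3.3 GiB

print(f"FP16: {fp16_gib:.1f} GiB, NF4: {nf4_gib:.1f} GiB, "
      f"ratio: {fp16_gib / nf4_gib:.0f}x")
```

In practice NF4 storage also carries per-block quantization constants, which is what double quantization (compressing those constants themselves) further reduces.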
Usage
Use this principle when you need to load a fine-tuned model for inference under limited GPU memory. In this repository it is used in the DataSentinel detection and PromptLocate localization pipelines, where a QLoRA-fine-tuned Mistral model serves as the detection backbone.
Theoretical Basis
QLoRA freezes the base model in 4-bit precision and trains low-rank adaptation matrices:

y = W_NF4 · x + B·A·x

where W_NF4 is the 4-bit (NormalFloat) quantized base weight matrix and B·A is the low-rank decomposition, with B ∈ R^(d×r), A ∈ R^(r×k), and rank r ≪ min(d, k).
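The adapter update can be sketched numerically. This is a NumPy stand-in with toy dimensions; the real base weights would be 4-bit tensors dequantized on the fly:

```python
import numpy as np

d, k, r = 8, 6, 2                # toy dims: d x k base weight, rank r << min(d, k)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))  # stand-in for the frozen (4-bit) base weights
A = rng.standard_normal((r, k)) * 0.01
B = np.zeros((d, r))             # B initialized to zero: adapter starts as a no-op

x = rng.standard_normal(k)
y = W @ x + B @ (A @ x)          # forward pass: y = W x + B A x

assert np.allclose(y, W @ x)     # with B = 0, output matches the base model
assert r * (d + k) < d * k       # only r*(d+k) adapter params train vs. d*k frozen
```

Initializing B to zero is the standard LoRA choice: training starts from the base model's behavior and the adapter learns only the residual.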
Configuration used in this repo:
- Quantization: NF4 (NormalFloat 4-bit)
- Compute dtype: float16
- Double quantization: enabled
- Base model: Mistral-7B variants
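Under these settings, loading would look roughly like the following. This is a sketch, not the repository's actual code: the adapter path is a placeholder, the base checkpoint name is assumed, and the calls assume recent `transformers`, `bitsandbytes`, and `peft` releases:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_ID = "mistralai/Mistral-7B-v0.1"   # assumed base checkpoint
ADAPTER_DIR = "path/to/qlora-adapter"   # hypothetical fine-tuned adapter directory

# Mirrors the configuration listed above: NF4, float16 compute, double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    quantization_config=bnb_config,
    device_map="auto",                  # place layers on available GPU(s)
)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)  # overlay the LoRA weights
model.eval()                            # inference only; the base stays frozen
```

`PeftModel.from_pretrained` keeps the adapter weights separate from the quantized base, so the same 4-bit base can serve multiple fine-tuned adapters.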