Principle:Liu00222 Open Prompt Injection QLoRA Model Loading
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Loading, Quantization |
| Last Updated | 2026-02-14 15:00 GMT |
Overview
A technique for loading large language models with 4-bit quantization and LoRA adapter overlays to enable fine-tuned model inference on consumer-grade GPUs.
Description
QLoRA (Quantized Low-Rank Adaptation) Model Loading combines two efficiency techniques: (1) 4-bit NormalFloat quantization of the base model weights using bitsandbytes, which reduces memory by ~4x compared to FP16, and (2) LoRA adapter loading using PEFT (Parameter-Efficient Fine-Tuning), which overlays a small set of trainable parameters on top of the frozen quantized base. In this repository, QLoRA is used to load fine-tuned Mistral models for the DataSentinel injection detection system.
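The ~4x figure follows from simple byte counting. A weights-only sketch (activation and adapter memory are ignored, and 7B is the nominal Mistral parameter count):

```python
# Weights-only memory estimate for a 7B-parameter base model.
params = 7e9

fp16_bytes_per_weight = 2.0   # FP16: 16 bits per weight
nf4_bytes_per_weight = 0.5    # NF4: 4 bits per weight

fp16_gib = params * fp16_bytes_per_weight / 1024**3   # ~13.0 GiB
nf4_gib = params * nf4_bytes_per_weight / 1024**3     # ~3.3 GiB

print(f"FP16: {fp16_gib:.1f} GiB, NF4: {nf4_gib:.1f} GiB, "
      f"ratio: {fp16_gib / nf4_gib:.0f}x")
```

In practice NF4 storage also carries per-block quantization constants, which is what double quantization (compressing those constants themselves) further reduces.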
Usage
Use this principle when you need to load a fine-tuned model for inference under limited GPU memory. In this repository it is used in the DataSentinel detection and PromptLocate localization pipelines, where a QLoRA-fine-tuned Mistral model serves as the detection backbone.
Theoretical Basis
QLoRA freezes the base model in 4-bit precision and trains low-rank adaptation matrices:

y = W_NF4 · x + B·A·x

where W_NF4 is the 4-bit (NormalFloat) quantized base weight matrix and B·A is the low-rank decomposition, with B ∈ R^(d×r), A ∈ R^(r×k), and rank r ≪ min(d, k).
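The adapter update can be sketched numerically. This is a NumPy stand-in with toy dimensions; the real base weights would be 4-bit tensors dequantized on the fly:

```python
import numpy as np

d, k, r = 8, 6, 2                # toy dims: d x k base weight, rank r << min(d, k)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))  # stand-in for the frozen (4-bit) base weights
A = rng.standard_normal((r, k)) * 0.01
B = np.zeros((d, r))             # B initialized to zero: adapter starts as a no-op

x = rng.standard_normal(k)
y = W @ x + B @ (A @ x)          # forward pass: y = W x + B A x

assert np.allclose(y, W @ x)     # with B = 0, output matches the base model
assert r * (d + k) < d * k       # only r*(d+k) adapter params train vs. d*k frozen
```

Initializing B to zero is the standard LoRA choice: training starts from the base model's behavior and the adapter learns only the residual.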
Configuration used in this repo:
- Quantization: NF4 (NormalFloat 4-bit)
- Compute dtype: float16
- Double quantization: enabled
- Base model: Mistral-7B variants
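Under these settings, loading would look roughly like the following. This is a sketch, not the repository's actual code: the adapter path is a placeholder, the base checkpoint name is assumed, and the calls assume recent `transformers`, `bitsandbytes`, and `peft` releases:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel

BASE_ID = "mistralai/Mistral-7B-v0.1"   # assumed base checkpoint
ADAPTER_DIR = "path/to/qlora-adapter"   # hypothetical fine-tuned adapter directory

# Mirrors the configuration listed above: NF4, float16 compute, double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
base = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    quantization_config=bnb_config,
    device_map="auto",                  # place layers on available GPU(s)
)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)  # overlay the LoRA weights
model.eval()                            # inference only; the base stays frozen
```

`PeftModel.from_pretrained` keeps the adapter weights separate from the quantized base, so the same 4-bit base can serve multiple fine-tuned adapters.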