Environment: Hugging Face TRL Quantization Environment
| Knowledge Sources | Details |
|---|---|
| Domains | Infrastructure, Optimization, Quantization |
| Last Updated | 2026-02-06 17:00 GMT |
Overview
Optional bitsandbytes environment for 4-bit and 8-bit model quantization (QLoRA), reducing GPU memory usage during fine-tuning.
Description
This environment provides the bitsandbytes library for loading models in 4-bit or 8-bit quantized formats. When combined with PEFT/LoRA (the QLoRA pattern), it enables fine-tuning of large language models on consumer GPUs with limited VRAM. TRL's `get_quantization_config` utility creates a `BitsAndBytesConfig` based on the `ModelConfig` settings (`load_in_4bit` or `load_in_8bit`).
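As an illustrative sketch only (not TRL's actual implementation, which lives in `trl/trainer/utils.py` and returns a `transformers.BitsAndBytesConfig`), the flag-to-config mapping can be thought of like this; the class and function names here are simplified assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelConfigSketch:
    """Simplified stand-in for TRL's ModelConfig quantization flags."""
    load_in_4bit: bool = False
    load_in_8bit: bool = False

def get_quantization_config_sketch(cfg: ModelConfigSketch) -> Optional[dict]:
    """Sketch of the flag-to-config mapping; real TRL returns a BitsAndBytesConfig."""
    if cfg.load_in_4bit:
        # Real code would build transformers.BitsAndBytesConfig(load_in_4bit=True, ...)
        return {"load_in_4bit": True, "bnb_4bit_compute_dtype": "bfloat16"}
    if cfg.load_in_8bit:
        return {"load_in_8bit": True}
    return None  # no flag set: load at full precision, no quantization config
```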
Usage
Use this environment when training models that are too large to fit in GPU memory at full precision. Required when setting `load_in_4bit=True` or `load_in_8bit=True` in `ModelConfig`, or when using QLoRA workflows.
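A typical QLoRA loading pattern looks like the following. This is a sketch assuming `transformers`, `peft`, a CUDA GPU, and sufficient VRAM are available; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with bf16 compute, following the QLoRA recipe
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the base model in 4-bit (placeholder model id)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized base model for training and attach LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
```

In TRL workflows the same effect is achieved declaratively by passing `load_in_4bit=True` in `ModelConfig` rather than building the config by hand.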
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| OS | Linux | bitsandbytes has limited Windows support |
| Hardware | NVIDIA GPU with CUDA | bitsandbytes requires CUDA-capable GPU |
| Python | >= 3.10 | Must match TRL core requirements |
Dependencies
Python Packages
- `bitsandbytes`
- `peft` >= 0.8.0 (typically used together for QLoRA)
Credentials
No additional credentials required.
Quick Install
```sh
# Install TRL with quantization support
pip install "trl[quantization]"

# Or install bitsandbytes separately
pip install bitsandbytes

# For QLoRA (quantization + PEFT)
pip install "trl[quantization,peft]"
```
Code Evidence
BitsAndBytesConfig usage in `trl/trainer/utils.py` via transformers import:
```python
from transformers import (
    AutoConfig,
    BitsAndBytesConfig,
    PretrainedConfig,
    PreTrainedModel,
    is_comet_available,
)
```
QLoRA bf16 casting in GRPOTrainer (`trl/trainer/grpo_trainer.py:338-346`):
```python
# When using QLoRA, the PEFT adapter weights are converted to bf16 to follow
# the recommendations from the original paper (see https://huggingface.co/papers/2305.14314)
# Non-quantized models do not have the `is_loaded_in_{8,4}bit` attributes
if getattr(model, "is_loaded_in_4bit", False) or getattr(model, "is_loaded_in_8bit", False):
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.bfloat16)
```
bitsandbytes conditional import in `trl/generation/vllm_generation.py:51-52`:
```python
if is_bitsandbytes_available():
    import bitsandbytes as bnb
```
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `ImportError: bitsandbytes` | bitsandbytes not installed | `pip install bitsandbytes` |
| `CUDA Setup failed` | CUDA toolkit not found or incompatible | Ensure the CUDA toolkit is installed and matches the PyTorch CUDA version |
| `ValueError: CPU & disk offloading is not supported for ValueHead models` | Quantized PPO model offloaded to CPU | Ensure sufficient GPU memory; ValueHead models must remain on GPU |
Compatibility Notes
- QLoRA + PEFT: The `autocast_adapter_dtype=False` option is not yet supported for quantized models; TRL manually casts trainable params to bf16 as a workaround.
- 4-bit models: Use `bnb_4bit_compute_dtype=bfloat16` for optimal performance on Ampere+ GPUs.
- PPO ValueHead models: CPU and disk offloading is explicitly unsupported; the model must fit entirely on GPU(s).