Implementation:Unslothai Unsloth FastLanguageModel From Pretrained
| Knowledge Sources | |
|---|---|
| Domains | NLP, Model_Architecture, Quantization |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Concrete tool for loading pretrained language models with 4-bit quantization and optimized Triton kernels provided by the Unsloth library.
Description
FastLanguageModel.from_pretrained is the primary entry point for loading language models in Unsloth. It auto-detects the model architecture from the Hugging Face config, applies BitsAndBytes 4-bit (NF4) quantization by default, patches forward methods with optimized Triton kernels (RoPE, RMSNorm, cross-entropy), and returns a model ready for LoRA adapter injection. It supports more than 14 model architectures, including Llama, Mistral, Gemma, Qwen, Cohere, Granite, and Falcon.
This implementation focuses on the QLoRA SFT use case: loading a model in 4-bit for memory-efficient supervised fine-tuning. For RL workflows with vLLM fast inference, see the vLLM-specific variant.
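To give a rough sense of why 4-bit loading matters for memory, the arithmetic below estimates the storage needed for the weights alone at different bit widths. It is a back-of-the-envelope sketch, not Unsloth code: the helper name `weight_memory_gib` is hypothetical, the 3-billion-parameter count is an illustrative round number, and the estimate deliberately ignores quantization constants, activations, LoRA adapters, and optimizer state.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / (1024 ** 3)


# Illustrative parameter count (an assumption, not a measured model size).
NUM_PARAMS = 3_000_000_000

# Weight-only estimates for 4-bit, 8-bit, and 16-bit storage.
estimates = {bits: weight_memory_gib(NUM_PARAMS, bits) for bits in (4, 8, 16)}

for bits, gib in estimates.items():
    print(f"{bits:>2}-bit weights: ~{gib:.2f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the space of the 16-bit weights; real usage during training will be higher because of the items the estimate leaves out.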
Usage
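Call this as the first step in any QLoRA fine-tuning workflow. Use load_in_4bit=True (the default) for standard QLoRA training on consumer GPUs. Use load_in_8bit=True or load_in_16bit=True for higher precision when VRAM allows. Since load_in_4bit defaults to True, it is safest to pass load_in_4bit=False explicitly when selecting another mode.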
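The choice between the three loading modes can be captured in a small helper that builds the keyword arguments. This is only a sketch: `precision_kwargs` is a hypothetical helper, not part of Unsloth, and it assumes that exactly one `load_in_*` flag should be set at a time, which is why it turns `load_in_4bit` off explicitly whenever another mode is chosen.

```python
def precision_kwargs(mode: str = "4bit") -> dict:
    """Build mutually exclusive load_in_* keyword arguments.

    Hypothetical helper for illustration; `mode` is one of
    "4bit", "8bit", or "16bit".
    """
    flags = {
        "4bit": "load_in_4bit",
        "8bit": "load_in_8bit",
        "16bit": "load_in_16bit",
    }
    if mode not in flags:
        raise ValueError(f"mode must be one of {sorted(flags)}, got {mode!r}")
    # Exactly one flag is True; the others are set to False explicitly.
    return {flag: (name == mode) for name, flag in flags.items()}


print(precision_kwargs("8bit"))
```

The resulting dictionary could then be unpacked into the loader call, for example `FastLanguageModel.from_pretrained(model_name=..., **precision_kwargs("8bit"))`.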
Code Reference
Source Location
- Repository: unsloth
- File: unsloth/models/loader.py
- Lines: L121-696 (dispatches to architecture-specific loaders like FastLlamaModel.from_pretrained at unsloth/models/llama.py:L2133-2633)
Signature
class FastLanguageModel(FastLlamaModel):
    @staticmethod
    def from_pretrained(
        model_name = "unsloth/Llama-3.2-1B-Instruct",
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
        load_in_8bit = False,
        load_in_16bit = False,
        full_finetuning = False,
        token = None,
        device_map = "sequential",
        rope_scaling = None,
        fix_tokenizer = True,
        trust_remote_code = False,
        use_gradient_checkpointing = "unsloth",
        resize_model_vocab = None,
        revision = None,
        use_exact_model_name = False,
        offload_embedding = False,
        float32_mixed_precision = None,
        fast_inference = False,
        gpu_memory_utilization = 0.5,
        float8_kv_cache = False,
        random_state = 3407,
        max_lora_rank = 64,
        disable_log_stats = True,
        qat_scheme = None,
        load_in_fp8 = False,
        unsloth_tiled_mlp = False,
        *args,
        **kwargs,
    ) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
        """
        Loads a pretrained language model with optional quantization and
        Unsloth kernel optimizations.

        Args:
            model_name: HuggingFace model ID or local path.
            max_seq_length: Maximum context length for RoPE scaling.
            dtype: Compute dtype (auto-selects bf16 if supported, else fp16).
            load_in_4bit: Enable 4-bit NF4 quantization (QLoRA). Default True.
            load_in_8bit: Enable 8-bit quantization.
            load_in_16bit: Load in float16 without quantization.
            full_finetuning: Disable LoRA, train all parameters.
            token: HuggingFace Hub authentication token.
            device_map: Device placement strategy. Default "sequential".
            fix_tokenizer: Auto-repair tokenizer issues. Default True.
            use_gradient_checkpointing: "unsloth" for optimized checkpointing.
            fast_inference: Enable vLLM inference engine (for RL).
            random_state: Random seed. Default 3407.
        """
Import
from unsloth import FastLanguageModel
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | No | HuggingFace model ID or local path (default: "unsloth/Llama-3.2-1B-Instruct") |
| max_seq_length | int | No | Maximum context length (default: 2048) |
| dtype | torch.dtype | No | Compute dtype; auto-selects bf16/fp16 if None |
| load_in_4bit | bool | No | Enable 4-bit QLoRA quantization (default: True) |
| load_in_8bit | bool | No | Enable 8-bit quantization (default: False) |
| token | str | No | HuggingFace Hub auth token for gated models |
| fast_inference | bool | No | Enable vLLM engine (default: False, use True for RL) |
Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | Patched model with optimized Triton kernels and quantization applied |
| tokenizer | PreTrainedTokenizer | Configured tokenizer, with special-token issues repaired when fix_tokenizer=True |
Usage Examples
Standard 4-bit QLoRA Loading
from unsloth import FastLanguageModel

# Load Llama 3.2 in 4-bit for QLoRA fine-tuning
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=4096,
    dtype=None,          # Auto-detect bf16/fp16
    load_in_4bit=True,   # 4-bit QLoRA
)

# Model is ready for LoRA adapter injection
print(f"Model dtype: {model.dtype}")
print(f"Tokenizer vocab size: {len(tokenizer)}")
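As a follow-on to the loading step, the sketch below shows one way the LoRA adapter injection might look using `FastLanguageModel.get_peft_model`. The specific values (rank 16, the list of projection modules, zero dropout) are illustrative assumptions rather than recommendations from this page, and `attach_lora` is a hypothetical wrapper. The Unsloth import is deferred into the function so the settings can be inspected on a machine without a GPU.

```python
# Hypothetical LoRA settings for illustration; tune them for your task.
LORA_KWARGS = dict(
    r=16,                                   # LoRA rank (assumed value)
    lora_alpha=16,                          # LoRA scaling factor (assumed value)
    lora_dropout=0.0,
    bias="none",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",   # same option as in from_pretrained
    random_state=3407,
)


def attach_lora(model):
    """Wrap a model returned by from_pretrained with LoRA adapters."""
    # Imported lazily: Unsloth needs a supported GPU environment to import.
    from unsloth import FastLanguageModel

    return FastLanguageModel.get_peft_model(model, **LORA_KWARGS)
```

With the `model` loaded in the example above, `model = attach_lora(model)` would return the adapter-wrapped model to pass to a trainer.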
16-bit Loading for Full Fine-tuning
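from unsloth import FastLanguageModel

# Load a gated base model in 16-bit and train all parameters (no LoRA)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-1B",
    max_seq_length=2048,
    load_in_4bit=False,          # Turn off the 4-bit default
    load_in_16bit=True,
    full_finetuning=True,
    token="hf_your_token_here",  # Placeholder; replace with your own Hub token
)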
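Rather than hard-coding the Hub token, one option is to read it from the environment and build the loader arguments from that. The sketch below assumes the token is stored in the `HF_TOKEN` environment variable; `hf_token_from_env` is a hypothetical helper, not part of Unsloth, and it falls back to `None`, which matches the default of the `token` parameter.

```python
import os


def hf_token_from_env(env=None):
    """Return the Hub token from HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None


# Keyword arguments mirroring the 16-bit full fine-tuning example.
load_kwargs = dict(
    model_name="meta-llama/Llama-3.2-1B",
    max_seq_length=2048,
    load_in_4bit=False,
    load_in_16bit=True,
    full_finetuning=True,
    token=hf_token_from_env(),  # None when HF_TOKEN is not set
)
```

The call then becomes `FastLanguageModel.from_pretrained(**load_kwargs)`, and the token never appears in the source file.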