Implementation:Unslothai Unsloth FastLanguageModel From Pretrained

Knowledge Sources	Unsloth BitsAndBytes Quantization QLoRA
Domains	NLP, Model_Architecture, Quantization
Last Updated	2026-02-07 00:00 GMT

Overview

A concrete tool, provided by the Unsloth library, for loading pretrained language models with 4-bit quantization and optimized Triton kernels.

Description

FastLanguageModel.from_pretrained is the primary entry point for loading language models in Unsloth. It auto-detects the model architecture from the HuggingFace config, applies BitsAndBytes 4-bit (NF4) quantization when load_in_4bit=True, patches forward methods with optimized Triton kernels (RoPE, RMSNorm, cross-entropy), and returns the model together with its tokenizer, with the model ready for LoRA adapter injection. It supports 14+ model architectures, including Llama, Mistral, Gemma, Qwen, Cohere, Granite, and Falcon.

This implementation focuses on the QLoRA SFT use case: loading a model in 4-bit for memory-efficient supervised fine-tuning. For RL workflows with vLLM fast inference, see the vLLM-specific variant.
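
To make the memory argument concrete, the following back-of-the-envelope sketch compares weight storage at different precisions. It is only an approximation under stated assumptions: the helper name weight_memory_gib and the 3-billion parameter count are illustrative, and the estimate ignores quantization constants, layers kept in higher precision, activations, LoRA adapters, and optimizer state.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3


# Assumed round figure for a model in the ~3B-parameter class.
params = 3_000_000_000

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(params, bits):.2f} GiB")
```

Under these assumptions, 4-bit weights take roughly a quarter of the space of 16-bit weights, which is what makes QLoRA practical on consumer GPUs.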

Usage

Call this as the first step in any QLoRA fine-tuning workflow. Keep load_in_4bit=True (the default) for standard QLoRA training on consumer GPUs. When VRAM allows and higher precision is wanted, set load_in_4bit=False and pass load_in_8bit=True or load_in_16bit=True instead.
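
Because load_in_4bit defaults to True, switching precision means turning one flag off and another on. The sketch below shows one way to keep the three flags consistent. The helper precision_kwargs and its mode strings are hypothetical conveniences, not part of Unsloth; only the three flag names come from the signature documented on this page.

```python
def precision_kwargs(mode: str = "4bit") -> dict:
    """Return loading flags with exactly one precision mode enabled.

    `mode` is a hypothetical shorthand: "4bit", "8bit", or "16bit".
    """
    flags = {
        "4bit": "load_in_4bit",
        "8bit": "load_in_8bit",
        "16bit": "load_in_16bit",
    }
    if mode not in flags:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(flags)}")
    # Enable the selected flag and explicitly disable the other two.
    return {flag: (name == mode) for name, flag in flags.items()}


print(precision_kwargs("16bit"))
```

The resulting dictionary could then be unpacked into the loader call, for example FastLanguageModel.from_pretrained(model_name=..., **precision_kwargs("16bit")).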

Code Reference

Source Location

Repository: unsloth
File: unsloth/models/loader.py
Lines: L121-696 (dispatches to architecture-specific loaders like FastLlamaModel.from_pretrained at unsloth/models/llama.py:L2133-2633)

Signature

class FastLanguageModel(FastLlamaModel):
    @staticmethod
    def from_pretrained(
        model_name = "unsloth/Llama-3.2-1B-Instruct",
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
        load_in_8bit = False,
        load_in_16bit = False,
        full_finetuning = False,
        token = None,
        device_map = "sequential",
        rope_scaling = None,
        fix_tokenizer = True,
        trust_remote_code = False,
        use_gradient_checkpointing = "unsloth",
        resize_model_vocab = None,
        revision = None,
        use_exact_model_name = False,
        offload_embedding = False,
        float32_mixed_precision = None,
        fast_inference = False,
        gpu_memory_utilization = 0.5,
        float8_kv_cache = False,
        random_state = 3407,
        max_lora_rank = 64,
        disable_log_stats = True,
        qat_scheme = None,
        load_in_fp8 = False,
        unsloth_tiled_mlp = False,
        *args,
        **kwargs,
    ) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
        """
        Loads a pretrained language model with optional quantization and
        Unsloth kernel optimizations.

        Args:
            model_name: HuggingFace model ID or local path.
            max_seq_length: Maximum context length; also used to configure RoPE scaling.
            dtype: Compute dtype (auto-selects bf16 if supported, else fp16).
            load_in_4bit: Enable 4-bit NF4 quantization (QLoRA). Default True.
            load_in_8bit: Enable 8-bit quantization.
            load_in_16bit: Load weights in 16-bit precision without quantization.
            full_finetuning: Disable LoRA, train all parameters.
            token: HuggingFace Hub authentication token.
            device_map: Device placement strategy. Default "sequential".
            fix_tokenizer: Auto-repair tokenizer issues. Default True.
            use_gradient_checkpointing: "unsloth" for optimized checkpointing.
            fast_inference: Enable vLLM inference engine (for RL).
            random_state: Random seed. Default 3407.
        """

Import

from unsloth import FastLanguageModel

I/O Contract

Inputs

Name	Type	Required	Description
model_name	str	No	HuggingFace model ID or local path (default: "unsloth/Llama-3.2-1B-Instruct")
max_seq_length	int	No	Maximum context length (default: 2048)
dtype	torch.dtype	No	Compute dtype; auto-selects bf16/fp16 if None
load_in_4bit	bool	No	Enable 4-bit QLoRA quantization (default: True)
load_in_8bit	bool	No	Enable 8-bit quantization (default: False)
token	str	No	HuggingFace Hub auth token for gated models
fast_inference	bool	No	Enable vLLM engine (default: False, use True for RL)
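
The dtype row describes a fallback rule: an explicit value wins, otherwise bf16 is used when the hardware supports it, otherwise fp16. The sketch below illustrates that rule only. It is not Unsloth's internal code, the function names are hypothetical, and dtypes are represented as strings so the example runs without PyTorch installed.

```python
from typing import Optional


def bf16_supported() -> bool:
    """Best-effort check; returns False when PyTorch or CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available() and torch.cuda.is_bf16_supported()


def resolve_dtype_name(requested: Optional[str], has_bf16: bool) -> str:
    """Apply the documented rule: explicit dtype, else bf16, else fp16."""
    if requested is not None:
        return requested
    return "bfloat16" if has_bf16 else "float16"


print(resolve_dtype_name(None, bf16_supported()))
```

Passing dtype=None to from_pretrained leaves this decision to the library, which is usually the simplest choice.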

Outputs

Name	Type	Description
model	PreTrainedModel	Patched model with optimized Triton kernels and quantization applied
tokenizer	PreTrainedTokenizer	Configured tokenizer, with special-token issues repaired when fix_tokenizer=True

Usage Examples

Standard 4-bit QLoRA Loading

from unsloth import FastLanguageModel

# Load Llama 3.2 in 4-bit for QLoRA fine-tuning
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Llama-3.2-3B-Instruct",
    max_seq_length=4096,
    dtype=None,         # Auto-detect bf16/fp16
    load_in_4bit=True,  # 4-bit QLoRA
)

# Model is ready for LoRA adapter injection
print(f"Model dtype: {model.dtype}")
print(f"Tokenizer vocab size: {len(tokenizer)}")

16-bit Loading for Full Fine-tuning

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Llama-3.2-1B",
    max_seq_length=2048,
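    load_in_4bit=False,          # Override the 4-bit default
    load_in_16bit=True,          # Load unquantized 16-bit weights
    full_finetuning=True,        # Train all parameters instead of LoRA adapters
    token="hf_your_token_here",  # Placeholder; gated models need a real token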
)

Related Pages

Implements Principle

Principle:Unslothai_Unsloth_Quantized_Model_Loading

Requires Environment

Environment:Unslothai_Unsloth_CUDA_BitsAndBytes
