
Implementation:Unslothai Unsloth FastLanguageModel From Pretrained Vllm

From Leeroopedia


Knowledge Sources
Domains NLP, Reinforcement_Learning, Inference
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete tool from the Unsloth library for loading language models with the vLLM fast-inference backend, intended for reinforcement learning workflows.

Description

This is the same FastLanguageModel.from_pretrained API as the standard QLoRA loader, but configured with fast_inference=True to attach a vLLM inference engine. The vLLM engine enables high-throughput batched generation during GRPO rollouts. The model gains a fast_generate method for vLLM-accelerated inference while retaining standard PyTorch training capability.

This implementation emphasizes the RL-specific parameters: fast_inference, gpu_memory_utilization, max_lora_rank, and float8_kv_cache.
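To see why gpu_memory_utilization and float8_kv_cache matter, a back-of-envelope KV-cache sizing sketch helps. The model shape numbers below are illustrative assumptions (roughly Llama-3.2-1B scale), not values from this page; check a real model's config for actual layer/head counts.

```python
# Rough KV-cache memory per generated token. K and V are each cached once
# per layer, per KV head, per head dimension.
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

# Illustrative shape (assumed, roughly Llama-3.2-1B scale).
NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM = 16, 8, 64

fp16_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 2)
fp8_bytes = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, 1)

print(fp16_bytes)  # 32768 bytes/token with a 16-bit KV cache
print(fp8_bytes)   # 16384 bytes/token: float8_kv_cache=True halves it

# gpu_memory_utilization caps how much of the GPU vLLM may claim for its
# engine; e.g. 0.5 on a 24 GB card leaves roughly half free for training state.
vllm_budget_bytes = 0.5 * 24 * 1024**3
print(int(vllm_budget_bytes // fp16_bytes))  # rough upper bound on cached tokens
```

Lowering gpu_memory_utilization shrinks vLLM's rollout capacity but frees memory for optimizer state and LoRA gradients, which is the trade-off the defaults above balance.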

Usage

Use this when setting up GRPO, PPO, or other RL training with vLLM-accelerated generation. NOT for standard SFT training (use the standard FastLanguageModel_From_Pretrained instead). Requires vLLM to be installed.

Code Reference

Source Location

  • Repository: unsloth
  • File: unsloth/models/loader.py
  • Lines: L121-696

Signature

class FastLanguageModel(FastLlamaModel):
    @staticmethod
    def from_pretrained(
        model_name = "unsloth/Llama-3.2-1B-Instruct",
        max_seq_length = 2048,
        dtype = None,
        load_in_4bit = True,
        fast_inference = True,          # Key: enables vLLM engine
        gpu_memory_utilization = 0.5,   # Key: vLLM GPU memory fraction
        max_lora_rank = 64,             # Key: max LoRA rank for vLLM
        float8_kv_cache = False,        # Key: FP8 KV cache for memory savings
        disable_log_stats = True,
        random_state = 3407,
        *args,
        **kwargs,
    ) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
        """
        Loads model with vLLM inference engine for RL training.

        RL-specific Args:
            fast_inference: Must be True. Enables vLLM engine.
            gpu_memory_utilization: Fraction of GPU memory for vLLM KV cache.
                Lower values leave more memory for training. Default 0.5.
            max_lora_rank: Maximum LoRA rank vLLM can serve. RL typically
                uses higher ranks (64) than SFT (16). Default 64.
            float8_kv_cache: Use FP8 for KV cache to save memory. Default False.
        """

Import

from unsloth import FastLanguageModel

I/O Contract

Inputs

Name                    Type   Required      Description
model_name              str    No            HuggingFace model ID or local path
max_seq_length          int    No            Maximum context length (default: 2048)
load_in_4bit            bool   No            Enable 4-bit quantization (default: True)
fast_inference          bool   Yes (for RL)  Must be True to enable vLLM engine
gpu_memory_utilization  float  No            vLLM GPU memory fraction (default: 0.5)
max_lora_rank           int    No            Max LoRA rank for vLLM serving (default: 64)
float8_kv_cache         bool   No            FP8 KV cache (default: False)

Outputs

Name       Type                 Description
model      PreTrainedModel      Model with vLLM engine attached; has fast_generate method for batched inference
tokenizer  PreTrainedTokenizer  Configured tokenizer
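The input contract above can be sketched as a small config object. This is a hypothetical illustration, not an Unsloth class: the field names and defaults mirror the table, while the validation rules are assumptions about what a sensible RL setup requires.

```python
from dataclasses import dataclass

@dataclass
class RLLoaderConfig:
    """Hypothetical mirror of the from_pretrained inputs (not part of Unsloth)."""
    model_name: str = "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length: int = 2048
    load_in_4bit: bool = True
    fast_inference: bool = True
    gpu_memory_utilization: float = 0.5
    max_lora_rank: int = 64
    float8_kv_cache: bool = False

    def validate(self) -> None:
        # For RL workflows the vLLM engine is mandatory.
        if not self.fast_inference:
            raise ValueError("fast_inference must be True for RL loading")
        # vLLM expects a fraction of total GPU memory in (0, 1].
        if not 0.0 < self.gpu_memory_utilization <= 1.0:
            raise ValueError("gpu_memory_utilization must be in (0, 1]")
        if self.max_lora_rank <= 0:
            raise ValueError("max_lora_rank must be positive")

cfg = RLLoaderConfig(gpu_memory_utilization=0.6)
cfg.validate()  # raises nothing with valid settings
```

Validating such a config before calling from_pretrained fails fast on settings (e.g. fast_inference=False) that would otherwise surface later as a missing fast_generate method.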

Usage Examples

GRPO Model Loading

from unsloth import FastLanguageModel

# Load model with vLLM for GRPO training
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    max_seq_length=4096,
    load_in_4bit=True,
    fast_inference=True,              # Enable vLLM
    gpu_memory_utilization=0.6,       # 60% GPU for vLLM KV cache
    max_lora_rank=64,                 # RL uses higher LoRA rank
)

# Model now has fast_generate for vLLM-accelerated generation
# Used by GRPOTrainer internally for rollouts
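Continuing from the loading example above, rollout generation goes through the attached fast_generate method. A minimal sketch, assuming vLLM's SamplingParams API and illustrative sampling values; running it requires a GPU with vLLM installed and the model/tokenizer from the previous block in scope.

```python
from vllm import SamplingParams

# Sampling settings for rollouts; the values here are illustrative.
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)

# fast_generate takes a batch of prompts and returns vLLM RequestOutput
# objects, one per prompt.
outputs = model.fast_generate(
    ["Explain GRPO in one sentence."],
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)
```

During GRPO training this call happens inside the trainer; invoking it directly is mainly useful for sanity-checking the vLLM engine before a long run.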
