Implementation: unslothai/unsloth `FastLanguageModel.from_pretrained` (vLLM)
| Knowledge Sources | Details |
|---|---|
| Domains | NLP, Reinforcement_Learning, Inference |
| Last Updated | 2026-02-07 00:00 GMT |
## Overview
A concrete tool, provided by the Unsloth library, for loading language models with the vLLM fast-inference backend for reinforcement learning workflows.
## Description
This is the same FastLanguageModel.from_pretrained API as the standard QLoRA loader, but configured with fast_inference=True to attach a vLLM inference engine. The vLLM engine enables high-throughput batched generation during GRPO rollouts. The model gains a fast_generate method for vLLM-accelerated inference while retaining standard PyTorch training capability.
This implementation emphasizes the RL-specific parameters: fast_inference, gpu_memory_utilization, max_lora_rank, and float8_kv_cache.
## Usage
Use this when setting up GRPO, PPO, or other RL training with vLLM-accelerated generation. It is NOT for standard SFT training (use the standard `FastLanguageModel.from_pretrained` loader without `fast_inference` instead). Requires vLLM to be installed.
## Code Reference
### Source Location
- Repository: unsloth
- File: unsloth/models/loader.py
- Lines: L121-696
### Signature
```python
class FastLanguageModel(FastLlamaModel):
    @staticmethod
    def from_pretrained(
        model_name             = "unsloth/Llama-3.2-1B-Instruct",
        max_seq_length         = 2048,
        dtype                  = None,
        load_in_4bit           = True,
        fast_inference         = True,   # Key: enables vLLM engine
        gpu_memory_utilization = 0.5,    # Key: vLLM GPU memory fraction
        max_lora_rank          = 64,     # Key: max LoRA rank for vLLM
        float8_kv_cache        = False,  # Key: FP8 KV cache for memory savings
        disable_log_stats      = True,
        random_state           = 3407,
        *args,
        **kwargs,
    ) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
        """
        Loads the model with a vLLM inference engine for RL training.

        RL-specific Args:
            fast_inference: Must be True. Enables the vLLM engine.
            gpu_memory_utilization: Fraction of GPU memory for the vLLM KV cache.
                Lower values leave more memory for training. Default 0.5.
            max_lora_rank: Maximum LoRA rank vLLM can serve. RL typically
                uses higher ranks (64) than SFT (16). Default 64.
            float8_kv_cache: Use FP8 for the KV cache to save memory. Default False.
        """
```
### Import
```python
from unsloth import FastLanguageModel
```
## I/O Contract
### Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_name | str | No | HuggingFace model ID or local path |
| max_seq_length | int | No | Maximum context length (default: 2048) |
| load_in_4bit | bool | No | Enable 4-bit quantization (default: True) |
| fast_inference | bool | Yes (for RL) | Must be True to enable vLLM engine |
| gpu_memory_utilization | float | No | vLLM GPU memory fraction (default: 0.5) |
| max_lora_rank | int | No | Max LoRA rank for vLLM serving (default: 64) |
| float8_kv_cache | bool | No | FP8 KV cache (default: False) |
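To see why `float8_kv_cache` roughly halves KV-cache memory, the standard cache-size formula can be evaluated for 16-bit versus 8-bit elements. The model dimensions below are illustrative assumptions (roughly a small Llama-class config), not values read from the library:

```python
# The KV cache stores one key and one value vector per layer, per KV head,
# per token. Model dimensions here are illustrative assumptions, not values
# queried from Unsloth or vLLM.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    # Factor of 2 accounts for keys AND values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

fp16 = kv_cache_bytes(16, 8, 64, 2048, 8, 2)  # 16-bit KV cache
fp8  = kv_cache_bytes(16, 8, 64, 2048, 8, 1)  # float8_kv_cache = True

print(f"fp16 KV cache: {fp16 / 2**20:.0f} MiB")
print(f"fp8  KV cache: {fp8 / 2**20:.0f} MiB")
```

The saving scales with sequence length and batch size, which is why the flag is most useful for long-context rollouts.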
### Outputs
| Name | Type | Description |
|---|---|---|
| model | PreTrainedModel | Model with vLLM engine attached; has fast_generate method for batched inference |
| tokenizer | PreTrainedTokenizer | Configured tokenizer |
## Usage Examples
### GRPO Model Loading
```python
from unsloth import FastLanguageModel

# Load model with vLLM for GRPO training
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name             = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length         = 4096,
    load_in_4bit           = True,
    fast_inference         = True,  # Enable vLLM
    gpu_memory_utilization = 0.6,   # 60% of GPU memory for vLLM KV cache
    max_lora_rank          = 64,    # RL uses a higher LoRA rank
)

# The model now has fast_generate for vLLM-accelerated generation,
# used by GRPOTrainer internally for rollouts.
```
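The `max_lora_rank=64` choice above has a direct parameter cost: LoRA adds two low-rank factors per adapted weight matrix, so trainable parameters scale linearly with rank. A quick sketch, where the 2048x2048 projection size is an illustrative assumption rather than a dimension taken from any specific model:

```python
# LoRA factorizes an update as B @ A, with A of shape (r, d_in) and
# B of shape (d_out, r), so per-matrix trainable params = r * (d_in + d_out).
# The 2048x2048 projection size is an illustrative assumption.

def lora_params(d_in, d_out, rank):
    return rank * (d_in + d_out)

per_matrix_sft = lora_params(2048, 2048, 16)  # typical SFT rank
per_matrix_rl  = lora_params(2048, 2048, 64)  # typical RL rank (max_lora_rank default)

print(f"rank 16: {per_matrix_sft} trainable params per matrix")
print(f"rank 64: {per_matrix_rl} trainable params per matrix")
```

Rank 64 carries 4x the adapter parameters of rank 16; `max_lora_rank` must be at least the rank you later pass when attaching adapters, since vLLM sizes its LoRA buffers from it at load time.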