Environment:Hiyouga LLaMA Factory Optional Inference Backends
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Inference |
| Last Updated | 2026-02-06 20:00 GMT |
Overview
Optional high-performance inference backends: vLLM for batched GPU inference, SGLang for HTTP-based serving, and KTransformers for CPU/GPU hybrid inference of large MoE models.
Description
LLaMA Factory supports three alternative inference engines beyond the default HuggingFace Transformers backend. vLLM provides high-throughput GPU inference with continuous batching and PagedAttention. SGLang communicates with a separate HTTP server process for structured generation. KTransformers enables inference of very large models (e.g., DeepSeek-V3 671B) by offloading MoE expert layers to CPU while keeping attention on GPU. Each backend is activated via the infer_backend model argument.
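As a rough illustration of selecting a backend through the infer_backend argument, the sketch below builds a plain argument dictionary. The helper `build_infer_args`, the placeholder model path, and the lowercase backend strings are assumptions made for this example; they are not taken from LLaMA Factory's API.

```python
# Hypothetical helper for illustration only; not part of LLaMA Factory.
# The string values below are assumed names for the backends described above.
KNOWN_BACKENDS = {"huggingface", "vllm", "sglang"}


def build_infer_args(model_name_or_path: str, infer_backend: str = "huggingface") -> dict:
    """Return an argument dict that selects an inference backend."""
    if infer_backend not in KNOWN_BACKENDS:
        raise ValueError(f"Unknown infer_backend: {infer_backend!r}")
    return {
        "model_name_or_path": model_name_or_path,
        "infer_backend": infer_backend,
    }


# Select vLLM for a (placeholder) model path.
args = build_infer_args("path/to/model", infer_backend="vllm")
```

Rejecting unknown names early keeps a typo in the backend string from surfacing later as a less obvious loading error.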
Usage
Use this environment when you need high-throughput inference (vLLM), structured generation (SGLang), or inference on models too large for GPU memory (KTransformers). These backends are available for API, CLI, and Web interfaces only, not for training.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Hardware (vLLM) | NVIDIA GPU >= 16GB VRAM | PagedAttention requires CUDA |
| Hardware (KTransformers) | CPU with large RAM + GPU | 128GB+ RAM recommended for 671B models |
| Hardware (SGLang) | NVIDIA GPU | Separate server process |
Dependencies
vLLM
vllm >= 0.4.3, <= 0.11.0
SGLang
sglang[srt] >= 0.4.5, flashinfer
KTransformers
ktransformers
Liger Kernel (Triton Optimization)
liger-kernel >= 0.5.5
Unsloth (Optimized Loading)
unsloth
Credentials
No additional credentials required beyond the core environment.
Quick Install
# vLLM (quote the specifier so the shell does not treat ">" as a redirect)
pip install "vllm>=0.4.3,<=0.11.0"
# SGLang
pip install "sglang[srt]>=0.4.5" flashinfer
# KTransformers
pip install ktransformers
# Liger Kernel (Triton optimizations)
pip install "liger-kernel>=0.5.5"
# Unsloth (optimized model loading)
pip install unsloth
Code Evidence
Backend version checks from src/llamafactory/hparams/parser.py:160-165:
if model_args.infer_backend == EngineName.VLLM:
    check_version("vllm>=0.4.3,<=0.11.0")
    check_version("vllm", mandatory=True)
elif model_args.infer_backend == EngineName.SGLANG:
    check_version("sglang>=0.4.5")
    check_version("sglang", mandatory=True)
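To make the version window concrete, here is a minimal, standard-library-only sketch of a range test such as 0.4.3 <= version <= 0.11.0. The functions `parse_version` and `in_range` are hypothetical and are not the project's `check_version`; they compare only the numeric release parts and ignore suffixes such as `.post1` or `rc1`, so a production check should rely on a proper version parser instead.

```python
import re


def parse_version(text: str) -> tuple:
    """Turn '0.6.3.post1' into (0, 6, 3); parsing stops at the first non-numeric part."""
    numbers = []
    for part in text.split("."):
        match = re.match(r"\d+", part)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def in_range(installed: str, minimum: str, maximum: str) -> bool:
    """Check minimum <= installed <= maximum on the numeric release parts."""
    return parse_version(minimum) <= parse_version(installed) <= parse_version(maximum)


# Numeric comparison matters: as plain strings, "0.11.0" sorts before "0.4.3".
ok = in_range("0.6.3", "0.4.3", "0.11.0")
too_new = in_range("0.12.0", "0.4.3", "0.11.0")
```

Comparing integer tuples rather than raw strings is the point of the sketch, since lexicographic string comparison would place 0.11.0 below 0.4.3.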
vLLM restrictions from src/llamafactory/hparams/parser.py:477-488:
if model_args.infer_backend == "vllm":
    if finetuning_args.stage != "sft":
        raise ValueError("vLLM engine only supports auto-regressive models.")
    if model_args.quantization_bit is not None:
        raise ValueError("vLLM engine does not support bnb quantization (GPTQ and AWQ are supported).")
    if model_args.rope_scaling is not None:
        raise ValueError("vLLM engine does not support RoPE scaling.")
    if model_args.adapter_name_or_path is not None and len(model_args.adapter_name_or_path) != 1:
        raise ValueError("vLLM only accepts a single adapter. Merge them first.")
Optional package detection from src/llamafactory/extras/packages.py:123-124:
def is_vllm_available():
    return _is_package_available("vllm")
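The body of `_is_package_available` is not shown in the excerpt above. One common way to implement this kind of detection, sketched here as an assumption rather than as the project's actual code, is to ask `importlib.util.find_spec` whether a top-level module can be located without importing it. The names `is_package_available` and `available_backends` are hypothetical.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def available_backends() -> dict:
    """Map each optional backend module name to whether it is installed."""
    return {pkg: is_package_available(pkg) for pkg in ("vllm", "sglang", "ktransformers")}


# Report which optional backends this environment could use.
for backend, present in available_backends().items():
    print(f"{backend}: {'installed' if present else 'missing'}")
```

Looking up the module spec avoids the cost and side effects of importing a heavy package just to learn whether it is present.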
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| vLLM engine only supports auto-regressive models | Using vLLM for non-SFT stage | Set stage=sft for vLLM inference |
| vLLM engine does not support bnb quantization | BnB quantization with vLLM | Use GPTQ/AWQ quantized models instead |
| vLLM only accepts a single adapter | Multiple LoRA adapters with vLLM | Merge adapters first with export |
| vLLM/SGLang backend is only available for API, CLI and Web | Using vLLM/SGLang for training | Use HuggingFace backend for training |
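The vLLM restrictions listed in this section can be checked up front before launching anything. The sketch below is a standalone pre-flight helper that collects every violation instead of raising on the first one. The function `vllm_preflight` and its parameter names are hypothetical; only the conditions and messages mirror the checks quoted in Code Evidence.

```python
from typing import List, Optional


def vllm_preflight(
    stage: str,
    quantization_bit: Optional[int],
    rope_scaling: Optional[str],
    adapters: Optional[List[str]],
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config passes."""
    problems = []
    if stage != "sft":
        problems.append("vLLM engine only supports auto-regressive models.")
    if quantization_bit is not None:
        problems.append("vLLM engine does not support bnb quantization (GPTQ and AWQ are supported).")
    if rope_scaling is not None:
        problems.append("vLLM engine does not support RoPE scaling.")
    if adapters is not None and len(adapters) != 1:
        problems.append("vLLM only accepts a single adapter. Merge them first.")
    return problems


# A config that violates all four rules, and one that passes.
bad = vllm_preflight("dpo", 4, "linear", ["adapter_a", "adapter_b"])
good = vllm_preflight("sft", None, None, ["adapter_a"])
```

Returning all problems at once lets a user fix a configuration in a single pass rather than rerunning after each error.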
Compatibility Notes
- vLLM: Supports GPTQ and AWQ pre-quantized models but not BitsAndBytes. Does not support RoPE scaling. Single adapter only.
- SGLang: Communicates via HTTP server. Uses separate process for model serving.
- KTransformers: Incompatible with DeepSpeed ZeRO-3 and LoRA reward models. Supports CPU-offloaded MoE inference. Set USE_KT=1 to enable.
- Liger Kernel: Provides Triton-optimized kernels for supported model architectures. Set enable_liger_kernel=True.
- Unsloth: Incompatible with DeepSpeed ZeRO-3 and LoRA reward models.