Environment:Hiyouga LLaMA Factory Optional Inference Backends

Knowledge Sources	LLaMA-Factory, vLLM, SGLang, KTransformers
Domains	Infrastructure, Inference
Last Updated	2026-02-06 20:00 GMT

Overview

Optional high-performance inference backends: vLLM for batched GPU inference, SGLang for HTTP-based serving, and KTransformers for CPU/GPU hybrid inference of large MoE models.

Description

LLaMA Factory supports three alternative inference engines beyond the default HuggingFace Transformers backend. vLLM provides high-throughput GPU inference with continuous batching and PagedAttention. SGLang communicates with a separate HTTP server process for structured generation. KTransformers enables inference of very large models (e.g., DeepSeek-V3 671B) by offloading MoE expert layers to CPU while keeping attention on GPU. Each backend is activated via the infer_backend model argument.
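
As a minimal sketch of backend selection, the helper below builds the kind of argument dictionary in which infer_backend picks the engine. The helper name build_infer_args, the placeholder model path, the template name, and the backend strings other than "vllm" are assumptions for illustration and should be checked against the installed LLaMA Factory version.

```python
# Hypothetical helper for illustration; it is not part of LLaMA Factory.
# Assumed backend identifiers: only "vllm" appears verbatim in the parser
# code quoted under Code Evidence on this page.
KNOWN_BACKENDS = ("huggingface", "vllm", "sglang")


def build_infer_args(model_path: str, template: str, backend: str = "huggingface") -> dict:
    """Return an argument dict that selects an inference backend via infer_backend."""
    if backend not in KNOWN_BACKENDS:
        raise ValueError(f"Unknown backend {backend!r}; expected one of {KNOWN_BACKENDS}")
    return {
        "model_name_or_path": model_path,  # placeholder path, supplied by the caller
        "template": template,              # chat template name, supplied by the caller
        "infer_backend": backend,
    }


if __name__ == "__main__":
    args = build_infer_args("path/to/model", "default", backend="vllm")
    print(args["infer_backend"])
```

Rejecting unknown backend names up front keeps a typo from silently falling back to a different engine.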

Usage

Use this environment when you need high-throughput inference (vLLM), structured generation (SGLang), or inference on models too large for GPU memory (KTransformers). These backends are available for API, CLI, and Web interfaces only, not for training.

System Requirements

Category	Requirement	Notes
Hardware (vLLM)	NVIDIA GPU with >= 16 GB VRAM	PagedAttention requires CUDA
Hardware (KTransformers)	CPU with large RAM + GPU	128 GB+ RAM recommended for 671B models
Hardware (SGLang)	NVIDIA GPU	Runs as a separate server process

Dependencies

vLLM

vllm >= 0.4.3, <= 0.11.0

SGLang

sglang[srt] >= 0.4.5
flashinfer

KTransformers

ktransformers

Liger Kernel (Triton Optimization)

liger-kernel >= 0.5.5

Unsloth (Optimized Loading)

unsloth

Credentials

No additional credentials required beyond the core environment.

Quick Install

# vLLM
pip install "vllm>=0.4.3,<=0.11.0"

# SGLang
pip install "sglang[srt]>=0.4.5" flashinfer

# KTransformers
pip install ktransformers

# Liger Kernel (Triton optimizations)
pip install "liger-kernel>=0.5.5"

# Unsloth (optimized model loading)
pip install unsloth

Code Evidence

Backend version checks from src/llamafactory/hparams/parser.py:160-165:

if model_args.infer_backend == EngineName.VLLM:
    check_version("vllm>=0.4.3,<=0.11.0")
    check_version("vllm", mandatory=True)
elif model_args.infer_backend == EngineName.SGLANG:
    check_version("sglang>=0.4.5")
    check_version("sglang", mandatory=True)
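
The range check above can be illustrated with a small standalone sketch. It is not the check_version implementation; the function names parse_release and version_in_range are hypothetical, and the parsing deliberately ignores pre-release and post-release suffixes, which a full version parser would handle.

```python
import re


def parse_release(version: str, width: int = 3) -> tuple:
    """Extract the leading numeric release segment, padded to `width` parts.

    For example, '0.6.3.post1' becomes (0, 6, 3) and '0.11' becomes (0, 11, 0).
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Cannot parse version string: {version!r}")
    parts = [int(p) for p in match.group(0).split(".")]
    parts += [0] * (width - len(parts))  # pad short versions with zeros
    return tuple(parts)


def version_in_range(version: str, lower: str, upper: str) -> bool:
    """Return True when lower <= version <= upper, compared numerically."""
    return parse_release(lower) <= parse_release(version) <= parse_release(upper)


if __name__ == "__main__":
    # Bounds taken from the vLLM requirement listed on this page.
    print(version_in_range("0.11.0", "0.4.3", "0.11.0"))  # True
    print(version_in_range("0.12.0", "0.4.3", "0.11.0"))  # False
```

Comparing integer tuples rather than raw strings matters here: as strings, "0.11.0" sorts before "0.4.3", so a plain string comparison would wrongly reject a supported release.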

vLLM restrictions from src/llamafactory/hparams/parser.py:477-488:

if model_args.infer_backend == "vllm":
    if finetuning_args.stage != "sft":
        raise ValueError("vLLM engine only supports auto-regressive models.")
    if model_args.quantization_bit is not None:
        raise ValueError("vLLM engine does not support bnb quantization (GPTQ and AWQ are supported).")
    if model_args.rope_scaling is not None:
        raise ValueError("vLLM engine does not support RoPE scaling.")
    if model_args.adapter_name_or_path is not None and len(model_args.adapter_name_or_path) != 1:
        raise ValueError("vLLM only accepts a single adapter. Merge them first.")
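
The same restrictions can be checked before launching anything. The sketch below mirrors the four conditions on a plain dictionary and collects every problem instead of stopping at the first; the function name vllm_config_problems, the dictionary shape, and the default stage of "sft" are assumptions for illustration.

```python
def vllm_config_problems(config: dict) -> list:
    """Return a list of human-readable problems for a vLLM inference config.

    Assumes `adapter_name_or_path`, when present, is a list of adapter paths.
    """
    problems = []
    if config.get("stage", "sft") != "sft":
        problems.append("stage must be 'sft' when using the vLLM engine")
    if config.get("quantization_bit") is not None:
        problems.append("bnb quantization is unsupported; use a GPTQ or AWQ model")
    if config.get("rope_scaling") is not None:
        problems.append("RoPE scaling is unsupported")
    adapters = config.get("adapter_name_or_path")
    if adapters is not None and len(adapters) != 1:
        problems.append("only a single adapter is accepted; merge adapters first")
    return problems


if __name__ == "__main__":
    bad = {"stage": "rm", "quantization_bit": 4, "adapter_name_or_path": ["a", "b"]}
    for problem in vllm_config_problems(bad):
        print("-", problem)
```

Returning all problems at once lets a user fix a configuration in one pass, whereas the parser raises on the first failing condition.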

Optional package detection from src/llamafactory/extras/packages.py:123-124:

def is_vllm_available():
    return _is_package_available("vllm")
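
One plausible way to implement such a detection helper uses only the standard library. This is a sketch, not the body of _is_package_available; the function names below are hypothetical, and the import names "sglang" and "ktransformers" are assumed to match the package names listed under Dependencies.

```python
import importlib.metadata
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if a top-level module called `name` can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


def installed_version(dist_name: str):
    """Return the installed distribution version string, or None if it is not installed."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


def report_optional_backends(names=("vllm", "sglang", "ktransformers")) -> dict:
    """Map each optional backend module name to whether it is importable here."""
    return {name: is_package_available(name) for name in names}


if __name__ == "__main__":
    for name, available in report_optional_backends().items():
        print(f"{name}: {'installed' if available else 'missing'}")
```

Locating the module spec instead of importing it keeps the check cheap, which matters for heavy packages that initialize GPU libraries on import.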

Common Errors

Error Message	Cause	Solution
`vLLM engine only supports auto-regressive models`	Using vLLM for non-SFT stage	Set `stage=sft` for vLLM inference
`vLLM engine does not support bnb quantization`	BnB quantization with vLLM	Use GPTQ/AWQ quantized models instead
`vLLM only accepts a single adapter`	Multiple LoRA adapters with vLLM	Merge adapters first with export
`vLLM/SGLang backend is only available for API, CLI and Web`	Using vLLM/SGLang for training	Use HuggingFace backend for training

Compatibility Notes

vLLM: Supports GPTQ and AWQ pre-quantized models but not BitsAndBytes. Does not support RoPE scaling. Single adapter only.
SGLang: Communicates via HTTP server. Uses separate process for model serving.
KTransformers: Incompatible with DeepSpeed ZeRO-3 and LoRA reward models. Supports CPU-offloaded MoE inference. Set USE_KT=1 to enable.
Liger Kernel: Provides Triton-optimized kernels for supported model architectures. Set enable_liger_kernel=True.
Unsloth: Incompatible with DeepSpeed ZeRO-3 and LoRA reward models.
