Environment:Hiyouga LLaMA Factory Optional Inference Backends

From Leeroopedia


Knowledge Sources
Domains: Infrastructure, Inference
Last Updated: 2026-02-06 20:00 GMT

Overview

Optional high-performance inference backends: vLLM for batched GPU inference, SGLang for HTTP-based serving, and KTransformers for CPU/GPU hybrid inference of large MoE models.

Description

LLaMA Factory supports three alternative inference engines beyond the default HuggingFace Transformers backend. vLLM provides high-throughput GPU inference with continuous batching and PagedAttention. SGLang communicates with a separate HTTP server process for structured generation. KTransformers enables inference of very large models (e.g., DeepSeek-V3 671B) by offloading MoE expert layers to CPU while keeping attention on GPU. Each backend is activated via the infer_backend model argument.
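As a rough illustration of selecting a backend through infer_backend, the sketch below builds an argument dictionary and rejects unknown backend names. The helper build_infer_args, the key model_name_or_path, and the spellings "huggingface" and "sglang" are assumptions made for this example; only the argument name infer_backend and the value "vllm" appear on this page. The value for KTransformers is not shown here, so it is left out.

```python
# Hypothetical helper, not part of LLaMA Factory. Only the argument name
# "infer_backend" and the value "vllm" are taken from this page.
SUPPORTED_BACKENDS = ("huggingface", "vllm", "sglang")  # assumed spellings


def build_infer_args(model_path: str, backend: str = "huggingface") -> dict:
    """Return a dict of inference arguments after validating the backend name."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown infer_backend {backend!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    # "model_name_or_path" is an assumed key name used for illustration.
    return {"model_name_or_path": model_path, "infer_backend": backend}


args = build_infer_args("path/to/model", backend="vllm")
print(args)
```

Validating the name up front surfaces a typo before any model weights are loaded.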

Usage

Use this environment when you need high-throughput inference (vLLM), structured generation (SGLang), or inference on models too large for GPU memory (KTransformers). These backends are available for API, CLI, and Web interfaces only, not for training.

System Requirements

Category | Requirement | Notes
Hardware (vLLM) | NVIDIA GPU with >= 16 GB VRAM | PagedAttention requires CUDA
Hardware (KTransformers) | CPU with large RAM + GPU | 128 GB+ RAM recommended for 671B models
Hardware (SGLang) | NVIDIA GPU | Separate server process

Dependencies

vLLM

  • vllm >= 0.4.3, <= 0.11.0

SGLang

  • sglang[srt] >= 0.4.5
  • flashinfer

KTransformers

  • ktransformers

Liger Kernel (Triton Optimization)

  • liger-kernel >= 0.5.5

Unsloth (Optimized Loading)

  • unsloth

Credentials

No additional credentials required beyond the core environment.

Quick Install

# Quote version specifiers so the shell does not treat ">" as a redirect.

# vLLM
pip install "vllm>=0.4.3,<=0.11.0"

# SGLang
pip install "sglang[srt]>=0.4.5" flashinfer

# KTransformers
pip install ktransformers

# Liger Kernel (Triton optimizations)
pip install "liger-kernel>=0.5.5"

# Unsloth (optimized model loading)
pip install unsloth

Code Evidence

Backend version checks from src/llamafactory/hparams/parser.py:160-165:

if model_args.infer_backend == EngineName.VLLM:
    check_version("vllm>=0.4.3,<=0.11.0")
    check_version("vllm", mandatory=True)
elif model_args.infer_backend == EngineName.SGLANG:
    check_version("sglang>=0.4.5")
    check_version("sglang", mandatory=True)
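To make the version bounds concrete, here is a minimal sketch of a range check on an installed package. It is not the check_version helper from the repository, whose implementation is not shown on this page; parse_version, in_range, and installed_in_range are hypothetical names. It compares only the numeric release part of a version, so suffixes such as ".post1" are ignored, which is a simplification of full PEP 440 ordering.

```python
import re
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '0.11.0' or '0.6.3.post1' into a tuple of ints (release part only)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"Cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group().split("."))


def in_range(version: str, minimum: str, maximum: str) -> bool:
    """Check minimum <= version <= maximum on the numeric release parts."""
    return parse_version(minimum) <= parse_version(version) <= parse_version(maximum)


def installed_in_range(package: str, minimum: str, maximum: str):
    """Return None if the package is not installed, else whether it is in range."""
    try:
        version = metadata.version(package)
    except metadata.PackageNotFoundError:
        return None
    return in_range(version, minimum, maximum)


# Bounds taken from the vLLM requirement listed on this page.
print(installed_in_range("vllm", "0.4.3", "0.11.0"))
```

A result of None separates "not installed" from "installed but out of range", which call for different fixes.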

vLLM restrictions from src/llamafactory/hparams/parser.py:477-488:

if model_args.infer_backend == "vllm":
    if finetuning_args.stage != "sft":
        raise ValueError("vLLM engine only supports auto-regressive models.")
    if model_args.quantization_bit is not None:
        raise ValueError("vLLM engine does not support bnb quantization (GPTQ and AWQ are supported).")
    if model_args.rope_scaling is not None:
        raise ValueError("vLLM engine does not support RoPE scaling.")
    if model_args.adapter_name_or_path is not None and len(model_args.adapter_name_or_path) != 1:
        raise ValueError("vLLM only accepts a single adapter. Merge them first.")
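The same restrictions can be sketched as a standalone pre-flight check that collects every problem instead of stopping at the first one. The function vllm_config_problems and its plain-value parameters are hypothetical; they mirror the four conditions in the excerpt above but do not use the repository's argument classes.

```python
from typing import List, Optional, Sequence


def vllm_config_problems(
    stage: str,
    quantization_bit: Optional[int] = None,
    rope_scaling: Optional[str] = None,
    adapter_paths: Optional[Sequence[str]] = None,
) -> List[str]:
    """Collect every setting that the excerpt above would reject for vLLM."""
    problems = []
    if stage != "sft":
        problems.append("stage must be 'sft'")
    if quantization_bit is not None:
        problems.append("bnb quantization is unsupported; use GPTQ or AWQ models")
    if rope_scaling is not None:
        problems.append("RoPE scaling is unsupported")
    if adapter_paths is not None and len(adapter_paths) != 1:
        problems.append("only a single adapter is accepted; merge adapters first")
    return problems


# Example: two adapters and 4-bit quantization requested together.
for problem in vllm_config_problems("sft", quantization_bit=4, adapter_paths=["a", "b"]):
    print("vLLM config problem:", problem)
```

Returning a list rather than raising lets a caller report all conflicts in one pass.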

Optional package detection from src/llamafactory/extras/packages.py:123-124:

def is_vllm_available():
    return _is_package_available("vllm")
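The body of _is_package_available is not shown on this page. One common way to detect an optional package without importing it is importlib.util.find_spec, sketched below under that assumption; the helper name is_package_available is hypothetical and this may differ from what the repository actually does.

```python
import importlib.util


def is_package_available(name: str) -> bool:
    """Return True if a top-level package can be located, without importing it."""
    return importlib.util.find_spec(name) is not None


# Report which of the optional backends from this page are importable here.
for package in ("vllm", "sglang", "ktransformers"):
    status = "available" if is_package_available(package) else "missing"
    print(f"{package}: {status}")
```

Locating the package instead of importing it keeps the check cheap, since heavy GPU libraries are not loaded just to see whether they exist.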

Common Errors

Error Message | Cause | Solution
vLLM engine only supports auto-regressive models | Using vLLM for non-SFT stage | Set stage=sft for vLLM inference
vLLM engine does not support bnb quantization | BnB quantization with vLLM | Use GPTQ/AWQ quantized models instead
vLLM only accepts a single adapter | Multiple LoRA adapters with vLLM | Merge adapters first with export
vLLM/SGLang backend is only available for API, CLI and Web | Using vLLM/SGLang for training | Use HuggingFace backend for training

Compatibility Notes

  • vLLM: Supports GPTQ and AWQ pre-quantized models but not BitsAndBytes. Does not support RoPE scaling. Single adapter only.
  • SGLang: Communicates over HTTP with a separate server process that serves the model.
  • KTransformers: Supports CPU-offloaded MoE inference. Incompatible with DeepSpeed ZeRO-3 and with LoRA reward models. Set USE_KT=1 to enable.
  • Liger Kernel: Provides Triton-optimized kernels for supported model architectures. Set enable_liger_kernel=True.
  • Unsloth: Incompatible with DeepSpeed ZeRO-3 and LoRA reward models.
