Principle:Hiyouga LLaMA Factory Inference Engine Architecture

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Software Architecture, NLP
Last Updated	2026-02-06 19:00 GMT

Overview

The inference engine architecture in LLaMA-Factory implements a backend-agnostic abstraction layer that enables transparent switching between HuggingFace Transformers, vLLM, SGLang, and KTransformers inference backends through a unified chat and scoring API.

Description

LLaMA-Factory serves fine-tuned models through multiple inference backends, each optimized for different deployment scenarios. The architecture uses the strategy pattern where an abstract BaseEngine defines the interface, and concrete engine implementations handle backend-specific details. A ChatModel facade provides both synchronous and asynchronous access to any backend.

The architecture spans two generations:

v0 (legacy) architecture:

BaseEngine (abstract): Defines the async interface with three core methods: chat() for batch generation, stream_chat() for token-by-token streaming, and get_scores() for reward model scoring.
HuggingfaceEngine: Uses model.generate() with GenerationConfig, processes multimodal inputs (images, videos, audios), and supports placeholder-based media injection. Uses TextIteratorStreamer for streaming.
VllmEngine: Wraps vLLM's AsyncLLMEngine with SamplingParams for high-throughput batched inference with continuous batching and PagedAttention.
SGLangEngine: Integrates SGLang's runtime for optimized inference with RadixAttention.
KTransformersEngine: Uses KTransformers for heterogeneous CPU+GPU inference of large sparse models (e.g., DeepSeek-V3).
ChatModel: The user-facing facade that selects the appropriate engine based on infer_backend, runs an async event loop in a background thread, and exposes both sync (chat, stream_chat, get_scores) and async (achat, astream_chat, aget_scores) methods.

v1 architecture:

BaseEngine (abstract): Simplified interface with generate() for streaming and batch_infer() for batch processing.
HuggingFaceEngine: Uses AsyncTextIteratorStreamer for native async streaming, operates on the DistributedInterface for device management.
BaseSampler: Wraps the engine and dispatches to the configured backend based on SampleBackend.
SyncSampler (CLI sampler): Extends BaseSampler with a synchronous interface for interactive CLI chat, bridging the async engine to synchronous iteration via a background event loop.

Usage

The inference engine architecture is used in:

CLI chat (llamafactory-cli chat): Interactive command-line conversations using stream_chat for real-time output.
API serving (llamafactory-cli api): OpenAI-compatible REST API that delegates to the selected engine.
Web UI chat (llamafactory-cli webchat): Browser-based chat interface.
Programmatic access: Direct instantiation of ChatModel for integration into custom applications.

Backend selection is controlled by the infer_backend parameter: "huggingface", "vllm", "sglang", or "ktransformers".

Theoretical Basis

The architecture implements the strategy pattern combined with the facade pattern:

ChatModel (Facade)
  |
  +-- BaseEngine (Strategy Interface)
        |
        +-- HuggingfaceEngine  (HF Transformers)
        +-- VllmEngine          (vLLM)
        +-- SGLangEngine        (SGLang)
        +-- KTransformersEngine (KTransformers)

The BaseEngine abstract class defines the contract:

class BaseEngine(ABC):
    name: EngineName
    model: Union[PreTrainedModel, AsyncLLMEngine]
    tokenizer: PreTrainedTokenizer
    can_generate: bool
    template: Template
    generating_args: dict[str, Any]

    @abstractmethod
    async def chat(self, messages, system, tools, images, videos, audios, **kwargs) -> list[Response]:
        ...

    @abstractmethod
    async def stream_chat(self, messages, ...) -> AsyncGenerator[str, None]:
        ...

    @abstractmethod
    async def get_scores(self, batch_input, **kwargs) -> list[float]:
        ...

The ChatModel facade bridges synchronous and asynchronous worlds:

class ChatModel:
    def __init__(self, args=None):
        # Select engine based on infer_backend
        if model_args.infer_backend == EngineName.HF:
            self.engine = HuggingfaceEngine(...)
        elif model_args.infer_backend == EngineName.VLLM:
            self.engine = VllmEngine(...)
        # ...

        # Background event loop for async-to-sync bridging
        self._loop = asyncio.new_event_loop()
        self._thread = Thread(target=_start_background_loop, args=(self._loop,), daemon=True)
        self._thread.start()

    def chat(self, messages, ...) -> list[Response]:
        task = asyncio.run_coroutine_threadsafe(self.achat(...), self._loop)
        return task.result()

The concurrency model uses an asyncio.Semaphore to limit concurrent requests to the model:

self.semaphore = asyncio.Semaphore(int(os.getenv("MAX_CONCURRENT", "1")))

async def chat(self, messages, ...):
    async with self.semaphore:
        return await asyncio.to_thread(self._chat, ...)

This prevents out-of-memory errors from multiple concurrent generation requests while allowing the event loop to remain responsive.

The Response dataclass provides a structured output that includes both the generated text and metadata:

@dataclass
class Response:
    response_text: str
    response_length: int
    prompt_length: int
    finish_reason: Literal["stop", "length"]

The v1 architecture simplifies the design by:

Using AsyncTextIteratorStreamer for native async iteration without thread bridging.
Delegating model loading and rendering to separate ModelEngine and Renderer components.
Using the plugin system for backend selection via SampleBackend.

class BaseSampler:
    def __init__(self, args, model_args, model, renderer):
        if args.sample_backend == SampleBackend.HF:
            self.engine = HuggingFaceEngine(args, model_args, model, renderer)

    async def generate(self, messages, tools=None) -> AsyncGenerator[str, None]:
        async for token in self.engine.generate(messages, tools):
            yield token

The multimodal input handling in the HuggingFace engine follows a placeholder-based injection pattern where media tokens are prepended to the first message if not already present:

if images is not None:
    if not any(IMAGE_PLACEHOLDER in message["content"] for message in messages):
        messages[0]["content"] = IMAGE_PLACEHOLDER * len(images) + messages[0]["content"]

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment