Principle:Hiyouga LLaMA Factory Inference Engine Architecture
| Knowledge Sources | |
|---|---|
| Domains | Software Architecture, NLP |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
The inference engine architecture in LLaMA-Factory implements a backend-agnostic abstraction layer that enables transparent switching between HuggingFace Transformers, vLLM, SGLang, and KTransformers inference backends through a unified chat and scoring API.
Description
LLaMA-Factory serves fine-tuned models through multiple inference backends, each optimized for different deployment scenarios. The architecture uses the strategy pattern where an abstract BaseEngine defines the interface, and concrete engine implementations handle backend-specific details. A ChatModel facade provides both synchronous and asynchronous access to any backend.
The architecture spans two generations:
v0 (legacy) architecture:
- BaseEngine (abstract): Defines the async interface with three core methods: chat() for batch generation, stream_chat() for token-by-token streaming, and get_scores() for reward model scoring.
- HuggingfaceEngine: Uses model.generate() with GenerationConfig, processes multimodal inputs (images, videos, audios), and supports placeholder-based media injection. Uses TextIteratorStreamer for streaming.
- VllmEngine: Wraps vLLM's AsyncLLMEngine with SamplingParams for high-throughput batched inference with continuous batching and PagedAttention.
- SGLangEngine: Integrates SGLang's runtime for optimized inference with RadixAttention.
- KTransformersEngine: Uses KTransformers for heterogeneous CPU+GPU inference of large sparse models (e.g., DeepSeek-V3).
- ChatModel: The user-facing facade that selects the appropriate engine based on infer_backend, runs an async event loop in a background thread, and exposes both sync (chat, stream_chat, get_scores) and async (achat, astream_chat, aget_scores) methods.
v1 architecture:
- BaseEngine (abstract): Simplified interface with generate() for streaming and batch_infer() for batch processing.
- HuggingFaceEngine: Uses AsyncTextIteratorStreamer for native async streaming, operates on the DistributedInterface for device management.
- BaseSampler: Wraps the engine and dispatches to the configured backend based on SampleBackend.
- SyncSampler (CLI sampler): Extends BaseSampler with a synchronous interface for interactive CLI chat, bridging the async engine to synchronous iteration via a background event loop.
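The bridge from an async token stream to synchronous iteration, as described for the CLI sampler, can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's implementation: `fake_generate`, `SyncBridge`, and `_next_or_done` are hypothetical names, and the real engine yields model tokens rather than whitespace-split words.

```python
import asyncio
import threading
from collections.abc import AsyncGenerator, Iterator

_DONE = object()  # sentinel marking the end of the async stream


async def fake_generate(prompt: str) -> AsyncGenerator[str, None]:
    """Hypothetical stand-in for an async engine's generate(): yields tokens."""
    for token in prompt.split():
        await asyncio.sleep(0)  # hand control back to the loop, as real I/O would
        yield token


async def _next_or_done(agen: AsyncGenerator[str, None]) -> object:
    """Fetch the next item, or the sentinel once the generator is exhausted."""
    try:
        return await agen.__anext__()
    except StopAsyncIteration:
        return _DONE


class SyncBridge:
    """Hypothetical helper: exposes an async generator as a blocking iterator."""

    def __init__(self) -> None:
        # The event loop lives in a daemon thread so the caller can stay synchronous.
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def iterate(self, agen: AsyncGenerator[str, None]) -> Iterator[str]:
        while True:
            # Schedule one step on the background loop and block until it finishes.
            future = asyncio.run_coroutine_threadsafe(_next_or_done(agen), self._loop)
            item = future.result()
            if item is _DONE:
                return
            yield item

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()
```

A caller can then consume tokens with an ordinary `for` loop, for example `for token in bridge.iterate(fake_generate("hello world")): print(token)`, and call `bridge.close()` when finished.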
Usage
The inference engine architecture is used in:
- CLI chat (llamafactory-cli chat): Interactive command-line conversations using stream_chat for real-time output.
- API serving (llamafactory-cli api): OpenAI-compatible REST API that delegates to the selected engine.
- Web UI chat (llamafactory-cli webchat): Browser-based chat interface.
- Programmatic access: Direct instantiation of ChatModel for integration into custom applications.
Backend selection is controlled by the infer_backend parameter: "huggingface", "vllm", "sglang", or "ktransformers".
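As an illustration of how such a string setting might be validated before an engine is chosen, the sketch below maps the four documented values onto an enum. The member names `SGLANG` and `KT` and the helper `resolve_backend` are assumptions made for this example; only `EngineName.HF`, `EngineName.VLLM`, and the four string values appear in this page.

```python
from enum import Enum


class EngineName(str, Enum):
    """Backend identifiers; SGLANG and KT are assumed member names."""

    HF = "huggingface"
    VLLM = "vllm"
    SGLANG = "sglang"
    KT = "ktransformers"


def resolve_backend(value: str) -> EngineName:
    """Hypothetical helper: turn a config string into an EngineName or fail clearly."""
    try:
        return EngineName(value)
    except ValueError:
        valid = ", ".join(member.value for member in EngineName)
        raise ValueError(
            f"Unknown infer_backend {value!r}; expected one of: {valid}"
        ) from None
```

Failing early with the list of accepted values keeps a misspelled backend from surfacing later as a missing-engine error.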
Theoretical Basis
The architecture implements the strategy pattern combined with the facade pattern:
ChatModel (Facade)
 |
 +-- BaseEngine (Strategy Interface)
      |
      +-- HuggingfaceEngine (HF Transformers)
      +-- VllmEngine (vLLM)
      +-- SGLangEngine (SGLang)
      +-- KTransformersEngine (KTransformers)
The BaseEngine abstract class defines the contract:
class BaseEngine(ABC):
    name: EngineName
    model: Union[PreTrainedModel, AsyncLLMEngine]
    tokenizer: PreTrainedTokenizer
    can_generate: bool
    template: Template
    generating_args: dict[str, Any]

    @abstractmethod
    async def chat(self, messages, system, tools, images, videos, audios, **kwargs) -> list[Response]:
        ...

    @abstractmethod
    async def stream_chat(self, messages, ...) -> AsyncGenerator[str, None]:
        ...

    @abstractmethod
    async def get_scores(self, batch_input, **kwargs) -> list[float]:
        ...
The ChatModel facade bridges synchronous and asynchronous worlds:
class ChatModel:
    def __init__(self, args=None):
        # Select engine based on infer_backend
        if model_args.infer_backend == EngineName.HF:
            self.engine = HuggingfaceEngine(...)
        elif model_args.infer_backend == EngineName.VLLM:
            self.engine = VllmEngine(...)
        # ...
        # Background event loop for async-to-sync bridging
        self._loop = asyncio.new_event_loop()
        self._thread = Thread(target=_start_background_loop, args=(self._loop,), daemon=True)
        self._thread.start()

    def chat(self, messages, ...) -> list[Response]:
        task = asyncio.run_coroutine_threadsafe(self.achat(...), self._loop)
        return task.result()
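To make the combination of strategy and facade concrete, here is a small self-contained sketch that runs with the standard library alone. It is a simplified analogue, not the project's code: `MiniEngine`, `UpperEngine`, `ReverseEngine`, `ENGINES`, and `MiniChatModel` are hypothetical names, and the toy engines return plain strings instead of Response objects.

```python
import asyncio
from abc import ABC, abstractmethod
from threading import Thread


def _start_background_loop(loop: asyncio.AbstractEventLoop) -> None:
    """Thread target: bind the loop to this thread and run it until stopped."""
    asyncio.set_event_loop(loop)
    loop.run_forever()


class MiniEngine(ABC):
    """Strategy interface: every backend implements the same async method."""

    @abstractmethod
    async def chat(self, messages: list[dict[str, str]]) -> list[str]: ...


class UpperEngine(MiniEngine):
    async def chat(self, messages: list[dict[str, str]]) -> list[str]:
        await asyncio.sleep(0)  # placeholder for real asynchronous work
        return [messages[-1]["content"].upper()]


class ReverseEngine(MiniEngine):
    async def chat(self, messages: list[dict[str, str]]) -> list[str]:
        await asyncio.sleep(0)
        return [messages[-1]["content"][::-1]]


# Hypothetical registry standing in for the if/elif chain over infer_backend.
ENGINES: dict[str, type[MiniEngine]] = {"upper": UpperEngine, "reverse": ReverseEngine}


class MiniChatModel:
    """Facade: one object offering both async and blocking entry points."""

    def __init__(self, backend: str) -> None:
        self.engine = ENGINES[backend]()
        self._loop = asyncio.new_event_loop()
        self._thread = Thread(target=_start_background_loop, args=(self._loop,), daemon=True)
        self._thread.start()

    async def achat(self, messages: list[dict[str, str]]) -> list[str]:
        return await self.engine.chat(messages)

    def chat(self, messages: list[dict[str, str]]) -> list[str]:
        # Submit the coroutine to the background loop and block on its result.
        task = asyncio.run_coroutine_threadsafe(self.achat(messages), self._loop)
        return task.result()
```

Callers that already run inside an event loop can await `achat` directly, while scripts and CLI tools can call `chat` without touching asyncio at all.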
The concurrency model uses an asyncio.Semaphore to limit concurrent requests to the model:
self.semaphore = asyncio.Semaphore(int(os.getenv("MAX_CONCURRENT", "1")))

async def chat(self, messages, ...):
    async with self.semaphore:
        return await asyncio.to_thread(self._chat, ...)
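This caps the number of generation requests that run at once (one by default, configurable through the MAX_CONCURRENT environment variable), which reduces the risk of out-of-memory errors. Because asyncio.to_thread moves the blocking generation call onto a worker thread, the event loop remains responsive while a request is being processed.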
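The effect of the semaphore can be observed with a small experiment. The sketch below is an assumption-laden toy, not the engine code: `BoundedWorker` and `run_demo` are hypothetical names, and a short `time.sleep` stands in for a blocking generation call. It records the highest number of calls that were in flight at the same moment.

```python
import asyncio
import threading
import time


class BoundedWorker:
    """Hypothetical worker that tracks how many blocking calls overlap."""

    def __init__(self, max_concurrent: int) -> None:
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of simultaneous _generate calls seen

    def _generate(self, prompt: str) -> str:
        with self._lock:
            self._active += 1
            self.peak = max(self.peak, self._active)
        time.sleep(0.01)  # stand-in for a blocking model.generate() call
        with self._lock:
            self._active -= 1
        return prompt[::-1]

    async def chat(self, prompt: str) -> str:
        # At most max_concurrent coroutines pass this point at a time.
        async with self.semaphore:
            return await asyncio.to_thread(self._generate, prompt)


async def run_demo(max_concurrent: int, n_requests: int) -> tuple[list[str], int]:
    worker = BoundedWorker(max_concurrent)
    results = await asyncio.gather(
        *(worker.chat(f"req{i}") for i in range(n_requests))
    )
    return list(results), worker.peak
```

Running `asyncio.run(run_demo(1, 4))` serializes the four requests, so the recorded peak never exceeds the limit passed in, while `asyncio.gather` still returns results in submission order.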
The Response dataclass provides a structured output that includes both the generated text and metadata:
@dataclass
class Response:
    response_text: str
    response_length: int
    prompt_length: int
    finish_reason: Literal["stop", "length"]
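A brief sketch of how such a record might be populated follows. The helper `build_response` and its rule for choosing `finish_reason` (reaching the token budget counts as "length") are assumptions made for illustration; the page does not state how the engines decide this field.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Response:
    response_text: str
    response_length: int
    prompt_length: int
    finish_reason: Literal["stop", "length"]


def build_response(
    text: str, response_length: int, prompt_length: int, max_new_tokens: int
) -> Response:
    """Hypothetical helper: package generated text with its token counts."""
    # Assumption: a response that used the whole token budget was truncated.
    finish_reason: Literal["stop", "length"] = (
        "length" if response_length >= max_new_tokens else "stop"
    )
    return Response(
        response_text=text,
        response_length=response_length,
        prompt_length=prompt_length,
        finish_reason=finish_reason,
    )
```

Keeping the token counts next to the text lets an API layer report usage statistics without re-tokenizing the output.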
The v1 architecture simplifies the design by:
- Using AsyncTextIteratorStreamer for native async iteration without thread bridging.
- Delegating model loading and rendering to separate ModelEngine and Renderer components.
- Using the plugin system for backend selection via SampleBackend.
class BaseSampler:
    def __init__(self, args, model_args, model, renderer):
        if args.sample_backend == SampleBackend.HF:
            self.engine = HuggingFaceEngine(args, model_args, model, renderer)

    async def generate(self, messages, tools=None) -> AsyncGenerator[str, None]:
        async for token in self.engine.generate(messages, tools):
            yield token
The multimodal input handling in the HuggingFace engine follows a placeholder-based injection pattern: if no message already contains the media placeholder, one placeholder per media item is prepended to the content of the first message. For images:
if images is not None:
    if not any(IMAGE_PLACEHOLDER in message["content"] for message in messages):
        messages[0]["content"] = IMAGE_PLACEHOLDER * len(images) + messages[0]["content"]
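The same rule can be written as a standalone function for experimentation. In this sketch the placeholder value "<image>" and the function name `inject_image_placeholders` are assumptions, and, unlike the snippet above, the function returns a patched copy instead of mutating its input.

```python
from typing import Optional, Sequence

IMAGE_PLACEHOLDER = "<image>"  # assumed value, chosen for illustration


def inject_image_placeholders(
    messages: list[dict[str, str]], images: Optional[Sequence[object]]
) -> list[dict[str, str]]:
    """Prepend one placeholder per image unless a placeholder is already present."""
    if not images or not messages:
        return messages
    if any(IMAGE_PLACEHOLDER in message["content"] for message in messages):
        # The caller positioned the images explicitly; leave the text untouched.
        return messages
    patched = [dict(message) for message in messages]  # shallow copy of each message
    patched[0]["content"] = IMAGE_PLACEHOLDER * len(images) + patched[0]["content"]
    return patched
```

For two images and a first message of "Compare these.", the patched content becomes "<image><image>Compare these.", while a conversation that already mentions the placeholder anywhere is returned unchanged.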
Related Pages
- Implementation:Hiyouga_LLaMA_Factory_Base_Engine
- Implementation:Hiyouga_LLaMA_Factory_Chat_Model
- Implementation:Hiyouga_LLaMA_Factory_HfChatEngine
- Implementation:Hiyouga_LLaMA_Factory_KT_Engine
- Implementation:Hiyouga_LLaMA_Factory_SGLang_Engine
- Implementation:Hiyouga_LLaMA_Factory_VLLM_Engine
- Implementation:Hiyouga_LLaMA_Factory_V1_Inference_Engine
- Implementation:Hiyouga_LLaMA_Factory_V1_Base_Sampler
- Implementation:Hiyouga_LLaMA_Factory_V1_CLI_Sampler