Implementation:Hiyouga LLaMA Factory HfChatEngine

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Inference Engine, Chat
Last Updated	2026-02-06 19:00 GMT

Overview

Concrete HuggingFace Transformers inference engine for chat, streaming, and reward scoring provided by LLaMA Factory.

Description

The HuggingfaceEngine class extends BaseEngine to provide the default inference backend for LLaMA Factory. It loads a model and tokenizer via the framework's model loading utilities, applies the appropriate chat template, and provides three core capabilities: batch chat generation, streaming chat generation, and reward model scoring. The engine supports full multimodal input (images, videos, audios) through the template's multimodal plugin system.

Internally, _process_args handles all input preparation including multimodal placeholder injection, template-based prompt encoding, token ID expansion for media tokens, and generation configuration assembly. The synchronous _chat and _stream_chat methods run under torch.inference_mode(), while the async public methods (chat, stream_chat, get_scores) delegate to threads via asyncio.to_thread with a semaphore for concurrency control.
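The thread-delegation pattern described above can be sketched in isolation. This is a minimal illustration, not the engine's actual code: TinyEngine, its _chat body, and the max_concurrent parameter are hypothetical stand-ins, and only asyncio.to_thread and asyncio.Semaphore are real standard-library calls.

```python
import asyncio
import time


class TinyEngine:
    """Hypothetical stand-in for illustration; not the real HuggingfaceEngine."""

    def __init__(self, max_concurrent: int = 1) -> None:
        self._max_concurrent = max_concurrent
        self._semaphore = None  # created lazily, once an event loop is running

    def _chat(self, prompt: str) -> str:
        # Blocking work. The real engine would run model generation here.
        time.sleep(0.01)
        return prompt.upper()

    async def chat(self, prompt: str) -> str:
        if self._semaphore is None:
            self._semaphore = asyncio.Semaphore(self._max_concurrent)
        # The semaphore caps how many blocking calls run at the same time;
        # to_thread keeps the event loop responsive while one is in flight.
        async with self._semaphore:
            return await asyncio.to_thread(self._chat, prompt)


async def demo() -> list[str]:
    engine = TinyEngine(max_concurrent=1)
    # gather returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(engine.chat(p) for p in ["hi", "there"]))


results = asyncio.run(demo())
print(results)
```

With max_concurrent=1 the two requests are served one after the other even though both coroutines are scheduled at once, which is the effect a concurrency-limiting semaphore is meant to have.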

Usage

This engine is instantiated when the inference backend is EngineName.HF, which is the default when no other engine is configured. The API server, CLI chat interface, and web UI all use it for inference.

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/chat/hf_engine.py
Lines: 1-412

Signature

class HuggingfaceEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    @staticmethod
    def _process_args(
        model: "PreTrainedModel",
        tokenizer: "PreTrainedTokenizer",
        processor: Optional["ProcessorMixin"],
        template: "Template",
        generating_args: dict[str, Any],
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[list["ImageInput"]] = None,
        videos: Optional[list["VideoInput"]] = None,
        audios: Optional[list["AudioInput"]] = None,
        input_kwargs: Optional[dict[str, Any]] = {},
    ) -> tuple[dict[str, Any], int]: ...

    async def chat(self, messages, system, tools, images, videos, audios, **input_kwargs) -> list["Response"]: ...
    async def stream_chat(self, messages, system, tools, images, videos, audios, **input_kwargs) -> AsyncGenerator[str, None]: ...
    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...

Import

from llamafactory.chat.hf_engine import HuggingfaceEngine

I/O Contract

Inputs

Name	Type	Required	Description
model_args	ModelArguments	Yes	Model path, quantization, and loading configuration
data_args	DataArguments	Yes	Template name and data processing configuration
finetuning_args	FinetuningArguments	Yes	Training stage (determines SFT vs reward model mode)
generating_args	GeneratingArguments	Yes	Default generation parameters (temperature, top_p, max_new_tokens)
messages	list[dict[str, str]]	Yes	Chat messages with role and content fields
system	str	No	System prompt override
tools	str	No	Tool descriptions for function calling
images	list[ImageInput]	No	Image inputs for multimodal models
videos	list[VideoInput]	No	Video inputs for multimodal models
audios	list[AudioInput]	No	Audio inputs for multimodal models

Outputs

Name	Type	Description
chat result	list[Response]	List of Response objects with response_text, response_length, prompt_length, and finish_reason
stream result	AsyncGenerator[str, None]	Yields generated text incrementally as it is decoded
scores	list[float]	Reward model scores for input sequences
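To make the shape of the chat result concrete, here is a small sketch of a container with the four fields listed in the table. The dataclass below is a stand-in written for illustration. The field types, the two finish_reason values, and the total_tokens helper are assumptions, not taken from the library.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class Response:
    """Illustrative mirror of the fields named in the Outputs table."""

    response_text: str
    response_length: int  # assumed: number of generated tokens
    prompt_length: int  # assumed: number of prompt tokens
    finish_reason: Literal["stop", "length"]  # assumed value set


def total_tokens(responses: list[Response]) -> int:
    """Hypothetical helper: sum prompt and generated tokens over a batch."""
    return sum(r.prompt_length + r.response_length for r in responses)


batch = [
    Response("Hello!", response_length=2, prompt_length=12, finish_reason="stop"),
    Response("Hi there", response_length=3, prompt_length=12, finish_reason="length"),
]
print(total_tokens(batch))
```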

Usage Examples

from llamafactory.chat.hf_engine import HuggingfaceEngine

# Initialize the engine
engine = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)

# Non-streaming chat (chat is a coroutine, so await it inside an async function)
responses = await engine.chat(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    system="You are a helpful assistant.",
)
print(responses[0].response_text)

# Streaming chat
async for token in engine.stream_chat(
    messages=[{"role": "user", "content": "Tell me a story."}],
):
    print(token, end="", flush=True)

# Reward scoring (requires reward model)
scores = await engine.get_scores(
    batch_input=["Good response", "Bad response"],
)
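
The calls above use await and async for, so they need a running event loop. The following sketch shows one way to drive them from an ordinary script with asyncio.run. FakeEngine is a hypothetical stub that only imitates the method names and keyword arguments, so the example runs without loading a model. Its return values are invented for the demonstration, and it returns plain strings rather than Response objects.

```python
import asyncio
from collections.abc import AsyncGenerator


class FakeEngine:
    """Hypothetical stub with the same method names as the engine."""

    async def chat(self, messages, system=None, **input_kwargs) -> list[str]:
        # Echo the last user message instead of generating text.
        return ["echo: " + messages[-1]["content"]]

    async def stream_chat(
        self, messages, system=None, **input_kwargs
    ) -> AsyncGenerator[str, None]:
        for piece in ("Once ", "upon ", "a ", "time."):
            await asyncio.sleep(0)  # hand control back to the event loop
            yield piece


async def main(engine) -> tuple[str, str]:
    replies = await engine.chat(messages=[{"role": "user", "content": "Hello"}])
    # Collect streamed chunks; a real caller might print each one as it arrives.
    chunks = [
        chunk
        async for chunk in engine.stream_chat(
            messages=[{"role": "user", "content": "Tell me a story."}]
        )
    ]
    return replies[0], "".join(chunks)


reply, story = asyncio.run(main(FakeEngine()))
print(reply)
print(story)
```

Swapping FakeEngine() for a real engine instance should leave main largely unchanged, except that the real chat returns Response objects, so the text would be read from response_text.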

Related Pages

Hiyouga_LLaMA_Factory_Chat_Template - Template system used for prompt encoding
Hiyouga_LLaMA_Factory_Multimodal_Plugin - Multimodal processing plugins used by _process_args
Hiyouga_LLaMA_Factory_Constants - EngineName.HF constant
