Implementation:Hiyouga LLaMA Factory HfChatEngine
| Knowledge Sources | |
|---|---|
| Domains | Inference Engine, Chat |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
The concrete HuggingFace Transformers inference engine in LLaMA Factory, covering chat generation, streaming chat generation, and reward model scoring.
Description
The HuggingfaceEngine class extends BaseEngine and serves as the default inference backend for LLaMA Factory. It loads a model and tokenizer via the framework's model loading utilities, applies the appropriate chat template, and exposes three core capabilities: batch chat generation, streaming chat generation, and reward model scoring. The engine accepts multimodal input (images, videos, and audio) through the template's multimodal plugin system.
Internally, _process_args handles all input preparation including multimodal placeholder injection, template-based prompt encoding, token ID expansion for media tokens, and generation configuration assembly. The synchronous _chat and _stream_chat methods run under torch.inference_mode(), while the async public methods (chat, stream_chat, get_scores) delegate to threads via asyncio.to_thread with a semaphore for concurrency control.
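The thread delegation described above can be illustrated with a minimal sketch. It is not the engine's actual code: `SketchEngine`, `max_concurrent`, and the uppercase "generation" are hypothetical stand-ins, and only `asyncio.to_thread` and the semaphore come from the description.

```python
import asyncio
import time


class SketchEngine:
    """Illustrative stand-in, not the real HuggingfaceEngine."""

    def __init__(self, max_concurrent: int = 1) -> None:
        # Hypothetical limit; the real engine's value is not stated on this page.
        self.semaphore = asyncio.Semaphore(max_concurrent)

    def _chat(self, prompt: str) -> str:
        # Placeholder for the blocking, synchronous generation call.
        time.sleep(0.01)
        return prompt.upper()

    async def chat(self, prompt: str) -> str:
        # Hold the semaphore while the blocking call runs in a worker thread,
        # so the event loop stays responsive and concurrency stays bounded.
        async with self.semaphore:
            return await asyncio.to_thread(self._chat, prompt)


async def demo() -> list[str]:
    engine = SketchEngine(max_concurrent=1)
    # gather preserves argument order regardless of completion order.
    return await asyncio.gather(engine.chat("hello"), engine.chat("world"))


results = asyncio.run(demo())
print(results)
```

With a limit of 1, the two requests run one after the other even though both coroutines are scheduled concurrently.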
Usage
This engine is instantiated when the HuggingFace backend is selected, that is, when EngineName.HF is chosen or when no specific engine is configured (it is the default). The API server, CLI chat interface, and web UI all use it for inference.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/chat/hf_engine.py
- Lines: 1-412
Signature
class HuggingfaceEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    @staticmethod
    def _process_args(
        model: "PreTrainedModel",
        tokenizer: "PreTrainedTokenizer",
        processor: Optional["ProcessorMixin"],
        template: "Template",
        generating_args: dict[str, Any],
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[list["ImageInput"]] = None,
        videos: Optional[list["VideoInput"]] = None,
        audios: Optional[list["AudioInput"]] = None,
        input_kwargs: Optional[dict[str, Any]] = {},
    ) -> tuple[dict[str, Any], int]: ...

    async def chat(self, messages, system, tools, images, videos, audios, **input_kwargs) -> list["Response"]: ...

    async def stream_chat(self, messages, system, tools, images, videos, audios, **input_kwargs) -> AsyncGenerator[str, None]: ...

    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...
Import
from llamafactory.chat.hf_engine import HuggingfaceEngine
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_args | ModelArguments | Yes | Model path, quantization, and loading configuration |
| data_args | DataArguments | Yes | Template name and data processing configuration |
| finetuning_args | FinetuningArguments | Yes | Training stage (determines SFT vs reward model mode) |
| generating_args | GeneratingArguments | Yes | Default generation parameters (temperature, top_p, max_new_tokens) |
| messages | list[dict[str, str]] | Yes | Chat messages with role and content fields |
| system | str | No | System prompt override |
| tools | str | No | Tool descriptions for function calling |
| images | list[ImageInput] | No | Image inputs for multimodal models |
| videos | list[VideoInput] | No | Video inputs for multimodal models |
| audios | list[AudioInput] | No | Audio inputs for multimodal models |
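The Description mentions that _process_args injects multimodal placeholders when media inputs are supplied. The sketch below shows one plausible form of that step, under loudly stated assumptions: the helper name `inject_placeholders`, the `"<image>"` placeholder string, and the rule "prepend one placeholder per image to the first message if none is present" are all illustrative and are not confirmed by this page.

```python
from typing import Optional

IMAGE_PLACEHOLDER = "<image>"  # assumed placeholder string, for illustration only


def inject_placeholders(
    messages: list[dict[str, str]],
    images: Optional[list[object]],
    placeholder: str = IMAGE_PLACEHOLDER,
) -> list[dict[str, str]]:
    """Return a copy of messages with one placeholder per image prepended
    to the first message, unless some message already contains a placeholder."""
    patched = [dict(m) for m in messages]  # shallow copies; the input is not mutated
    if images and not any(placeholder in m["content"] for m in patched):
        patched[0]["content"] = placeholder * len(images) + patched[0]["content"]
    return patched


original = [{"role": "user", "content": "Describe this."}]
patched = inject_placeholders(original, images=["cat.png"])
print(patched[0]["content"])
```

Copying the messages before editing keeps the caller's list unchanged, which matters when the same history is reused across requests.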
Outputs
| Name | Type | Description |
|---|---|---|
| chat result | list[Response] | List of Response objects with response_text, response_length, prompt_length, and finish_reason |
| stream result | AsyncGenerator[str, None] | Yields decoded text chunks as they are generated |
| scores | list[float] | Reward model scores for input sequences |
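To make the chat result row concrete, here is a small stand-in for the Response object. The four field names come from the table above; the class name `ResponseSketch`, the field types, and the `"stop"`/`"length"` values for finish_reason are assumptions, not the library's definition.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass
class ResponseSketch:
    """Hypothetical mirror of the fields listed for Response."""

    response_text: str
    response_length: int  # assumed to count generated tokens
    prompt_length: int  # assumed to count prompt tokens
    finish_reason: Literal["stop", "length"]  # assumed value set


def total_tokens(response: ResponseSketch) -> int:
    # Prompt and completion tokens for a single response.
    return response.prompt_length + response.response_length


example = ResponseSketch(
    response_text="Hi there!",
    response_length=4,
    prompt_length=12,
    finish_reason="stop",
)
print(total_tokens(example))
```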
Usage Examples
from llamafactory.chat.hf_engine import HuggingfaceEngine

# Initialize the engine (the four argument objects are assumed to be parsed already)
engine = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)

# Non-streaming chat; the public methods are coroutines, so call them from async code
responses = await engine.chat(
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    system="You are a helpful assistant.",
)
print(responses[0].response_text)

# Streaming chat
async for token in engine.stream_chat(
    messages=[{"role": "user", "content": "Tell me a story."}],
):
    print(token, end="", flush=True)

# Reward scoring (requires reward model)
scores = await engine.get_scores(
    batch_input=["Good response", "Bad response"],
)
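The snippets above use `await` and `async for`, so in a plain script they need to run inside a coroutine. The following sketch shows one way to drive a streaming call with `asyncio.run` and collect the chunks into a string. `StubEngine` is a hypothetical stand-in that only mimics the stream_chat method name; it does not load a model.

```python
import asyncio
from collections.abc import AsyncGenerator


class StubEngine:
    """Hypothetical stand-in exposing a stream_chat method like the engine's."""

    async def stream_chat(
        self, messages: list[dict[str, str]]
    ) -> AsyncGenerator[str, None]:
        for chunk in ["Once ", "upon ", "a time."]:
            await asyncio.sleep(0)  # yield control, as a real stream would
            yield chunk


async def collect(engine: StubEngine, messages: list[dict[str, str]]) -> str:
    # Consume the async generator and join the chunks in arrival order.
    chunks: list[str] = []
    async for chunk in engine.stream_chat(messages=messages):
        chunks.append(chunk)
    return "".join(chunks)


story = asyncio.run(
    collect(StubEngine(), [{"role": "user", "content": "Tell me a story."}])
)
print(story)
```

Swapping `StubEngine()` for a real engine instance should leave `collect` unchanged, assuming the real stream_chat accepts `messages` as a keyword argument as shown in the examples above.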
Related Pages
- Hiyouga_LLaMA_Factory_Chat_Template - Template system used for prompt encoding
- Hiyouga_LLaMA_Factory_Multimodal_Plugin - Multimodal processing plugins used by _process_args
- Hiyouga_LLaMA_Factory_Constants - EngineName.HF constant