

Implementation:Hiyouga LLaMA Factory HfChatEngine

From Leeroopedia
Revision as of 15:06, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Hiyouga_LLaMA_Factory_HfChatEngine.md)


Knowledge Sources
Domains: Inference Engine, Chat
Last Updated: 2026-02-06 19:00 GMT

Overview

A concrete inference engine, built on Hugging Face Transformers and provided by LLaMA Factory, for chat generation, streaming chat, and reward scoring.

Description

The HuggingfaceEngine class extends BaseEngine to provide the default inference backend for LLaMA Factory. It loads a model and tokenizer via the framework's model loading utilities, applies the appropriate chat template, and provides three core capabilities: batch chat generation, streaming chat generation, and reward model scoring. The engine accepts multimodal input (images, videos, and audio) through the template's multimodal plugin system.

Internally, _process_args handles all input preparation including multimodal placeholder injection, template-based prompt encoding, token ID expansion for media tokens, and generation configuration assembly. The synchronous _chat and _stream_chat methods run under torch.inference_mode(), while the async public methods (chat, stream_chat, get_scores) delegate to threads via asyncio.to_thread with a semaphore for concurrency control.
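The delegation pattern described above can be sketched with the standard library alone. This is a minimal illustration, not the engine's actual code: SketchEngine, its placeholder _chat and _stream_chat bodies, and the max_concurrent parameter are all assumptions made for the example. Only asyncio.Semaphore and asyncio.to_thread are taken from the description.

```python
import asyncio
from collections.abc import AsyncGenerator, Iterator


class SketchEngine:
    """Stand-in engine (hypothetical); not LLaMA Factory's HuggingfaceEngine."""

    def __init__(self, max_concurrent: int = 1) -> None:
        # Caps how many blocking calls may occupy worker threads at once.
        self.semaphore = asyncio.Semaphore(max_concurrent)

    def _chat(self, prompt: str) -> str:
        # Placeholder for the blocking, synchronous generation call.
        return f"echo: {prompt}"

    def _stream_chat(self, prompt: str) -> Iterator[str]:
        # Placeholder for a blocking iterator that yields text pieces.
        yield from prompt.split()

    async def chat(self, prompt: str) -> str:
        # Run the blocking call in a worker thread so the event loop stays free.
        async with self.semaphore:
            return await asyncio.to_thread(self._chat, prompt)

    async def stream_chat(self, prompt: str) -> AsyncGenerator[str, None]:
        async with self.semaphore:
            pieces = self._stream_chat(prompt)
            done = object()  # sentinel marking the end of the blocking iterator
            while True:
                # Each next() call may block, so it also runs in a worker thread.
                piece = await asyncio.to_thread(next, pieces, done)
                if piece is done:
                    break
                yield piece


async def demo() -> tuple[str, list[str]]:
    engine = SketchEngine(max_concurrent=1)
    reply = await engine.chat("hello")
    streamed = [piece async for piece in engine.stream_chat("tell me a story")]
    return reply, streamed


reply, streamed = asyncio.run(demo())
```

Holding the semaphore for the whole duration of a stream means a long-running stream counts as one in-flight request until it finishes, which is one plausible way to bound concurrent generations on a single model.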

Usage

This engine is instantiated when the HuggingFace backend is selected (EngineName.HF), which is also the default when no specific engine is configured. The API server, CLI chat interface, and web UI all use it for inference.
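Backend selection of this kind is commonly written as a dispatch on an enum with the default pointing at the HuggingFace engine. The sketch below is an assumption-laden illustration: the enum values, the OTHER member, the stub classes, and select_engine are all hypothetical and do not reproduce LLaMA Factory's own selection code.

```python
from enum import Enum


class EngineName(str, Enum):
    # Stand-in enum: only the HF member is named in the text; its string value
    # and the OTHER member are hypothetical.
    HF = "huggingface"
    OTHER = "other-backend"


class HfEngineStub:
    """Placeholder for the HuggingFace-backed engine."""


class OtherEngineStub:
    """Placeholder for a hypothetical alternative backend."""


def select_engine(backend: EngineName = EngineName.HF):
    # The default argument models "no specific engine configured".
    if backend == EngineName.HF:
        return HfEngineStub()
    if backend == EngineName.OTHER:
        return OtherEngineStub()
    raise NotImplementedError(f"Unknown backend: {backend}")


default_engine = select_engine()
```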

Code Reference

Source Location

Signature

class HuggingfaceEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    @staticmethod
    def _process_args(
        model: "PreTrainedModel",
        tokenizer: "PreTrainedTokenizer",
        processor: Optional["ProcessorMixin"],
        template: "Template",
        generating_args: dict[str, Any],
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[list["ImageInput"]] = None,
        videos: Optional[list["VideoInput"]] = None,
        audios: Optional[list["AudioInput"]] = None,
        input_kwargs: Optional[dict[str, Any]] = {},
    ) -> tuple[dict[str, Any], int]: ...

    async def chat(self, messages, system, tools, images, videos, audios, **input_kwargs) -> list["Response"]: ...
    async def stream_chat(self, messages, system, tools, images, videos, audios, **input_kwargs) -> AsyncGenerator[str, None]: ...
    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...

Import

from llamafactory.chat.hf_engine import HuggingfaceEngine

I/O Contract

Inputs

| Name | Type | Required | Description |
| --- | --- | --- | --- |
| model_args | ModelArguments | Yes | Model path, quantization, and loading configuration |
| data_args | DataArguments | Yes | Template name and data processing configuration |
| finetuning_args | FinetuningArguments | Yes | Training stage (determines SFT vs reward model mode) |
| generating_args | GeneratingArguments | Yes | Default generation parameters (temperature, top_p, max_new_tokens) |
| messages | list[dict[str, str]] | Yes | Chat messages with role and content fields |
| system | str | No | System prompt override |
| tools | str | No | Tool descriptions for function calling |
| images | list[ImageInput] | No | Image inputs for multimodal models |
| videos | list[VideoInput] | No | Video inputs for multimodal models |
| audios | list[AudioInput] | No | Audio inputs for multimodal models |
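When media inputs accompany messages whose text contains no matching placeholder, input preparation has to insert one so the template knows where the media belongs. The following is a minimal sketch of that idea for images only. The placeholder string "<image>", the helper name inject_image_placeholders, and the choice to prepend to the first message are assumptions for illustration, not a statement of what _process_args does.

```python
import copy

IMAGE_PLACEHOLDER = "<image>"  # assumed placeholder string


def inject_image_placeholders(
    messages: list[dict[str, str]], num_images: int
) -> list[dict[str, str]]:
    """Return a copy of messages with one placeholder per image prepended to the
    first message, unless some message already contains a placeholder."""
    messages = copy.deepcopy(messages)  # leave the caller's list untouched
    already_marked = any(IMAGE_PLACEHOLDER in m["content"] for m in messages)
    if num_images > 0 and not already_marked:
        messages[0]["content"] = IMAGE_PLACEHOLDER * num_images + messages[0]["content"]
    return messages


plain = [{"role": "user", "content": "What is in these pictures?"}]
marked = [{"role": "user", "content": "Compare <image> with the earlier one."}]

injected = inject_image_placeholders(plain, num_images=2)
untouched = inject_image_placeholders(marked, num_images=1)
```

Here injected[0]["content"] becomes "<image><image>What is in these pictures?", while the already-marked message is returned unchanged.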

Outputs

| Name | Type | Description |
| --- | --- | --- |
| chat result | list[Response] | List of Response objects with response_text, response_length, prompt_length, and finish_reason |
| stream result | AsyncGenerator[str, None] | Yields token strings as they are generated |
| scores | list[float] | Reward model scores for input sequences |
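To make the shape of a chat result concrete, the sketch below defines a stand-in Response with the four fields named in the table. The field types, the reading of the two length fields as token counts, and the summarize helper are assumptions for illustration. The real class is defined by LLaMA Factory and may differ.

```python
from dataclasses import dataclass


@dataclass
class Response:
    # Field names follow the Outputs table; the types are assumptions.
    response_text: str
    response_length: int  # assumed to be a token count
    prompt_length: int  # assumed to be a token count
    finish_reason: str


def summarize(responses: list[Response]) -> str:
    """Format the first response together with its combined length."""
    first = responses[0]
    total = first.prompt_length + first.response_length
    return f"{first.response_text!r} ({total} tokens, finish_reason={first.finish_reason})"


example = [Response("Hi there!", response_length=3, prompt_length=12, finish_reason="stop")]
summary = summarize(example)
```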

Usage Examples

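import asyncio

from llamafactory.chat.hf_engine import HuggingfaceEngine


async def main() -> None:
    # model_args, data_args, finetuning_args, and generating_args are assumed to be
    # already-constructed argument objects (see Inputs).
    engine = HuggingfaceEngine(model_args, data_args, finetuning_args, generating_args)

    # Non-streaming chat: awaits the complete list of Response objects
    responses = await engine.chat(
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        system="You are a helpful assistant.",
    )
    print(responses[0].response_text)

    # Streaming chat: prints each piece of text as it is generated
    async for token in engine.stream_chat(
        messages=[{"role": "user", "content": "Tell me a story."}],
    ):
        print(token, end="", flush=True)

    # Reward scoring (requires an engine loaded in reward model mode)
    scores = await engine.get_scores(
        batch_input=["Good response", "Bad response"],
    )
    print(scores)


# The public methods are coroutines, so they must run inside an event loop.
asyncio.run(main())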
