Implementation:Hiyouga LLaMA Factory V1 Inference Engine

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Machine Learning, Text Generation, Async Programming
Last Updated	2026-02-06 19:00 GMT

Overview

BaseEngine and HuggingFaceEngine define the abstract inference engine interface and its HuggingFace implementation for asynchronous streaming text generation.

Description

The inference engine module provides BaseEngine, an abstract base class (ABC) that defines the contract for inference backends: an abstract constructor plus abstract generate and batch_infer methods. HuggingFaceEngine implements this interface with the HuggingFace Transformers library. For streaming generation, it runs model.generate() in a background thread and reads the output through AsyncTextIteratorStreamer, yielding tokens asynchronously as they are produced. Concurrency is limited by an asyncio.Semaphore whose size is configurable through the MAX_CONCURRENT environment variable. The generate method is decorated with torch.inference_mode(), which disables gradient tracking and so reduces overhead during generation.
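The streaming pattern described above (a blocking producer in a background thread, an async consumer, and a semaphore that bounds concurrency) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's code: fake_generate, stream_tokens, and collect are hypothetical names, a plain asyncio.Queue stands in for AsyncTextIteratorStreamer, and the fallback value of 1 for MAX_CONCURRENT is chosen only for the example.

```python
import asyncio
import os
import threading
from collections.abc import AsyncGenerator

_SENTINEL = object()  # marks the end of the stream


def fake_generate(prompt: str, emit) -> None:
    """Hypothetical stand-in for a blocking model.generate() call."""
    for piece in prompt.split():
        emit(piece + " ")
    emit(_SENTINEL)


async def stream_tokens(
    prompt: str, semaphore: asyncio.Semaphore
) -> AsyncGenerator[str, None]:
    # The semaphore bounds how many generations run at the same time.
    async with semaphore:
        loop = asyncio.get_running_loop()
        queue: asyncio.Queue = asyncio.Queue()

        def emit(item) -> None:
            # Called from the worker thread; hand the item to the event loop safely.
            loop.call_soon_threadsafe(queue.put_nowait, item)

        worker = threading.Thread(
            target=fake_generate, args=(prompt, emit), daemon=True
        )
        worker.start()

        while True:
            item = await queue.get()
            if item is _SENTINEL:
                break
            yield item


async def collect(prompt: str) -> str:
    # Assumption: a default of 1 is used here only for illustration.
    limit = int(os.getenv("MAX_CONCURRENT", "1"))
    semaphore = asyncio.Semaphore(limit)
    return "".join([piece async for piece in stream_tokens(prompt, semaphore)])
```

Running asyncio.run(collect("hello async world")) drains the stream into one string. The event loop stays free while the worker thread blocks, which is the reason for splitting the producer and the consumer in this way.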

Usage

HuggingFaceEngine is not typically instantiated directly. Instead, it is created by BaseSampler when the sample backend is configured as SampleBackend.HF, and it is used indirectly through the sampler's generate and batch_infer methods. Note that batch_infer is not yet implemented in HuggingFaceEngine. To add a new inference backend, subclass BaseEngine and implement its abstract methods.
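Adding a backend by subclassing can be sketched as follows. To keep the example self-contained, EngineInterface is a hypothetical stand-in that mirrors only the two abstract methods of BaseEngine (the real class also declares an abstract constructor), and EchoEngine is a toy backend invented for illustration that streams back the text parts of the input messages.

```python
import asyncio
from abc import ABC, abstractmethod
from collections.abc import AsyncGenerator
from typing import Optional


class EngineInterface(ABC):
    """Hypothetical stand-in mirroring BaseEngine's generate/batch_infer contract."""

    @abstractmethod
    async def generate(
        self, messages: list, tools: Optional[str] = None
    ) -> AsyncGenerator[str, None]: ...

    @abstractmethod
    async def batch_infer(self, dataset: list) -> list: ...


class EchoEngine(EngineInterface):
    """Toy backend: yields each text part of the conversation as one chunk."""

    async def generate(self, messages, tools=None):
        for message in messages:
            for part in message["content"]:
                if part["type"] == "text":
                    yield part["value"]

    async def batch_infer(self, dataset):
        # Mirrors the documented HuggingFaceEngine behaviour for this method.
        raise NotImplementedError


async def demo() -> list:
    messages = [{"role": "user", "content": [{"type": "text", "value": "Hello!"}]}]
    return [chunk async for chunk in EchoEngine().generate(messages)]
```

Because both methods are marked abstract, instantiating EngineInterface directly raises TypeError, while a subclass that overrides both can be created and consumed with async for.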

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/v1/core/utils/inference_engine.py
Lines: 1-121

Signature

class BaseEngine(ABC):
    @abstractmethod
    def __init__(
        self,
        args: SampleArguments,
        model_args: ModelArguments,
        model: HFModel,
        renderer: Renderer,
    ) -> None: ...

    @abstractmethod
    async def generate(
        self, messages: list[Message], tools: str | None = None
    ) -> AsyncGenerator[str, None]: ...

    @abstractmethod
    async def batch_infer(self, dataset: TorchDataset) -> list[Sample]: ...


class HuggingFaceEngine(BaseEngine):
    def __init__(
        self,
        args: SampleArguments,
        model_args: ModelArguments,
        model: HFModel,
        renderer: Renderer,
    ) -> None: ...

    @torch.inference_mode()
    async def generate(
        self, messages: list[Message], tools: str | None = None
    ) -> AsyncGenerator[str, None]: ...

    async def batch_infer(self, dataset: TorchDataset) -> list[Sample]: ...

Import

from llamafactory.v1.core.utils.inference_engine import BaseEngine, HuggingFaceEngine

I/O Contract

Inputs

Name	Type	Required	Description
args	SampleArguments	Yes	Sample configuration including max_new_tokens for generation length control.
model_args	ModelArguments	Yes	Model configuration arguments.
model	HFModel	Yes	The HuggingFace model instance for generation.
renderer	Renderer	Yes	The renderer for converting messages to model inputs via render_messages.
messages (generate)	list[Message]	Yes	List of conversation messages to generate a response for.
tools (generate)	str or None	No	Optional tools string for tool-augmented generation.
dataset (batch_infer)	TorchDataset	Yes	A dataset for batch inference (not yet implemented in HuggingFaceEngine).

Outputs

Name	Type	Description
generate return	AsyncGenerator[str, None]	Asynchronous stream of generated token strings, yielded one at a time via AsyncTextIteratorStreamer.
batch_infer return	list[Sample]	List of inferred samples (not yet implemented; raises NotImplementedError).

Usage Examples

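import asyncio

from llamafactory.v1.core.utils.inference_engine import HuggingFaceEngine


async def main() -> None:
    # sample_args, model_args, model, and renderer are assumed to be built elsewhere.
    # The engine is typically created via BaseSampler, but it can be used directly:
    engine = HuggingFaceEngine(
        args=sample_args,
        model_args=model_args,
        model=model,
        renderer=renderer,
    )

    # Streaming generation: `async for` must run inside a coroutine.
    messages = [
        {"role": "user", "content": [{"type": "text", "value": "Hello!"}]}
    ]
    async for token in engine.generate(messages=messages):
        print(token, end="", flush=True)


asyncio.run(main())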

Related Pages

Hiyouga_LLaMA_Factory_V1_Base_Sampler - The sampler that creates and delegates to inference engines.
Hiyouga_LLaMA_Factory_V1_Rendering - The Renderer used for converting messages to model inputs.
Hiyouga_LLaMA_Factory_V1_Model_Engine - Provides the model passed to the engine.
