Implementation:Hiyouga LLaMA Factory V1 Inference Engine

From Leeroopedia


Knowledge Sources
Domains: Machine Learning, Text Generation, Async Programming
Last Updated: 2026-02-06 19:00 GMT

Overview

BaseEngine and HuggingFaceEngine define the abstract inference engine interface and its HuggingFace implementation for asynchronous streaming text generation.

Description

The inference engine module provides BaseEngine, an abstract base class (ABC) that defines the contract for inference backends through two abstract methods, generate and batch_infer. HuggingFaceEngine implements this interface with the HuggingFace Transformers library. For streaming generation, it runs model.generate() in a background thread and reads the output through an AsyncTextIteratorStreamer, yielding text asynchronously as it is produced. Concurrency is limited by an asyncio.Semaphore whose size is configurable through the MAX_CONCURRENT environment variable. The generate method is decorated with torch.inference_mode(), which disables autograd tracking during generation and so avoids its overhead.
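The thread-plus-semaphore pattern described above can be sketched without loading a model. This is a minimal illustration, not the engine's actual code: fake_generate is a hypothetical stand-in for the blocking model.generate() call, a plain asyncio.Queue plays the role of AsyncTextIteratorStreamer, and the default of 1 for MAX_CONCURRENT is an assumption (the source only names the variable).

```python
import asyncio
import os
import threading

# Assumption: the default value 1 is illustrative only.
MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT", "1"))
_DONE = object()  # sentinel marking the end of a stream


def fake_generate(prompt, emit):
    """Hypothetical stand-in for a blocking model.generate() call."""
    for word in prompt.split():
        emit(word + " ")


async def stream(prompt, semaphore):
    loop = asyncio.get_running_loop()
    queue = asyncio.Queue()

    def emit(chunk):
        # Runs in the worker thread; hand the chunk to the event loop safely.
        loop.call_soon_threadsafe(queue.put_nowait, chunk)

    def worker():
        try:
            fake_generate(prompt, emit)
        finally:
            emit(_DONE)  # always signal completion, even on error

    async with semaphore:  # cap the number of generations running at once
        threading.Thread(target=worker, daemon=True).start()
        while (chunk := await queue.get()) is not _DONE:
            yield chunk


async def collect(prompt):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    return "".join([chunk async for chunk in stream(prompt, semaphore)])


result = asyncio.run(collect("hello streaming world"))
print(result)
```

Running the blocking call in a thread keeps the event loop free to serve other requests, while the semaphore bounds how many generations compete for the model at the same time.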

Usage

HuggingFaceEngine is not typically instantiated directly. Instead, it is created by BaseSampler when the sample backend is configured as SampleBackend.HF. Use it indirectly through the sampler's generate and batch_infer methods. To add a new inference backend, subclass BaseEngine and implement the abstract methods.
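As a rough sketch of what adding a backend involves, the example below subclasses a simplified stand-in for BaseEngine. The stand-in omits the real constructor arguments (SampleArguments, ModelArguments, HFModel, Renderer) so that it runs without LLaMA Factory installed, and EchoEngine is a hypothetical backend invented for illustration.

```python
import asyncio
from abc import ABC, abstractmethod


class BaseEngine(ABC):
    """Simplified stand-in for the real BaseEngine (constructor omitted)."""

    @abstractmethod
    async def generate(self, messages, tools=None): ...

    @abstractmethod
    async def batch_infer(self, dataset): ...


class EchoEngine(BaseEngine):
    """Hypothetical backend that streams the last message back word by word."""

    async def generate(self, messages, tools=None):
        text = messages[-1]["content"][0]["value"]
        for word in text.split():
            await asyncio.sleep(0)  # give other tasks a chance to run
            yield word

    async def batch_infer(self, dataset):
        raise NotImplementedError("batch inference is not part of this sketch")


async def run():
    engine = EchoEngine()
    messages = [{"role": "user", "content": [{"type": "text", "value": "Hello there"}]}]
    return [chunk async for chunk in engine.generate(messages)]


chunks = asyncio.run(run())
print(chunks)
```

Because both methods are marked abstract, Python refuses to instantiate a subclass that leaves either one undefined, which surfaces an incomplete backend at construction time rather than at the first request.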

Code Reference

Source Location

Signature

class BaseEngine(ABC):
    @abstractmethod
    def __init__(
        self,
        args: SampleArguments,
        model_args: ModelArguments,
        model: HFModel,
        renderer: Renderer,
    ) -> None: ...

    @abstractmethod
    async def generate(
        self, messages: list[Message], tools: str | None = None
    ) -> AsyncGenerator[str, None]: ...

    @abstractmethod
    async def batch_infer(self, dataset: TorchDataset) -> list[Sample]: ...


class HuggingFaceEngine(BaseEngine):
    def __init__(
        self,
        args: SampleArguments,
        model_args: ModelArguments,
        model: HFModel,
        renderer: Renderer,
    ) -> None: ...

    @torch.inference_mode()
    async def generate(
        self, messages: list[Message], tools: str | None = None
    ) -> AsyncGenerator[str, None]: ...

    async def batch_infer(self, dataset: TorchDataset) -> list[Sample]: ...

Import

from llamafactory.v1.core.utils.inference_engine import BaseEngine, HuggingFaceEngine

I/O Contract

Inputs

Name | Type | Required | Description
args | SampleArguments | Yes | Sample configuration including max_new_tokens for generation length control.
model_args | ModelArguments | Yes | Model configuration arguments.
model | HFModel | Yes | The HuggingFace model instance for generation.
renderer | Renderer | Yes | The renderer for converting messages to model inputs via render_messages.
messages (generate) | list[Message] | Yes | List of conversation messages to generate a response for.
tools (generate) | str or None | No | Optional tools string for tool-augmented generation.
dataset (batch_infer) | TorchDataset | Yes | A dataset for batch inference (not yet implemented in HuggingFaceEngine).

Outputs

Name | Type | Description
generate return | AsyncGenerator[str, None] | Asynchronous stream of generated token strings, yielded one at a time via AsyncTextIteratorStreamer.
batch_infer return | list[Sample] | List of inferred samples (not yet implemented; raises NotImplementedError).
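Since generate returns a stream rather than a string and batch_infer currently raises NotImplementedError, callers that need complete responses for several conversations have to assemble them. The sketch below shows one possible workaround, not the project's own approach: StubEngine, collect, infer_all, and user are hypothetical names, and the stub only mimics the shape of the two methods.

```python
import asyncio


class StubEngine:
    """Hypothetical stand-in exposing the same generate/batch_infer shape."""

    async def generate(self, messages, tools=None):
        for chunk in ("echo: ", messages[-1]["content"][0]["value"]):
            yield chunk

    async def batch_infer(self, dataset):
        raise NotImplementedError


async def collect(engine, messages):
    """Join every streamed chunk into one response string."""
    return "".join([chunk async for chunk in engine.generate(messages)])


async def infer_all(engine, conversations):
    try:
        return await engine.batch_infer(conversations)
    except NotImplementedError:
        # Fallback: run the streaming path once per conversation.
        tasks = (collect(engine, messages) for messages in conversations)
        return list(await asyncio.gather(*tasks))


def user(text):
    """Build a single-turn conversation in the message shape used on this page."""
    return [{"role": "user", "content": [{"type": "text", "value": text}]}]


responses = asyncio.run(infer_all(StubEngine(), [user("a"), user("b")]))
print(responses)
```

asyncio.gather returns results in the order the conversations were submitted, so the output list lines up with the input list even when the streams finish at different times.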

Usage Examples

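import asyncio

from llamafactory.v1.core.utils.inference_engine import HuggingFaceEngine


async def main() -> None:
    # Typically created via BaseSampler, but can be used directly.
    # sample_args, model_args, model, and renderer are assumed to be built elsewhere.
    engine = HuggingFaceEngine(
        args=sample_args,
        model_args=model_args,
        model=model,
        renderer=renderer,
    )

    # Streaming generation: print each piece of text as soon as it arrives.
    messages = [{"role": "user", "content": [{"type": "text", "value": "Hello!"}]}]
    async for token in engine.generate(messages=messages):
        print(token, end="", flush=True)


# "async for" is only valid inside a coroutine, so drive it with an event loop.
asyncio.run(main())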
