Implementation:Hiyouga LLaMA Factory V1 Inference Engine
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Text Generation, Async Programming |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
BaseEngine and HuggingFaceEngine define the abstract inference engine interface and its HuggingFace implementation for asynchronous streaming text generation.
Description
The inference engine module provides BaseEngine, an abstract base class (ABC) that defines the contract for inference backends through the abstract methods generate and batch_infer. HuggingFaceEngine implements this interface with the HuggingFace Transformers library. For streaming generation, it runs model.generate() in a background thread and attaches an AsyncTextIteratorStreamer, so generated text is yielded asynchronously as it is produced. Concurrency is limited by an asyncio.Semaphore whose size is configurable through the MAX_CONCURRENT environment variable. Generation runs under torch.inference_mode(), which disables gradient tracking to reduce inference overhead. batch_infer is not yet implemented in HuggingFaceEngine and raises NotImplementedError.
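The thread-plus-streamer pattern described above can be illustrated without loading a model. The following is a minimal sketch, not the module's actual code: fake_generate, stream_text, and demo are hypothetical names, an asyncio.Queue stands in for AsyncTextIteratorStreamer, and the default of "1" for MAX_CONCURRENT is a placeholder assumption.

```python
import asyncio
import os
import threading
from collections.abc import AsyncGenerator, Callable, Iterable

# Assumption: "1" is a placeholder default; the page only states that the
# limit is configurable through the MAX_CONCURRENT environment variable.
MAX_CONCURRENT = int(os.getenv("MAX_CONCURRENT", "1"))


def fake_generate(prompt: str) -> Iterable[str]:
    """Hypothetical stand-in for a blocking model.generate() call."""
    for word in prompt.split():
        yield word + " "


async def stream_text(
    prompt: str,
    semaphore: asyncio.Semaphore,
    producer: Callable[[str], Iterable[str]] = fake_generate,
) -> AsyncGenerator[str, None]:
    """Run a blocking producer in a thread and yield its output asynchronously."""
    loop = asyncio.get_running_loop()
    queue: asyncio.Queue = asyncio.Queue()
    done = object()  # sentinel marking the end of the stream

    def worker() -> None:
        try:
            for piece in producer(prompt):
                # Hand each piece to the event loop from the worker thread.
                loop.call_soon_threadsafe(queue.put_nowait, piece)
        finally:
            # Always signal completion so the consumer does not wait forever.
            loop.call_soon_threadsafe(queue.put_nowait, done)

    # The semaphore caps how many generations run at the same time.
    async with semaphore:
        threading.Thread(target=worker, daemon=True).start()
        while True:
            piece = await queue.get()
            if piece is done:
                break
            yield piece


async def demo(prompt: str, limit: int = MAX_CONCURRENT) -> list[str]:
    """Collect the whole stream into a list (for illustration only)."""
    semaphore = asyncio.Semaphore(limit)
    return [piece async for piece in stream_text(prompt, semaphore)]
```

Running `asyncio.run(demo("a b c", limit=2))` collects the streamed pieces in order. In this sketch an exception raised by the producer is not forwarded to the consumer; a production engine would need to handle that case.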
Usage
HuggingFaceEngine is not typically instantiated directly. Instead, it is created by BaseSampler when the sample backend is configured as SampleBackend.HF. Use it indirectly through the sampler's generate and batch_infer methods. To add a new inference backend, subclass BaseEngine and implement the abstract methods.
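As an illustration of adding a backend, the sketch below subclasses a local stand-in for BaseEngine so that it runs without LLaMA Factory installed. BaseEngineSketch, EchoEngine, and run_echo are hypothetical names, the constructor arguments are omitted for brevity, and the message layout is taken from the usage example on this page.

```python
import asyncio
from abc import ABC, abstractmethod
from collections.abc import AsyncGenerator
from typing import Any, Optional


class BaseEngineSketch(ABC):
    """Local stand-in mirroring the two abstract methods documented for BaseEngine."""

    @abstractmethod
    async def generate(
        self, messages: list[dict[str, Any]], tools: Optional[str] = None
    ) -> AsyncGenerator[str, None]: ...

    @abstractmethod
    async def batch_infer(self, dataset: Any) -> list[Any]: ...


class EchoEngine(BaseEngineSketch):
    """Toy backend that streams the last message's text back word by word."""

    async def generate(self, messages, tools=None):
        for part in messages[-1]["content"]:
            if part["type"] == "text":
                for word in part["value"].split():
                    await asyncio.sleep(0)  # yield control, as a real backend would
                    yield word

    async def batch_infer(self, dataset):
        # Mirrors the documented state of HuggingFaceEngine.batch_infer.
        raise NotImplementedError


async def run_echo(text: str) -> list[str]:
    """Drive EchoEngine with a single user message and collect its output."""
    messages = [{"role": "user", "content": [{"type": "text", "value": text}]}]
    return [word async for word in EchoEngine().generate(messages)]
```

Because both methods are abstract, instantiating the base class directly raises TypeError, which catches incomplete backends early.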
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/v1/core/utils/inference_engine.py
- Lines: 1-121
Signature
class BaseEngine(ABC):
    @abstractmethod
    def __init__(
        self,
        args: SampleArguments,
        model_args: ModelArguments,
        model: HFModel,
        renderer: Renderer,
    ) -> None: ...

    @abstractmethod
    async def generate(
        self, messages: list[Message], tools: str | None = None
    ) -> AsyncGenerator[str, None]: ...

    @abstractmethod
    async def batch_infer(self, dataset: TorchDataset) -> list[Sample]: ...


class HuggingFaceEngine(BaseEngine):
    def __init__(
        self,
        args: SampleArguments,
        model_args: ModelArguments,
        model: HFModel,
        renderer: Renderer,
    ) -> None: ...

    @torch.inference_mode()
    async def generate(
        self, messages: list[Message], tools: str | None = None
    ) -> AsyncGenerator[str, None]: ...

    async def batch_infer(self, dataset: TorchDataset) -> list[Sample]: ...
Import
from llamafactory.v1.core.utils.inference_engine import BaseEngine, HuggingFaceEngine
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| args | SampleArguments | Yes | Sample configuration including max_new_tokens for generation length control. |
| model_args | ModelArguments | Yes | Model configuration arguments. |
| model | HFModel | Yes | The HuggingFace model instance for generation. |
| renderer | Renderer | Yes | The renderer for converting messages to model inputs via render_messages. |
| messages (generate) | list[Message] | Yes | List of conversation messages to generate a response for. |
| tools (generate) | str or None | No | Optional tools string for tool-augmented generation. |
| dataset (batch_infer) | TorchDataset | Yes | A dataset for batch inference (not yet implemented in HuggingFaceEngine). |
Outputs
| Name | Type | Description |
|---|---|---|
| generate return | AsyncGenerator[str, None] | Asynchronous stream of generated token strings, yielded one at a time via AsyncTextIteratorStreamer. |
| batch_infer return | list[Sample] | List of inferred samples (not yet implemented; raises NotImplementedError). |
Usage Examples
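import asyncio

from llamafactory.v1.core.utils.inference_engine import HuggingFaceEngine


async def main() -> None:
    # Typically created via BaseSampler, but can be used directly.
    # sample_args, model_args, model, and renderer are assumed to be built elsewhere.
    engine = HuggingFaceEngine(
        args=sample_args,
        model_args=model_args,
        model=model,
        renderer=renderer,
    )
    messages = [{"role": "user", "content": [{"type": "text", "value": "Hello!"}]}]
    # Streaming generation: print each piece as soon as it arrives.
    async for token in engine.generate(messages=messages):
        print(token, end="", flush=True)


# "async for" is only valid inside a coroutine, so run it on an event loop.
asyncio.run(main())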
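When the complete response is needed rather than incremental output, the stream can be gathered into a single string. The helper below is a small sketch; collect_response and fake_stream are hypothetical names, and fake_stream merely stands in for the async generator returned by engine.generate.

```python
import asyncio
from collections.abc import AsyncIterator


async def collect_response(stream: AsyncIterator[str]) -> str:
    """Join every piece yielded by an async text stream into one string."""
    pieces = []
    async for piece in stream:
        pieces.append(piece)
    return "".join(pieces)


async def fake_stream() -> AsyncIterator[str]:
    """Hypothetical stand-in for engine.generate(messages=...)."""
    for piece in ("Hello", ", ", "world!"):
        yield piece
```

With a real engine, the call would take the form `await collect_response(engine.generate(messages=messages))` inside a coroutine.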
Related Pages
- Hiyouga_LLaMA_Factory_V1_Base_Sampler - The sampler that creates and delegates to inference engines.
- Hiyouga_LLaMA_Factory_V1_Rendering - The Renderer used for converting messages to model inputs.
- Hiyouga_LLaMA_Factory_V1_Model_Engine - Provides the model passed to the engine.