Implementation:Hiyouga LLaMA Factory Chat Model

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Inference, API
Last Updated	2026-02-06 19:00 GMT

Overview

ChatModel is the primary user-facing inference class. It provides a unified synchronous and asynchronous interface over multiple inference backends.

Description

The ChatModel class acts as a facade over the engine layer. During initialization, it parses inference arguments via get_infer_args, selects the backend engine (HuggingFace, vLLM, SGLang, or KTransformers) according to the infer_backend setting, and starts an asyncio event loop in a background thread. It exposes three pairs of sync/async methods: chat/achat for complete (non-streaming) generation, stream_chat/astream_chat for token-by-token streaming, and get_scores/aget_scores for reward model scoring. Each synchronous method submits its async counterpart to the background loop with asyncio.run_coroutine_threadsafe and blocks until the result is available. The module also provides run_chat(), an interactive CLI chat loop with history management.
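
The sync-to-async bridge described above can be sketched with the standard library alone. This is a minimal illustration of the pattern, not the project's code: FakeEngine, SyncFacade, and the _DONE sentinel are hypothetical names, and the engine simply echoes its input instead of running a model.

```python
import asyncio
from threading import Thread
from typing import AsyncGenerator, Generator

_DONE = object()  # hypothetical sentinel marking the end of a stream


class FakeEngine:
    """Hypothetical stand-in for an async inference engine."""

    async def chat(self, prompt: str) -> list[str]:
        await asyncio.sleep(0)  # pretend to do async work
        return [f"echo: {prompt}"]

    async def stream_chat(self, prompt: str) -> AsyncGenerator[str, None]:
        for token in prompt.split():
            await asyncio.sleep(0)
            yield token


def _run_loop(loop: asyncio.AbstractEventLoop) -> None:
    asyncio.set_event_loop(loop)
    loop.run_forever()


async def _next_or_done(agen: AsyncGenerator[str, None]):
    # Fetch one item, or return the sentinel when the stream is exhausted.
    try:
        return await agen.__anext__()
    except StopAsyncIteration:
        return _DONE


class SyncFacade:
    """Exposes blocking methods backed by a loop in a background thread."""

    def __init__(self) -> None:
        self.engine = FakeEngine()
        self._loop = asyncio.new_event_loop()
        self._thread = Thread(target=_run_loop, args=(self._loop,), daemon=True)
        self._thread.start()

    def chat(self, prompt: str) -> list[str]:
        # Submit the coroutine to the background loop and block on its result.
        future = asyncio.run_coroutine_threadsafe(self.engine.chat(prompt), self._loop)
        return future.result()

    def stream_chat(self, prompt: str) -> Generator[str, None, None]:
        # Pull items from the async generator one at a time.
        agen = self.engine.stream_chat(prompt)
        while True:
            future = asyncio.run_coroutine_threadsafe(_next_or_done(agen), self._loop)
            item = future.result()
            if item is _DONE:
                break
            yield item

    def close(self) -> None:
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


facade = SyncFacade()
print(facade.chat("hello"))
print(list(facade.stream_chat("one two three")))
facade.close()
```

Keeping the loop in its own thread lets ordinary blocking code call the facade, while async callers can still await the engine-facing coroutines directly.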

Usage

Use ChatModel as the primary entry point for inference in both programmatic and server contexts. It is used by the API server (app.py) for HTTP-based inference and by run_chat() for interactive command-line use.
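
An interactive loop with history management, of the kind run_chat() provides, might look roughly like the sketch below. It is an illustration under stated assumptions rather than the project's implementation: chat_loop and fake_stream are hypothetical names, the "exit" and "clear" commands are assumed for the example, and input is taken from a list instead of the terminal so the sketch can run unattended. With a real model, stream_fn could be a ChatModel instance's stream_chat method.

```python
from typing import Callable, Iterable, Iterator

Message = dict[str, str]


def chat_loop(
    stream_fn: Callable[[list[Message]], Iterator[str]],
    user_inputs: Iterable[str],
) -> list[Message]:
    """Run a chat session and return the final message history."""
    messages: list[Message] = []
    for raw in user_inputs:
        query = raw.strip()
        if query == "exit":  # assumed command: end the session
            break
        if query == "clear":  # assumed command: drop the history
            messages = []
            continue

        messages.append({"role": "user", "content": query})
        response = ""
        for piece in stream_fn(messages):
            print(piece, end="", flush=True)  # show text as it arrives
            response += piece
        print()
        # Store the reply so the next turn sees the full conversation.
        messages.append({"role": "assistant", "content": response})
    return messages


def fake_stream(messages: list[Message]) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model call."""
    yield "re: "
    yield messages[-1]["content"]


history = chat_loop(fake_stream, ["hello", "clear", "hi", "exit", "ignored"])
```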

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/chat/chat_model.py
Lines: 1-210

Signature

class ChatModel:
    def __init__(self, args: Optional[dict[str, Any]] = None) -> None: ...

    def chat(
        self,
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[list["ImageInput"]] = None,
        videos: Optional[list["VideoInput"]] = None,
        audios: Optional[list["AudioInput"]] = None,
        **input_kwargs,
    ) -> list["Response"]: ...

    async def achat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> list["Response"]: ...

    def stream_chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> Generator[str, None, None]: ...

    async def astream_chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> AsyncGenerator[str, None]: ...

    def get_scores(self, batch_input: list[str], **input_kwargs) -> list[float]: ...

    async def aget_scores(self, batch_input: list[str], **input_kwargs) -> list[float]: ...

def run_chat() -> None: ...

Import

from llamafactory.chat import ChatModel
from llamafactory.chat.chat_model import run_chat

I/O Contract

Inputs

Name	Type	Required	Description
args	dict[str, Any]	No	Configuration dictionary passed to get_infer_args; if None, parsed from command line
messages	list[dict[str, str]]	Yes	Chat messages with "role" and "content" keys
system	str	No	System prompt
tools	str	No	JSON-serialized tool definitions
images	list[ImageInput]	No	Image inputs for multimodal models
videos	list[VideoInput]	No	Video inputs for multimodal models
audios	list[AudioInput]	No	Audio inputs for multimodal models
batch_input	list[str]	Yes (for scoring)	Text inputs for reward model scoring
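
As a rough illustration of the input shapes in the table, the sketch below assembles messages, system, and tools arguments. The helper build_chat_inputs is hypothetical, and the tool definition schema (name, description, parameters) is an assumption chosen for the example; the only property taken from the table is that tools is passed as a JSON-serialized string.

```python
import json
from typing import Any, Optional


def build_chat_inputs(
    user_text: str,
    system: Optional[str] = None,
    tool_defs: Optional[list[dict[str, Any]]] = None,
) -> dict[str, Any]:
    """Assemble keyword arguments shaped like the inputs listed above."""
    messages = [{"role": "user", "content": user_text}]
    # Tools travel as a JSON string, not as Python objects.
    tools = json.dumps(tool_defs, ensure_ascii=False) if tool_defs else None
    return {"messages": messages, "system": system, "tools": tools}


# Hypothetical tool definition; the exact schema is an assumption.
weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

inputs = build_chat_inputs(
    "What is the weather in Paris?",
    system="You are a helpful assistant.",
    tool_defs=[weather_tool],
)
print(inputs["tools"])
```

The resulting dictionary could then be unpacked into a call such as chat_model.chat(**inputs).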

Outputs

Method	Return Type	Description
chat / achat	list[Response]	Generated responses
stream_chat	Generator[str, None, None]	Token stream (sync generator)
astream_chat	AsyncGenerator[str, None]	Token stream (async generator)
get_scores / aget_scores	list[float]	Reward model scores

Usage Examples

import asyncio

from llamafactory.chat import ChatModel

# Initialize with custom arguments
chat_model = ChatModel(args={
    "model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
    "template": "llama2",
    "infer_backend": "huggingface",
})

# Synchronous chat: returns a list of Response objects
messages = [{"role": "user", "content": "What is machine learning?"}]
responses = chat_model.chat(messages)
print(responses[0].response_text)

# Streaming chat: yields text pieces as they are generated
for token in chat_model.stream_chat(messages):
    print(token, end="", flush=True)
print()

# Async chat: `await` is only valid inside a coroutine
async def main() -> None:
    responses = await chat_model.achat(messages)
    print(responses[0].response_text)

asyncio.run(main())

Related Pages

Hiyouga_LLaMA_Factory_Base_Engine - Abstract interface implemented by all backends
Hiyouga_LLaMA_Factory_VLLM_Engine - vLLM backend engine
Hiyouga_LLaMA_Factory_SGLang_Engine - SGLang backend engine
Hiyouga_LLaMA_Factory_KT_Engine - KTransformers backend engine
Hiyouga_LLaMA_Factory_API_App - API server that uses ChatModel
