Implementation:Hiyouga LLaMA Factory VLLM Engine

Knowledge Sources	Hiyouga_LLaMA_Factory
Domains	Inference, High-Throughput Serving
Last Updated	2026-02-06 19:00 GMT

Overview

VllmEngine is the vLLM-based asynchronous inference engine in LLaMA Factory. It provides high-throughput text generation using vLLM's PagedAttention and continuous batching.

Description

The VllmEngine class initializes a vLLM AsyncLLMEngine with configurable tensor parallelism (set from the detected device count), GPU memory utilization, LoRA adapter support, and multimodal input limits. For GPTQ-quantized models it overrides the dtype to float16, and it applies the Yi-VL projector patch when a Yi-VL model is detected.

The _generate method builds SamplingParams from the default generation settings and any per-request overrides, prepares multimodal data (images, videos, audios) via the template's multimodal plugin, and returns an async iterator of RequestOutput objects. The chat method consumes the iterator and returns the completed responses, while stream_chat yields only the newly generated text (the delta) at each step. Reward model scoring is not supported: get_scores raises NotImplementedError.
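The initialization choices described above can be sketched as a plain function that assembles engine arguments. This is a minimal illustration, not the code in vllm_engine.py: build_engine_args and its parameters are hypothetical names, and the dictionary keys are assumed to correspond to vLLM engine arguments of the same name.

```python
from typing import Any, Optional


def build_engine_args(
    model_path: str,
    device_count: int,
    quantization_method: Optional[str] = None,
    gpu_util: float = 0.9,
    maxlen: int = 4096,
    enforce_eager: bool = False,
) -> dict[str, Any]:
    """Assemble engine arguments (hypothetical helper, for illustration only)."""
    # GPTQ checkpoints are forced to float16; everything else keeps "auto".
    dtype = "float16" if quantization_method == "gptq" else "auto"
    return {
        "model": model_path,
        # Use every detected device, falling back to 1 when none is reported.
        "tensor_parallel_size": device_count or 1,
        "gpu_memory_utilization": gpu_util,
        "max_model_len": maxlen,
        "enforce_eager": enforce_eager,
        "dtype": dtype,
    }


cpu_args = build_engine_args("some/model", device_count=0)
gptq_args = build_engine_args("some/gptq-model", device_count=4, quantization_method="gptq")
```

The point of the sketch is the two defaults: tensor parallelism follows the device count unless none is detected, and the dtype is overridden only for GPTQ.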

Usage

Use this engine for production inference by setting --infer_backend vllm. It is the recommended choice for high-throughput serving scenarios with support for LoRA adapters, multimodal models, and multiple return sequences.

Code Reference

Source Location

Repository: Hiyouga_LLaMA_Factory
File: src/llamafactory/chat/vllm_engine.py
Lines: 1-271

Signature

class VllmEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    async def _generate(
        self,
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[list["ImageInput"]] = None,
        videos: Optional[list["VideoInput"]] = None,
        audios: Optional[list["AudioInput"]] = None,
        **input_kwargs,
    ) -> AsyncIterator["RequestOutput"]: ...

    async def chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> list["Response"]: ...
    async def stream_chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> AsyncGenerator[str, None]: ...
    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...
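To illustrate how stream_chat can yield incremental text on top of the iterator returned by _generate, the sketch below assumes each output carries the full text generated so far and emits only the new suffix. The names fake_request_outputs and stream_deltas are hypothetical stand-ins, and plain strings replace RequestOutput objects to keep the example self-contained.

```python
import asyncio
from collections.abc import AsyncIterator


async def fake_request_outputs() -> AsyncIterator[str]:
    """Stand-in for the engine's iterator: each item is the full text so far."""
    for cumulative_text in ["Hel", "Hello", "Hello, wor", "Hello, world!"]:
        yield cumulative_text


async def stream_deltas(outputs: AsyncIterator[str]) -> AsyncIterator[str]:
    """Yield only the newly generated suffix of each cumulative snapshot."""
    generated_text = ""
    async for cumulative_text in outputs:
        delta_text = cumulative_text[len(generated_text):]
        generated_text = cumulative_text
        yield delta_text


async def collect_deltas() -> list[str]:
    return [delta async for delta in stream_deltas(fake_request_outputs())]


deltas = asyncio.run(collect_deltas())
print(deltas)
```

Concatenating the deltas reproduces the final text, which is why a client can print each piece as it arrives.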

Import

from llamafactory.chat.vllm_engine import VllmEngine

I/O Contract

Inputs

Name	Type	Required	Description
model_args	ModelArguments	Yes	Model configuration including vllm_maxlen, vllm_gpu_util, vllm_enforce_eager, vllm_max_lora_rank, vllm_config
data_args	DataArguments	Yes	Data configuration for template setup
finetuning_args	FinetuningArguments	Yes	Fine-tuning configuration; generation is enabled when the stage is "sft"
generating_args	GeneratingArguments	Yes	Default generation parameters
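messages	list[dict[str, str]]	Yes	Chat messages for generation
system	Optional[str]	No	System prompt for the conversation
tools	Optional[str]	No	Tool definitions passed to the template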
images	list[ImageInput]	No	Image inputs (regularized via mm_plugin)
videos	list[VideoInput]	No	Video inputs (regularized via mm_plugin)
audios	list[AudioInput]	No	Audio inputs (regularized via mm_plugin)
num_return_sequences	int	No	Number of completions to generate, passed via **input_kwargs (default: 1)
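Per-request settings such as num_return_sequences override the defaults in generating_args. A minimal sketch of that precedence rule follows, assuming an override applies only when it is not None. resolve_generation_settings is a hypothetical helper and the default values are illustrative; the engine itself maps the resolved values onto vLLM's SamplingParams.

```python
from typing import Any, Optional


def resolve_generation_settings(
    defaults: dict[str, Any],
    overrides: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Return a copy of defaults updated with every override that is not None."""
    resolved = dict(defaults)
    for name, value in (overrides or {}).items():
        if value is not None:
            resolved[name] = value
    return resolved


# Illustrative defaults, not necessarily the library's actual values.
defaults = {"temperature": 0.95, "top_p": 0.7, "max_new_tokens": 1024, "num_return_sequences": 1}
settings = resolve_generation_settings(
    defaults,
    {"temperature": 0.2, "top_p": None, "num_return_sequences": 3},
)
```

Copying the defaults before applying overrides keeps one request's settings from leaking into the next.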

Outputs

Name	Type	Description
chat	list[Response]	Generated responses, each with the response text, response token count, prompt token count, and finish reason
stream_chat	AsyncGenerator[str, None]	Incremental text deltas yielded as generation progresses
get_scores	raises NotImplementedError	Reward model scoring is not supported
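The shape of a chat result can be pictured with a small dataclass that mirrors the four fields listed above. This is an illustrative stand-in rather than the library's own class: apart from response_text, the field names and the finish reason values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Illustrative mirror of the four fields in the Outputs table."""

    response_text: str
    response_length: int  # number of generated tokens
    prompt_length: int  # number of prompt tokens
    finish_reason: str  # assumed here to be "stop" or "length"


def truncated(responses: list[Response]) -> list[Response]:
    """Return the responses that ended because they hit the length limit."""
    return [r for r in responses if r.finish_reason == "length"]


responses = [
    Response("Code flows like water", 6, 12, "stop"),
    Response("Semicolons fall like", 4, 12, "length"),
]
cut_off = truncated(responses)
```

Checking the finish reason this way is one option for deciding whether to retry a request with a larger token budget.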

Usage Examples

from llamafactory.chat import ChatModel

# Use vLLM backend for high-throughput inference
chat_model = ChatModel(args={
    "model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
    "template": "llama2",
    "infer_backend": "vllm",
    "vllm_maxlen": 4096,
    "vllm_gpu_util": 0.9,
})

# Generate multiple completions
messages = [{"role": "user", "content": "Write a haiku about coding."}]
responses = chat_model.chat(messages, num_return_sequences=3)
for resp in responses:
    print(resp.response_text)

# Streaming: each item is the newly generated text since the previous item
for delta in chat_model.stream_chat(messages):
    print(delta, end="", flush=True)

Related Pages

Hiyouga_LLaMA_Factory_Base_Engine - Abstract base class this engine implements
Hiyouga_LLaMA_Factory_Chat_Model - Facade that selects and delegates to this engine
Hiyouga_LLaMA_Factory_SGLang_Engine - Alternative SGLang-based engine
Hiyouga_LLaMA_Factory_KT_Engine - Alternative KTransformers-based engine
