Implementation:Hiyouga LLaMA Factory VLLM Engine
| Knowledge Sources | |
|---|---|
| Domains | Inference, High-Throughput Serving |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
VllmEngine is LLaMA Factory's vLLM-based asynchronous inference engine. It provides high-throughput text generation for production serving, built on vLLM's PagedAttention and continuous batching.
Description
The VllmEngine class initializes a vLLM AsyncLLMEngine with configurable tensor parallelism (defaulting to the detected device count), GPU memory utilization, LoRA adapter support, and multimodal input limits. It overrides the dtype for GPTQ-quantized models (forcing float16) and applies Yi-VL projector patching when a Yi-VL model is detected. The _generate method constructs SamplingParams from both the default and per-request generation settings, prepares multimodal data (images, videos, audios) via the template's multimodal plugin, and returns an async iterator of RequestOutput objects. The chat method drains the iterator and builds its responses from the final result, while stream_chat yields delta text incrementally. Reward model scoring (get_scores) is not supported and raises NotImplementedError.
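The initialization choices described above can be sketched in plain Python. This is a minimal illustration, not the engine's actual code: `build_engine_settings` and its parameters are hypothetical, and only the dictionary keys (`tensor_parallel_size`, `dtype`, `gpu_memory_utilization`) correspond to vLLM engine argument names.

```python
from typing import Any, Optional


def build_engine_settings(
    device_count: int,
    quant_method: Optional[str],
    requested_dtype: str,
    gpu_util: float,
) -> dict[str, Any]:
    """Hypothetical helper mirroring the setup rules in the description."""
    return {
        # Fall back to a single device when none are detected.
        "tensor_parallel_size": device_count or 1,
        # Assumption: GPTQ checkpoints are always served in float16.
        "dtype": "float16" if quant_method == "gptq" else requested_dtype,
        "gpu_memory_utilization": gpu_util,
    }


gptq_settings = build_engine_settings(4, "gptq", "bfloat16", 0.9)
cpu_settings = build_engine_settings(0, None, "auto", 0.9)
```

In this sketch a GPTQ model on four devices ends up with float16 and a tensor-parallel size of 4, while a host with no detected devices still gets a valid size of 1.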
Usage
Use this engine for production inference by setting --infer_backend vllm. It is the recommended choice for high-throughput serving scenarios with support for LoRA adapters, multimodal models, and multiple return sequences.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/chat/vllm_engine.py
- Lines: 1-271
Signature
class VllmEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    async def _generate(
        self,
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[list["ImageInput"]] = None,
        videos: Optional[list["VideoInput"]] = None,
        audios: Optional[list["AudioInput"]] = None,
        **input_kwargs,
    ) -> AsyncIterator["RequestOutput"]: ...

    async def chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> list["Response"]: ...

    async def stream_chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> AsyncGenerator[str, None]: ...

    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...
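The difference between the `chat` and `stream_chat` signatures can be illustrated without vLLM. The sketch below assumes each intermediate result carries the cumulative text generated so far; `fake_generate`, `collect_final`, and `stream_deltas` are hypothetical stand-ins, not LLaMA Factory functions.

```python
import asyncio
from collections.abc import AsyncIterator


async def fake_generate(steps: list[str]) -> AsyncIterator[str]:
    """Hypothetical stand-in for the engine's iterator: yields cumulative text."""
    for text in steps:
        await asyncio.sleep(0)  # hand control back to the event loop between steps
        yield text


async def collect_final(steps: list[str]) -> str:
    """chat-style consumption: drain the iterator and keep the last result."""
    final = ""
    async for text in fake_generate(steps):
        final = text
    return final


async def stream_deltas(steps: list[str]) -> AsyncIterator[str]:
    """stream_chat-style consumption: yield only the newly added text."""
    generated = ""
    async for text in fake_generate(steps):
        delta = text[len(generated):]
        generated = text
        yield delta


async def demo() -> tuple[str, list[str]]:
    steps = ["Hel", "Hello", "Hello, wor", "Hello, world"]
    final = await collect_final(steps)
    deltas = [d async for d in stream_deltas(steps)]
    return final, deltas


final_text, delta_chunks = asyncio.run(demo())
```

Concatenating the streamed deltas reproduces the final text, which is why a caller can print them as they arrive.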
Import
from llamafactory.chat.vllm_engine import VllmEngine
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_args | ModelArguments | Yes | Model configuration including vllm_maxlen, vllm_gpu_util, vllm_enforce_eager, vllm_max_lora_rank, vllm_config |
| data_args | DataArguments | Yes | Data configuration for template setup |
| finetuning_args | FinetuningArguments | Yes | Finetuning stage ("sft" enables generation) |
| generating_args | GeneratingArguments | Yes | Default generation parameters |
| messages | list[dict[str, str]] | Yes | Chat messages for generation |
| images | list[ImageInput] | No | Image inputs (regularized via mm_plugin) |
| videos | list[VideoInput] | No | Video inputs (regularized via mm_plugin) |
| audios | list[AudioInput] | No | Audio inputs (regularized via mm_plugin) |
| num_return_sequences | int | No | Number of completions to generate (default: 1) |
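Per-request keyword arguments such as num_return_sequences are combined with the defaults from generating_args. A minimal sketch of one way to do that merge follows; `resolve_generation_settings` is a hypothetical helper, and the choice of `temperature` and `top_p` as the example settings is an assumption made for illustration.

```python
from typing import Any, Optional


def resolve_generation_settings(
    defaults: dict[str, Any],
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    num_return_sequences: int = 1,
) -> dict[str, Any]:
    """Hypothetical helper: a per-request value wins, None falls back to the default."""
    return {
        "n": num_return_sequences,
        # Compare against None so that an explicit 0.0 is not mistaken for "unset".
        "temperature": temperature if temperature is not None else defaults["temperature"],
        "top_p": top_p if top_p is not None else defaults["top_p"],
    }


defaults = {"temperature": 0.95, "top_p": 0.7}
plain = resolve_generation_settings(defaults)
custom = resolve_generation_settings(defaults, temperature=0.0, num_return_sequences=3)
```

Checking `is not None` rather than truthiness matters here: a request for greedy decoding with `temperature=0.0` would otherwise silently revert to the default.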
Outputs
| Name | Type | Description |
|---|---|---|
| chat | list[Response] | One Response per returned sequence, each with the generated text, response token count, prompt token count, and finish reason |
| stream_chat | AsyncGenerator[str, None] | Delta text yielded incrementally as generation progresses |
| get_scores | raises NotImplementedError | Reward model scoring is not supported |
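The shape of the chat output can be sketched as follows. `FakeCompletion` and `ResponseRecord` are illustrative stand-ins for a vLLM completion and for LLaMA Factory's Response; the field names on `ResponseRecord` are assumed from the description above and from the usage example's `response_text`.

```python
from dataclasses import dataclass


@dataclass
class FakeCompletion:
    """Hypothetical stand-in for one completion in a final engine result."""
    text: str
    token_ids: list[int]
    finish_reason: str


@dataclass
class ResponseRecord:
    """Illustrative record holding the four values listed in the Outputs table."""
    response_text: str
    response_length: int
    prompt_length: int
    finish_reason: str


def to_records(prompt_token_ids: list[int], completions: list[FakeCompletion]) -> list[ResponseRecord]:
    """Build one record per completion; lengths are token counts."""
    return [
        ResponseRecord(
            response_text=c.text,
            response_length=len(c.token_ids),
            prompt_length=len(prompt_token_ids),
            finish_reason=c.finish_reason,
        )
        for c in completions
    ]


records = to_records(
    prompt_token_ids=[1, 2, 3, 4, 5],
    completions=[
        FakeCompletion("first draft", [10, 11], "stop"),
        FakeCompletion("second draft, cut off", [12, 13, 14, 15], "length"),
    ],
)
```

With num_return_sequences greater than 1, the returned list would contain one such record per completion, all sharing the same prompt length.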
Usage Examples
from llamafactory.chat import ChatModel

# Use vLLM backend for high-throughput inference
chat_model = ChatModel(args={
    "model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
    "template": "llama2",
    "infer_backend": "vllm",
    "vllm_maxlen": 4096,
    "vllm_gpu_util": 0.9,
})

# Generate multiple completions
messages = [{"role": "user", "content": "Write a haiku about coding."}]
responses = chat_model.chat(messages, num_return_sequences=3)
for resp in responses:
    print(resp.response_text)

# Streaming
for token in chat_model.stream_chat(messages):
    print(token, end="", flush=True)
Related Pages
- Hiyouga_LLaMA_Factory_Base_Engine - Abstract base class this engine implements
- Hiyouga_LLaMA_Factory_Chat_Model - Facade that selects and delegates to this engine
- Hiyouga_LLaMA_Factory_SGLang_Engine - Alternative SGLang-based engine
- Hiyouga_LLaMA_Factory_KT_Engine - Alternative KTransformers-based engine