
Implementation:Hiyouga LLaMA Factory VLLM Engine

From Leeroopedia
Revision as of 15:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Hiyouga_LLaMA_Factory_VLLM_Engine.md)


Knowledge Sources
Domains Inference, High-Throughput Serving
Last Updated 2026-02-06 19:00 GMT

Overview

VllmEngine is LLaMA Factory's vLLM-based asynchronous inference engine. It provides high-throughput text generation using vLLM's PagedAttention and continuous batching.

Description

The VllmEngine class initializes a vLLM AsyncLLMEngine with configurable tensor parallelism (auto-detected from the device count), GPU memory utilization, LoRA adapter support, and multimodal input limits. For GPTQ-quantized models it overrides the dtype to float16, and it applies Yi-VL projector patching when a Yi-VL model is detected. The _generate method builds SamplingParams by combining the default generation settings with any per-request overrides, prepares multimodal data (images, videos, audios) through the template's multimodal plugin, and returns an async iterator of RequestOutput objects. The chat method consumes the iterator and returns the collected responses, while stream_chat yields delta text incrementally. Reward model scoring is not supported: get_scores raises NotImplementedError.
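The way per-request settings are layered over engine defaults can be sketched in plain Python. This is a minimal illustration only, not the engine's actual code: `DefaultGenerationSettings`, its field values, and `merge_generation_settings` are hypothetical names invented for this example, and the rule that a `None` override falls back to the default is an assumption.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class DefaultGenerationSettings:
    """Hypothetical stand-in for engine-level defaults (not the real GeneratingArguments)."""

    temperature: float = 0.95
    top_p: float = 0.7
    max_new_tokens: int = 1024
    num_return_sequences: int = 1


def merge_generation_settings(
    defaults: DefaultGenerationSettings, **request_kwargs: Any
) -> dict[str, Any]:
    """Return the defaults with per-request overrides applied on top.

    Assumption: an override passed as None means "not set", so the default wins.
    """
    merged = asdict(defaults)
    for key, value in request_kwargs.items():
        if value is not None:
            merged[key] = value
    return merged


# One request asks for three low-temperature completions and leaves top_p unset.
settings = merge_generation_settings(
    DefaultGenerationSettings(),
    temperature=0.2,
    top_p=None,
    num_return_sequences=3,
)
print(settings)
```

In a real engine the merged values would then be passed to the sampling-parameter object; that step is omitted here so the sketch stays independent of any installed vLLM version.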

Usage

Select this engine by setting --infer_backend vllm (or "infer_backend": "vllm" when passing arguments as a dictionary). It suits high-throughput serving scenarios and supports LoRA adapters, multimodal models, and multiple return sequences per request.

Code Reference

Source Location

Signature

class VllmEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    async def _generate(
        self,
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[list["ImageInput"]] = None,
        videos: Optional[list["VideoInput"]] = None,
        audios: Optional[list["AudioInput"]] = None,
        **input_kwargs,
    ) -> AsyncIterator["RequestOutput"]: ...

    async def chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> list["Response"]: ...
    async def stream_chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> AsyncGenerator[str, None]: ...
    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...

Import

from llamafactory.chat.vllm_engine import VllmEngine

I/O Contract

Inputs

Name | Type | Required | Description
model_args | ModelArguments | Yes | Model configuration including vllm_maxlen, vllm_gpu_util, vllm_enforce_eager, vllm_max_lora_rank, vllm_config
data_args | DataArguments | Yes | Data configuration for template setup
finetuning_args | FinetuningArguments | Yes | Finetuning stage ("sft" enables generation)
generating_args | GeneratingArguments | Yes | Default generation parameters
messages | list[dict[str, str]] | Yes | Chat messages for generation
images | list[ImageInput] | No | Image inputs (regularized via mm_plugin)
videos | list[VideoInput] | No | Video inputs (regularized via mm_plugin)
audios | list[AudioInput] | No | Audio inputs (regularized via mm_plugin)
num_return_sequences | int | No | Number of completions to generate (default: 1)

Outputs

Name | Type | Description
chat | list[Response] | Generated responses, each with the response text, response length in tokens, prompt length in tokens, and finish reason
stream_chat | AsyncGenerator[str, None] | Delta text yielded incrementally as generation progresses
get_scores | raises NotImplementedError | Reward model scoring is not supported
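The difference between the collected output and the streamed output can be sketched without a GPU. The example below is a simplified illustration under stated assumptions: `fake_request_outputs` is a hypothetical stand-in for the engine's output iterator, and it assumes each item carries the cumulative text generated so far, so a delta is obtained by slicing off what was already emitted. Plain strings are used in place of RequestOutput objects.

```python
import asyncio
from collections.abc import AsyncIterator


async def fake_request_outputs() -> AsyncIterator[str]:
    """Hypothetical stand-in for the engine iterator: yields cumulative text."""
    for snapshot in ["Hel", "Hello", "Hello, wor", "Hello, world"]:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield snapshot


async def collect_final(outputs: AsyncIterator[str]) -> str:
    """chat-style consumption: drain the iterator and keep only the last snapshot."""
    final = ""
    async for snapshot in outputs:
        final = snapshot
    return final


async def stream_deltas(outputs: AsyncIterator[str]) -> AsyncIterator[str]:
    """stream_chat-style consumption: yield only the text added since the last step."""
    generated = ""
    async for snapshot in outputs:
        delta = snapshot[len(generated):]
        generated = snapshot
        yield delta


async def demo() -> tuple[str, list[str]]:
    final = await collect_final(fake_request_outputs())
    deltas = [d async for d in stream_deltas(fake_request_outputs())]
    return final, deltas


final_text, deltas = asyncio.run(demo())
print(final_text)
print(deltas)
```

Joining the deltas reproduces the final text, which is why a client can render streamed chunks as they arrive and end up with the same string that the non-streaming call returns.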

Usage Examples

from llamafactory.chat import ChatModel

# Use vLLM backend for high-throughput inference
chat_model = ChatModel(args={
    "model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
    "template": "llama2",
    "infer_backend": "vllm",
    "vllm_maxlen": 4096,
    "vllm_gpu_util": 0.9,
})

# Generate multiple completions
messages = [{"role": "user", "content": "Write a haiku about coding."}]
responses = chat_model.chat(messages, num_return_sequences=3)
for resp in responses:
    print(resp.response_text)

# Streaming
for token in chat_model.stream_chat(messages):
    print(token, end="", flush=True)

Related Pages
