Implementation:Hiyouga LLaMA Factory SGLang Engine

From Leeroopedia
Revision as of 15:07, 16 February 2026 by Admin (talk | contribs) (Auto-imported from implementations/Hiyouga_LLaMA_Factory_SGLang_Engine.md)


Knowledge Sources
Domains Inference, High-Throughput Serving
Last Updated 2026-02-06 19:00 GMT

Overview

SGLang Engine is an inference engine implementation that launches an SGLang HTTP server as a subprocess and communicates with it via streaming HTTP requests.
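
As a rough illustration of the streaming side, the sketch below parses Server-Sent Events lines into JSON payloads. The helper name iter_sse_json, the "data: [DONE]" end-of-stream sentinel, and the {"text": ...} payload shape are assumptions made for this example, not details confirmed by this page.

```python
import json
from typing import Any, Iterable, Iterator


def iter_sse_json(lines: Iterable[bytes]) -> Iterator[dict[str, Any]]:
    """Yield the JSON payload of each `data:` line until a `[DONE]` sentinel."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line or not line.startswith("data:"):
            continue  # skip blank separator lines and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream marker
            break
        yield json.loads(payload)


# Hypothetical server output used only to exercise the parser.
sample = [
    b'data: {"text": "Once"}',
    b"",
    b'data: {"text": "Once upon"}',
    b"data: [DONE]",
    b'data: {"text": "ignored"}',
]
chunks = list(iter_sse_json(sample))
```

With the requests library, the byte lines would typically come from response.iter_lines() on a POST issued with stream=True.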

Description

The SGLangEngine class starts an SGLang server process with launch_server_cmd, configuring tensor parallelism, memory fraction, context length, and LoRA adapter support. It sends HTTP POST requests to the server's /generate endpoint and parses the streaming SSE (Server-Sent Events) JSON responses. The engine handles multimodal input placeholders for images, videos, and audio, processes messages through the template's multimodal plugin, and builds sampling parameters from both the default and the per-request generation settings. Server cleanup is handled by an atexit registration and a __del__ method. Reward model scoring is not supported: get_scores raises NotImplementedError.
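
The cleanup pattern described above (an atexit hook plus a __del__ fallback) can be sketched generically as follows. ServerHandle and its method names are hypothetical, and a sleeping child Python process stands in for the real server; this is not the engine's actual code.

```python
import atexit
import subprocess
import sys


class ServerHandle:
    """Hypothetical wrapper that owns a child process and stops it exactly once."""

    def __init__(self, cmd: list[str]) -> None:
        self.process = subprocess.Popen(cmd)
        self._cleaned = False
        # Runs at normal interpreter exit. The registered bound method also
        # keeps this object alive until then.
        atexit.register(self.cleanup)

    def cleanup(self) -> None:
        if self._cleaned:
            return  # idempotent: safe to call from both atexit and __del__
        self._cleaned = True
        if self.process.poll() is None:  # child is still running
            self.process.terminate()
            try:
                self.process.wait(timeout=5)
            except subprocess.TimeoutExpired:
                self.process.kill()  # escalate if terminate was ignored
                self.process.wait()

    def __del__(self) -> None:
        # Best-effort fallback only; Python does not guarantee __del__ runs.
        try:
            self.cleanup()
        except Exception:
            pass


# Stand-in for a long-running server: a child process that just sleeps.
handle = ServerHandle([sys.executable, "-c", "import time; time.sleep(60)"])
handle.cleanup()
handle.cleanup()  # second call is a no-op
```

Making cleanup idempotent matters because both hooks may fire for the same object, and terminating an already-reaped process should not raise.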

Usage

Use this engine for high-throughput inference by setting --infer_backend sglang. It is particularly suitable for production serving scenarios where SGLang's RadixAttention and continuous batching optimizations are beneficial.

Code Reference

Source Location

Signature

class SGLangEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    def _cleanup_server(self): ...

    async def _generate(
        self,
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[list["ImageInput"]] = None,
        videos: Optional[list["VideoInput"]] = None,
        audios: Optional[list["AudioInput"]] = None,
        **input_kwargs,
    ) -> AsyncIterator[dict[str, Any]]: ...

    async def chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> list["Response"]: ...
    async def stream_chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> AsyncGenerator[str, None]: ...
    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...

Import

from llamafactory.chat.sglang_engine import SGLangEngine

I/O Contract

Inputs

Name Type Required Description
model_args ModelArguments Yes Model configuration including sglang_maxlen, sglang_mem_fraction, sglang_tp_size, sglang_lora_backend
data_args DataArguments Yes Data configuration for template setup
finetuning_args FinetuningArguments Yes Finetuning stage ("sft" enables generation)
generating_args GeneratingArguments Yes Default generation parameters (temperature, top_p, top_k, etc.)
messages list[dict[str, str]] Yes Chat messages for generation
images list[ImageInput] No Image inputs for multimodal models
videos list[VideoInput] No Video inputs for multimodal models
audios list[AudioInput] No Audio inputs for multimodal models
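
The generating_args row supplies defaults, while per-request settings arrive through **input_kwargs. One plausible way to combine them is sketched below; DefaultGenArgs, build_sampling_params, the chosen keys, and the numeric values are all illustrative assumptions rather than the engine's real parameter set.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass
class DefaultGenArgs:
    """Hypothetical stand-in for the default generation settings."""

    temperature: float = 0.95
    top_p: float = 0.7
    top_k: int = 50
    max_new_tokens: int = 1024


def build_sampling_params(defaults: DefaultGenArgs, **input_kwargs: Any) -> dict[str, Any]:
    """Start from the defaults; a per-request value wins only when it is not None."""
    params = asdict(defaults)
    for key in params:
        value = input_kwargs.get(key)
        if value is not None:
            params[key] = value
    return params  # keys not present in the defaults are ignored in this sketch


params = build_sampling_params(DefaultGenArgs(), temperature=0.2, top_k=None)
```

Treating None as "not provided" lets callers forward optional arguments unchanged without clobbering a default.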

Outputs

Method Return Type Description
chat list[Response] Generated responses, each with text, completion/prompt token counts, and finish reason
stream_chat AsyncGenerator[str, None] Incremental text deltas, yielded as they are generated
get_scores list[float] (declared) Not supported; always raises NotImplementedError

Usage Examples

from llamafactory.chat import ChatModel

# Use SGLang backend for high-throughput serving
chat_model = ChatModel(args={
    "model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
    "template": "llama2",
    "infer_backend": "sglang",
    "sglang_maxlen": 4096,  # context length passed to the SGLang server
    "sglang_tp_size": 2,  # tensor parallelism degree
})

messages = [{"role": "user", "content": "Tell me a story."}]
responses = chat_model.chat(messages)
print(responses[0].response_text)

# Streaming
for token in chat_model.stream_chat(messages):
    print(token, end="", flush=True)
