Implementation:Hiyouga LLaMA Factory SGLang Engine
| Knowledge Sources | |
|---|---|
| Domains | Inference, High-Throughput Serving |
| Last Updated | 2026-02-06 19:00 GMT |
Overview
SGLang Engine is an inference engine implementation that launches an SGLang HTTP server as a subprocess and communicates with it via streaming HTTP requests.
Description
The SGLangEngine class starts an SGLang server process using launch_server_cmd, with configurable tensor parallelism, memory fraction, context length, and LoRA adapter support. It sends HTTP POST requests to the server's /generate endpoint and parses the streaming SSE (Server-Sent Events) JSON responses. Before each request, the engine inserts multimodal placeholders for image, video, and audio inputs, processes the messages through the template's multimodal plugin, and builds sampling parameters from the default generation settings combined with any per-request overrides. Server cleanup is handled by an atexit registration and a __del__ method. The engine does not support reward model scoring: get_scores raises NotImplementedError.
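The SSE parsing step described above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the helper name iter_sse_json, the "text" field in the sample payloads, and the closing [DONE] sentinel are assumptions made for the example.

```python
import json
from typing import Any, Iterable, Iterator


def iter_sse_json(lines: Iterable[bytes]) -> Iterator[dict[str, Any]]:
    """Yield decoded JSON payloads from a stream of SSE byte lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        # Skip blank separator lines and anything that is not a data field.
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        # Assumption: the stream ends with a literal [DONE] sentinel.
        if payload == "[DONE]":
            break
        yield json.loads(payload)


# Hypothetical stream, shaped like what a streaming endpoint might emit.
sample = [
    b'data: {"text": "Once"}',
    b"",
    b'data: {"text": "Once upon"}',
    b"data: [DONE]",
]
chunks = list(iter_sse_json(sample))
```

With the requests library, the same helper could consume response.iter_lines() from a POST made with stream=True, since that iterator also yields one byte string per line.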
Usage
Use this engine for high-throughput inference by setting --infer_backend sglang. It is particularly suitable for production serving scenarios where SGLang's RadixAttention and continuous batching optimizations are beneficial.
Code Reference
Source Location
- Repository: Hiyouga_LLaMA_Factory
- File: src/llamafactory/chat/sglang_engine.py
- Lines: 1-289
Signature
class SGLangEngine(BaseEngine):
    def __init__(
        self,
        model_args: "ModelArguments",
        data_args: "DataArguments",
        finetuning_args: "FinetuningArguments",
        generating_args: "GeneratingArguments",
    ) -> None: ...

    def _cleanup_server(self): ...

    async def _generate(
        self,
        messages: list[dict[str, str]],
        system: Optional[str] = None,
        tools: Optional[str] = None,
        images: Optional[list["ImageInput"]] = None,
        videos: Optional[list["VideoInput"]] = None,
        audios: Optional[list["AudioInput"]] = None,
        **input_kwargs,
    ) -> AsyncIterator[dict[str, Any]]: ...

    async def chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> list["Response"]: ...

    async def stream_chat(self, messages, system=None, tools=None, images=None, videos=None, audios=None, **input_kwargs) -> AsyncGenerator[str, None]: ...

    async def get_scores(self, batch_input, **input_kwargs) -> list[float]: ...
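The _cleanup_server hook in the signature is described as being tied to both atexit and __del__, which means the cleanup path can be reached more than once. The sketch below shows one way to make such a hook idempotent. ManagedServer and FakeProcess are hypothetical stand-ins invented for this example; the real engine's termination logic may differ.

```python
import atexit


class FakeProcess:
    """Stand-in for a server subprocess handle (hypothetical)."""

    def __init__(self) -> None:
        self.terminate_calls = 0

    def terminate(self) -> None:
        self.terminate_calls += 1


class ManagedServer:
    """Owns a process handle and terminates it at most once."""

    def __init__(self, process) -> None:
        self._process = process
        self._cleaned_up = False
        # Run cleanup at interpreter exit even if nobody calls it explicitly.
        atexit.register(self.cleanup)

    def cleanup(self) -> None:
        # Idempotent: explicit calls, __del__, and atexit may all reach here.
        if self._cleaned_up:
            return
        self._cleaned_up = True
        self._process.terminate()

    def __del__(self) -> None:
        # Best-effort fallback; skip if __init__ never finished.
        if hasattr(self, "_cleaned_up"):
            self.cleanup()


process = FakeProcess()
server = ManagedServer(process)
server.cleanup()
server.cleanup()  # second call is a no-op
```

Note that atexit.register keeps a reference to the bound method, so the object stays alive until interpreter exit. In this sketch, __del__ acts only as a fallback and the atexit hook is the path that reliably runs.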
Import
from llamafactory.chat.sglang_engine import SGLangEngine
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_args | ModelArguments | Yes | Model configuration including sglang_maxlen, sglang_mem_fraction, sglang_tp_size, sglang_lora_backend |
| data_args | DataArguments | Yes | Data configuration for template setup |
| finetuning_args | FinetuningArguments | Yes | Finetuning stage ("sft" enables generation) |
| generating_args | GeneratingArguments | Yes | Default generation parameters (temperature, top_p, top_k, etc.) |
| messages | list[dict[str, str]] | Yes | Chat messages for generation |
| images | list[ImageInput] | No | Image inputs for multimodal models |
| videos | list[VideoInput] | No | Video inputs for multimodal models |
| audios | list[AudioInput] | No | Audio inputs for multimodal models |
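The way default generation parameters combine with per-request settings can be illustrated with a small merge function. This is a sketch under assumptions: build_sampling_params is a hypothetical helper, and the rule that a None override falls back to the default is an illustrative choice rather than a confirmed detail of the engine.

```python
from typing import Any, Optional


def build_sampling_params(
    defaults: dict[str, Any],
    overrides: dict[str, Optional[Any]],
) -> dict[str, Any]:
    """Merge per-request overrides onto default generation settings."""
    params = dict(defaults)  # copy so the defaults are never mutated
    for key, value in overrides.items():
        # Assumption: None means "not provided", so the default is kept.
        if value is not None:
            params[key] = value
    return params


defaults = {"temperature": 0.7, "top_p": 0.9, "top_k": 50}
params = build_sampling_params(defaults, {"temperature": 0.2, "top_p": None})
```

Here only temperature changes for the request; top_p and top_k keep the values taken from the defaults.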
Outputs
| Name | Type | Description |
|---|---|---|
| chat | list[Response] | Generated responses with text, completion/prompt token counts, and finish reason |
| stream_chat | AsyncGenerator[str, None] | Token-by-token delta text streaming output |
| get_scores | raises NotImplementedError | Reward model scoring is not supported |
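Delta text streaming can be derived from a stream of growing snapshots by emitting only the newly added suffix each time. The sketch below assumes the upstream source returns the full text generated so far on every chunk; that assumption, and the helper name iter_deltas, are illustrative and not taken from the engine's source.

```python
from typing import Iterable, Iterator


def iter_deltas(cumulative_texts: Iterable[str]) -> Iterator[str]:
    """Turn cumulative text snapshots into incremental deltas."""
    emitted = ""
    for text in cumulative_texts:
        # Everything beyond what was already emitted is new.
        delta = text[len(emitted):]
        emitted = text
        if delta:  # skip snapshots that add nothing
            yield delta


snapshots = ["Once", "Once upon", "Once upon", "Once upon a time"]
deltas = list(iter_deltas(snapshots))
```

Joining the deltas reproduces the final snapshot, which is a convenient sanity check for any streaming consumer.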
Usage Examples
from llamafactory.chat import ChatModel

# Use SGLang backend for high-throughput serving
chat_model = ChatModel(args={
    "model_name_or_path": "meta-llama/Llama-2-7b-chat-hf",
    "template": "llama2",
    "infer_backend": "sglang",
    "sglang_maxlen": 4096,
    "sglang_tp_size": 2,
})

messages = [{"role": "user", "content": "Tell me a story."}]
responses = chat_model.chat(messages)
print(responses[0].response_text)

# Streaming
for token in chat_model.stream_chat(messages):
    print(token, end="", flush=True)
Related Pages
- Hiyouga_LLaMA_Factory_Base_Engine - Abstract base class this engine implements
- Hiyouga_LLaMA_Factory_Chat_Model - Facade that selects and delegates to this engine
- Hiyouga_LLaMA_Factory_VLLM_Engine - Alternative vLLM-based engine
- Hiyouga_LLaMA_Factory_KT_Engine - Alternative KTransformers-based engine