Implementation: SGLang Engine Init
| Knowledge Sources | |
|---|---|
| Domains | LLM_Serving, Inference_Engine |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
Concrete tool for initializing the SGLang inference engine, which runs on the multi-process architecture provided by the SGLang runtime.
Description
The Engine class is the main entry point for programmatic LLM inference in SGLang. On initialization, it spawns TokenizerManager, Scheduler, and DetokenizerManager subprocesses, sets up ZMQ IPC communication, and registers automatic shutdown via atexit. It accepts either a ServerArgs object directly or keyword arguments that mirror ServerArgs fields.
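The lifecycle described above, spawning worker subprocesses at construction time and registering cleanup with atexit, can be sketched in miniature. This is an illustrative pattern only, not SGLang's actual implementation; `MiniEngine` and `_worker` are hypothetical names standing in for the Engine and its TokenizerManager/Scheduler/DetokenizerManager subprocesses.

```python
# Illustrative sketch (not SGLang code): spawn worker subprocesses on
# __init__ and register automatic shutdown via atexit, mirroring the
# Engine lifecycle described above.
import atexit
import multiprocessing as mp


def _worker(conn):
    # Stand-in for TokenizerManager / Scheduler / DetokenizerManager work.
    conn.send("ready")
    conn.recv()  # block until the parent tells us to stop


class MiniEngine:
    def __init__(self, num_workers=3):
        self.procs, self.conns = [], []
        for _ in range(num_workers):
            parent, child = mp.Pipe()
            p = mp.Process(target=_worker, args=(child,), daemon=True)
            p.start()
            # Wait for the subprocess to signal readiness, as the real
            # Engine waits for its subprocesses to come up.
            assert parent.recv() == "ready"
            self.procs.append(p)
            self.conns.append(parent)
        # Mirror the Engine's automatic cleanup on interpreter exit.
        atexit.register(self.shutdown)

    def shutdown(self):
        for conn, p in zip(self.conns, self.procs):
            if p.is_alive():
                conn.send("stop")
                p.join(timeout=5)
```

The real Engine additionally wires the subprocesses together over ZMQ IPC sockets rather than pipes; the pattern of construct-then-register-cleanup is the same.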
Usage
Import Engine (or use sgl.Engine) when performing offline batch inference, embedding computation, or any programmatic model interaction without an HTTP server.
Code Reference
Source Location
- Repository: sglang
- File: python/sglang/srt/entrypoints/engine.py
- Lines: L118-204
Signature
```python
class Engine(EngineBase):
    def __init__(self, **kwargs):
        """
        Args mirror ServerArgs fields. Key parameters:
            model_path (str): HuggingFace model ID or local path.
            log_level (str): Logging level (default: "error" for Engine).
            server_args (ServerArgs): Direct ServerArgs object (alternative to kwargs).
            tp_size (int): Tensor parallelism degree.
            dtype (str): Weight data type.
            quantization (Optional[str]): Quantization method.
            mem_fraction_static (Optional[float]): GPU memory fraction for KV cache.
        """
```
Import
```python
import sglang as sgl
# Or directly:
from sglang.srt.entrypoints.engine import Engine
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model_path | str | Yes (via kwargs or server_args) | HuggingFace model ID or local path |
| server_args | ServerArgs | No | Pre-constructed ServerArgs (alternative to kwargs) |
| log_level | str | No | Logging level (default: "error") |
| tp_size | int | No | Tensor parallelism degree (default: 1) |
| dtype | str | No | Weight data type (default: "auto") |
Outputs
| Name | Type | Description |
|---|---|---|
| Engine instance | Engine | Initialized engine with running subprocesses (TokenizerManager, Scheduler, DetokenizerManager) |
Usage Examples
Basic Initialization
```python
import sglang as sgl

# Initialize with kwargs (simplest form)
engine = sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct")

# Use the engine for generation...
output = engine.generate("What is AI?", {"max_new_tokens": 64})

# Shut down when done
engine.shutdown()
```
Context Manager
```python
import sglang as sgl

# Engine supports the context manager protocol for automatic shutdown
with sgl.Engine(model_path="meta-llama/Llama-3.1-8B-Instruct", tp_size=2) as engine:
    output = engine.generate("Explain quantum computing.", {"max_new_tokens": 128})
    print(output["text"])
# Engine is automatically shut down here
```
With Pre-Built ServerArgs
```python
from sglang.srt.server_args import ServerArgs
from sglang.srt.entrypoints.engine import Engine

server_args = ServerArgs(
    model_path="meta-llama/Llama-3.1-8B-Instruct",
    tp_size=4,
    dtype="bfloat16",
    mem_fraction_static=0.9,
)
engine = Engine(server_args=server_args)
```