Workflow: SGLang Offline Batch Inference
| Knowledge Sources | |
|---|---|
| Domains | LLM_Inference, Batch_Processing |
| Last Updated | 2026-02-09 00:00 GMT |
Overview
End-to-end process for running offline batch inference on large language models using the SGLang Engine API without launching a persistent server.
Description
This workflow covers the standard procedure for performing high-throughput batch text generation using SGLang's offline Engine. The Engine is instantiated directly in Python, processes a list of prompts in a single call, and returns all generated outputs. This approach is ideal for batch workloads where per-request latency is less critical than aggregate throughput. The workflow supports both synchronous and asynchronous generation patterns, configurable sampling parameters, and speculative decoding for accelerated generation.
Usage
Execute this workflow when you have a collection of prompts to process offline (not in a live serving scenario) and want to maximize throughput. Typical use cases include dataset annotation, synthetic data generation, bulk text completion, and evaluation benchmarks. This workflow requires GPU resources and a HuggingFace model path.
Execution Steps
Step 1: Configure Server Arguments
Define the model path, tensor parallelism size, and other engine configuration parameters. These arguments mirror the server launch arguments but are passed directly to the Engine constructor. Key settings include model path, quantization method, memory fraction, and GPU configuration.
Key considerations:
- Specify the correct model path (HuggingFace hub ID or local path)
- Set tensor parallelism size matching available GPUs
- Configure memory fraction for optimal KV cache allocation
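As a minimal sketch, the engine configuration can be collected in a plain dictionary and unpacked into the Engine constructor later. The keyword names below (model_path, tp_size, mem_fraction_static) mirror SGLang's server launch arguments; the model ID and values are placeholders, not recommendations:

```python
# Hypothetical engine configuration; keys mirror SGLang's server arguments.
engine_args = {
    "model_path": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # HF hub ID or local path
    "tp_size": 2,                 # tensor parallelism across 2 GPUs (example value)
    "mem_fraction_static": 0.85,  # GPU memory fraction for weights + KV cache
}
```

Keeping the arguments in a dictionary makes it easy to log the exact configuration alongside the batch results.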
Step 2: Initialize the SGLang Engine
Instantiate the SGLang Engine with the configured arguments. The Engine loads the model weights, initializes the KV cache, compiles CUDA graphs, and starts internal worker processes. This is a one-time cost that amortizes over many generation calls.
Key considerations:
- The Engine uses multiprocessing with the spawn start method, so the entry point must be guarded by if __name__ == "__main__"
- Engine initialization downloads model weights on first run
- For speculative decoding, provide draft model path and speculative parameters
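A sketch of engine initialization under these constraints follows; it requires GPU resources and an installed sglang package, and the model path and speculative-decoding keyword names are illustrative assumptions rather than a tested configuration:

```python
import sglang as sgl

if __name__ == "__main__":  # required: the Engine spawns worker processes
    llm = sgl.Engine(
        model_path="meta-llama/Meta-Llama-3.1-8B-Instruct",  # placeholder model
        tp_size=2,
        # Optional speculative decoding (illustrative argument names):
        # speculative_algorithm="EAGLE",
        # speculative_draft_model_path="path/to/draft-model",
    )
```

On the first run, this call also downloads the model weights, so expect the one-time cost noted above.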
Step 3: Prepare Prompts and Sampling Parameters
Assemble the list of input prompts and define sampling parameters such as temperature, top_p, top_k, max_new_tokens, and stop sequences. Prompts can be raw text strings or pre-tokenized token ID lists.
Key considerations:
- Sampling parameters are passed as a dictionary
- For deterministic output, set temperature to 0
- Prompts can be batched in a single list for efficient processing
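A minimal sketch of the inputs for this step; the prompt strings are arbitrary examples, and the sampling keys follow the names used in the text above:

```python
prompts = [
    "The capital of France is",
    "Write a haiku about autumn:",
    "Explain paged attention in one sentence:",
]

# Sampling parameters are passed as a plain dictionary.
sampling_params = {
    "temperature": 0.0,   # 0 => deterministic (greedy) decoding
    "top_p": 0.95,
    "top_k": 50,
    "max_new_tokens": 128,
    "stop": ["\n\n"],     # stop sequences
}
```

The same sampling_params dictionary is applied to every prompt in the batch when passed to a single generate call.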
Step 4: Execute Batch Generation
Call the Engine's generate method with the prompt list and sampling parameters. The Engine internally batches requests, manages continuous batching and paged attention, and returns results for all prompts. For non-blocking workloads, use async_generate with asyncio for concurrent task submission.
Key considerations:
- Synchronous generate blocks until all prompts complete
- Async generate allows overlapping computation and I/O
- Each output contains the generated text and optional metadata (token IDs, logprobs)
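The two generation paths can be sketched as follows. This assumes a running GPU environment with sglang installed; the model path, prompts, and parameters are placeholders:

```python
import asyncio

import sglang as sgl

if __name__ == "__main__":
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    prompts = ["The capital of France is", "1 + 1 ="]
    sampling_params = {"temperature": 0.0, "max_new_tokens": 32}

    # Synchronous path: one call blocks until every prompt has completed.
    outputs = llm.generate(prompts, sampling_params)

    # Asynchronous path: submit each prompt as a coroutine and gather them,
    # overlapping generation with other I/O-bound work.
    async def run_all():
        tasks = [llm.async_generate(p, sampling_params) for p in prompts]
        return await asyncio.gather(*tasks)

    async_outputs = asyncio.run(run_all())
    llm.shutdown()
```

Internally the Engine applies continuous batching either way, so the synchronous path is usually simplest for pure offline jobs; reach for async_generate when generation must overlap with other work.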
Step 5: Process and Collect Results
Iterate over the returned outputs to extract generated text, token counts, and any requested metadata. Results are returned in the same order as the input prompts.
Key considerations:
- Output format is a list of dictionaries with 'text' key
- Additional fields available when requested (e.g., logprobs, token IDs)
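Illustrative post-processing over outputs shaped like the description above: a list of dicts aligned with the input prompts, each carrying a 'text' key plus optional metadata. The 'meta_info'/'completion_tokens' field names and the literal values here are assumptions for the sketch:

```python
# Mocked outputs standing in for the Engine's return value (same order as inputs).
outputs = [
    {"text": " Paris.", "meta_info": {"completion_tokens": 2}},
    {"text": " Leaves drift slowly down", "meta_info": {"completion_tokens": 7}},
]

# Extract generated text and aggregate token counts.
texts = [o["text"] for o in outputs]
total_tokens = sum(o["meta_info"]["completion_tokens"] for o in outputs)
```

Because results preserve input order, texts[i] can be paired back to prompts[i] without any bookkeeping.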
Step 6: Shutdown the Engine
Explicitly shut down the Engine to release GPU memory and terminate worker processes. This is important for scripts that continue execution after generation or run multiple engine instances sequentially.
Key considerations:
- Call llm.shutdown() to cleanly release resources
- Required when running multiple engines in the same process lifecycle
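A try/finally pattern keeps the shutdown guaranteed even if generation raises; as above, this sketch assumes GPU resources and a placeholder model path:

```python
import sglang as sgl

if __name__ == "__main__":
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
    try:
        outputs = llm.generate(["Hello"], {"max_new_tokens": 8})
    finally:
        # Release GPU memory and terminate worker processes even on error.
        llm.shutdown()
```

This matters most when a script launches several engines sequentially: without the shutdown, the previous engine's workers keep holding GPU memory and the next initialization can fail.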