Workflow:Sgl project Sglang Offline Batch Inference

From Leeroopedia



Knowledge Sources
Domains LLM_Inference, Batch_Processing
Last Updated 2026-02-09 00:00 GMT

Overview

End-to-end process for running offline batch inference on large language models using the SGLang Engine API without launching a persistent server.

Description

This workflow covers the standard procedure for performing high-throughput batch text generation using SGLang's offline Engine. The Engine is instantiated directly in Python, processes a list of prompts in a single call, and returns all generated outputs. This approach is ideal for batch workloads where per-request latency matters less than aggregate throughput. The workflow supports both synchronous and asynchronous generation patterns, configurable sampling parameters, and speculative decoding for accelerated generation.

Usage

Execute this workflow when you have a collection of prompts to process offline (not in a live serving scenario) and want to maximize throughput. Typical use cases include dataset annotation, synthetic data generation, bulk text completion, and evaluation benchmarks. This workflow requires GPU resources and a HuggingFace model path.

Execution Steps

Step 1: Configure Server Arguments

Define the model path, tensor parallelism size, and other engine configuration parameters. These arguments mirror the server launch arguments but are passed directly to the Engine constructor. Key settings include model path, quantization method, memory fraction, and GPU configuration.

Key considerations:

  • Specify the correct model path (HuggingFace hub ID or local path)
  • Set tensor parallelism size matching available GPUs
  • Configure memory fraction for optimal KV cache allocation
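As a sketch, the settings above map to keyword arguments for the Engine constructor. The model path, GPU count, and memory fraction below are placeholder assumptions, not recommendations; substitute values for your own hardware and checkpoint.

```python
# Engine configuration expressed as a plain dict of constructor kwargs.
engine_args = {
    "model_path": "meta-llama/Llama-3.1-8B-Instruct",  # HF hub ID or local dir
    "tp_size": 2,                     # tensor parallelism across 2 GPUs
    "mem_fraction_static": 0.85,      # GPU memory fraction for weights + KV cache
    "dtype": "auto",                  # infer dtype from the checkpoint
}
```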

Step 2: Initialize the SGLang Engine

Instantiate the SGLang Engine with the configured arguments. The Engine loads the model weights, initializes the KV cache, compiles CUDA graphs, and starts internal worker processes. This is a one-time cost that amortizes over many generation calls.

Key considerations:

  • The Engine uses multiprocessing with spawn, so engine construction must be guarded by an if __name__ == "__main__": check
  • Engine initialization downloads model weights on first run
  • For speculative decoding, provide draft model path and speculative parameters
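A minimal initialization sketch, under these assumptions: engine_kwargs is a hypothetical helper (not part of SGLang), the model path is a placeholder, and the speculative-decoding key names follow SGLang's server-argument naming. sgl.Engine and the __main__ guard requirement come from SGLang itself.

```python
def engine_kwargs(model_path, tp_size=1, mem_fraction_static=0.85,
                  speculative_args=None):
    # Assemble keyword arguments for sgl.Engine; speculative_args is an
    # optional dict of speculative-decoding settings (draft model path etc.).
    kwargs = {
        "model_path": model_path,
        "tp_size": tp_size,
        "mem_fraction_static": mem_fraction_static,
    }
    if speculative_args:
        kwargs.update(speculative_args)
    return kwargs

def main():
    # Call main() only under an `if __name__ == "__main__":` guard --
    # the Engine spawns worker processes, and unguarded construction
    # would re-execute the module in every worker.
    import sglang as sgl
    llm = sgl.Engine(**engine_kwargs("meta-llama/Llama-3.1-8B-Instruct"))
    return llm
```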

Step 3: Prepare Prompts and Sampling Parameters

Assemble the list of input prompts and define sampling parameters such as temperature, top_p, top_k, max_new_tokens, and stop sequences. Prompts can be raw text strings or pre-tokenized token ID lists.

Key considerations:

  • Sampling parameters are passed as a dictionary
  • For deterministic output, set temperature to 0
  • Prompts can be batched in a single list for efficient processing
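A batch of prompts and a sampling dictionary might look like the following; the specific prompt strings and parameter values are illustrative, not tuned recommendations.

```python
# Illustrative prompt batch: a single list covering the whole workload.
prompts = [
    "The capital of France is",
    "Write one sentence about GPUs:",
]

# Sampling parameters are passed as a plain dict.
sampling_params = {
    "temperature": 0,         # 0 => greedy, deterministic decoding
    "top_p": 0.95,
    "top_k": 50,
    "max_new_tokens": 64,
    "stop": ["\n\n"],         # stop sequences end generation early
}
```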

Step 4: Execute Batch Generation

Call the Engine's generate method with the prompt list and sampling parameters. The Engine internally batches requests, manages continuous batching and paged attention, and returns results for all prompts. For non-blocking workloads, use async_generate with asyncio for concurrent task submission.

Key considerations:

  • Synchronous generate blocks until all prompts complete
  • Async generate allows overlapping computation and I/O
  • Each output contains the generated text and optional metadata (token IDs, logprobs)
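Both call patterns can be sketched as thin wrappers: generate and async_generate are the actual Engine methods, while the wrapper functions themselves are hypothetical conveniences.

```python
import asyncio

def generate_sync(llm, prompts, sampling_params):
    # Synchronous path: blocks until every prompt in the batch finishes.
    return llm.generate(prompts, sampling_params)

async def generate_concurrent(llm, prompts, sampling_params):
    # Async path: submit each prompt as its own task; the engine still
    # batches them internally via continuous batching.
    tasks = [llm.async_generate(p, sampling_params) for p in prompts]
    return await asyncio.gather(*tasks)
```

From a script, the async variant is driven with asyncio.run(generate_concurrent(llm, prompts, sampling_params)).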

Step 5: Process and Collect Results

Iterate over the returned outputs to extract generated text, token counts, and any requested metadata. Results are returned in the same order as the input prompts.

Key considerations:

  • Output format is a list of dictionaries, each with a 'text' key
  • Additional fields available when requested (e.g., logprobs, token IDs)
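Result collection is plain dictionary handling. summarize_outputs is a hypothetical helper, and the meta_info field name is an assumption about the output shape described above.

```python
def summarize_outputs(outputs):
    # Each output is a dict with a "text" key; "meta_info" carries token
    # counts (and logprobs, when requested). Results keep prompt order.
    texts = [o["text"] for o in outputs]
    total_completion_tokens = sum(
        o.get("meta_info", {}).get("completion_tokens", 0) for o in outputs
    )
    return texts, total_completion_tokens
```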

Step 6: Shutdown the Engine

Explicitly shut down the Engine to release GPU memory and terminate worker processes. This is important for scripts that continue execution after generation or run multiple engine instances sequentially.

Key considerations:

  • Call llm.shutdown() to cleanly release resources
  • Required when running multiple engines in the same process lifecycle
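One way to guarantee cleanup is a small context manager: engine_session is a hypothetical wrapper, while shutdown() is the real SGLang call it delegates to.

```python
from contextlib import contextmanager

@contextmanager
def engine_session(engine):
    # Guarantee shutdown() runs even if generation raises, so a second
    # Engine started later in the same process can claim the GPUs.
    try:
        yield engine
    finally:
        engine.shutdown()
```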
