
Workflow: SGLang Online Serving With OpenAI API

From Leeroopedia


Knowledge Sources
Domains LLM_Inference, API_Serving, Production_Deployment
Last Updated 2026-02-09 00:00 GMT

Overview

End-to-end process for deploying a large language model as a persistent HTTP server with an OpenAI-compatible API using SGLang.

Description

This workflow covers launching an SGLang server that exposes OpenAI-compatible endpoints (chat completions, text completions, embeddings) and sending requests to it using standard OpenAI client libraries. The server handles continuous batching, paged attention, RadixAttention prefix caching, and efficient GPU memory management automatically. It supports features like streaming, LoRA adapter switching, structured outputs, and tensor parallelism for multi-GPU serving.

Usage

Execute this workflow when you need a persistent, low-latency inference endpoint for real-time applications such as chatbots, code assistants, API backends, or any service requiring an OpenAI-compatible interface. This is the primary production deployment pattern for SGLang.

Execution Steps

Step 1: Launch the SGLang Server

Start the SGLang server from the command line, specifying the model path, port, and serving configuration. The server initializes the model, allocates GPU memory for the KV cache, captures CUDA graphs for common batch sizes, and begins listening for HTTP requests.

Key considerations:

  • Use python -m sglang.launch_server with --model-path and --port arguments
  • Set --tp (tensor parallelism) for multi-GPU serving
  • Configure --mem-fraction-static to control KV cache memory allocation
  • For LoRA serving, add --enable-lora and --lora-paths flags
  • Server exposes /v1/chat/completions, /v1/completions, and /v1/embeddings endpoints
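The launch described above can be sketched as a single command. The model path and port below are placeholder values, not prescribed by this workflow; only the flags named in this step are used:

```shell
# Single-GPU launch (example model path and port; adjust both to your setup):
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --port 30000

# Multi-GPU variant: shard across 2 GPUs with --tp and cap the static
# KV-cache allocation at 80% of GPU memory with --mem-fraction-static:
# python -m sglang.launch_server \
#   --model-path meta-llama/Llama-3.1-8B-Instruct \
#   --port 30000 --tp 2 --mem-fraction-static 0.8
```

Once initialization finishes, the server listens on the given port and serves the /v1 endpoints listed above.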

Step 2: Configure the OpenAI Client

Set up the OpenAI Python client (or any HTTP client) to point to the SGLang server URL. Since SGLang implements the OpenAI API specification, existing OpenAI client code works with minimal changes — only the base_url needs to be updated.

Key considerations:

  • Set base_url to http://{host}:{port}/v1
  • The api_key can be any string (server does not validate by default)
  • Compatible with openai Python package, curl, and any OpenAI-compatible client

Step 3: Send Chat Completion Requests

Submit chat completion requests using the standard messages format with system, user, and assistant roles. The server processes requests with continuous batching, automatically managing concurrent requests for optimal throughput.

Key considerations:

  • Standard OpenAI chat format with role/content message objects
  • Supports streaming via stream=True parameter
  • Supports response prefilling with continue_final_message flag
  • For LoRA adapters, use model:adapter_name syntax in the model field
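A sketch of a chat request under those conventions; the helper names below are illustrative, and the default system prompt is an assumption:

```python
def build_messages(user_msg, system_msg="You are a helpful assistant."):
    """Standard OpenAI chat format: a list of role/content message objects."""
    return [
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ]

def chat(client, model, user_msg, **params):
    """Send one chat completion request.

    `model` may use the model:adapter_name form to select a LoRA adapter
    when the server was launched with --enable-lora.
    """
    resp = client.chat.completions.create(
        model=model,
        messages=build_messages(user_msg),
        **params,
    )
    return resp.choices[0].message.content
```

With the client from Step 2, a call looks like `chat(client, "meta-llama/Llama-3.1-8B-Instruct", "Hello")`, with any extra sampling parameters (e.g. `temperature`, `max_tokens`) passed through `**params`.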

Step 4: Handle Streaming Responses

For latency-sensitive applications, enable streaming to receive tokens as they are generated. The server sends Server-Sent Events (SSE) with incremental token chunks.

Key considerations:

  • Set stream=True in the request
  • Process chunks iteratively as they arrive
  • Each chunk contains a delta with the new content
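The streaming loop can be sketched as a generator over those delta chunks (the function name is illustrative; the chunk shape follows the OpenAI streaming format):

```python
def stream_chat(client, model, messages):
    """Yield content deltas from a streaming chat completion (stream=True).

    Chunks whose delta carries no new content (e.g. the final chunk) are
    skipped, so callers can join the yielded strings into the full reply.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta
```

A caller can print tokens as they arrive with `for piece in stream_chat(...): print(piece, end="", flush=True)`, which is what makes streaming attractive for latency-sensitive UIs.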

Step 5: Monitor Server Health and Metrics

Use the built-in health and metrics endpoints to monitor server status, request throughput, cache hit rates, and GPU utilization. SGLang exposes Prometheus-compatible metrics for integration with monitoring stacks.

Key considerations:

  • /health endpoint for liveness checks
  • /metrics endpoint for Prometheus scraping
  • Grafana dashboards available in the examples/monitoring directory
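A liveness probe against the /health endpoint can be sketched with the standard library alone (the function name and the assumption that any HTTP 200 means healthy are illustrative):

```python
import urllib.request

def is_healthy(base_url, timeout=2.0):
    """Return True if the server's /health endpoint answers with HTTP 200.

    Any connection error or timeout is treated as unhealthy, which makes
    this safe to call from a periodic liveness check.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False
```

The same pattern applies to /metrics, except that its Prometheus text output is normally scraped by a Prometheus server rather than parsed by hand.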

Execution Diagram

GitHub URL

Workflow Repository