Implementation:Sgl project Sglang LWS Prefill Node Manifest

Knowledge Sources	Sgl_project_Sglang
Domains	Kubernetes, Deployment, Disaggregated Serving
Last Updated	2026-02-10 00:00 GMT

Overview

A Kubernetes LeaderWorkerSet (LWS) manifest that deploys the prefill side of a prefill-decode disaggregated serving setup for DeepSeek-R1-0528, running SGLang with tensor parallelism spanning two nodes.

Description

p.yaml defines a LeaderWorkerSet resource for the prefill side of a prefill-decode disaggregated serving deployment. The manifest configures a 2-pod group (1 leader + 1 worker) running SGLang in prefill mode with the following key settings:

Model and Parallelism:

Model: DeepSeek-R1-0528 (large MoE model)
Tensor parallelism: TP=16 across 2 nodes (8 GPUs each)
Data parallelism: DP=16 with DP attention enabled
MoE backend: DeepEP with dynamic expert dispatch and the deepseek EPLB algorithm
32 redundant experts for load balancing
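
The parallelism figures above have to agree with the group size, since TP=16 only works if the group exposes 16 GPUs in total. The following is a minimal sanity-check sketch; the helper name total_gpus is hypothetical and is not part of the manifest or of SGLang.

```shell
# Hypothetical helper: number of GPUs available to one LeaderWorkerSet group.
total_gpus() {
  local pods_per_group=$1 gpus_per_pod=$2
  echo $(( pods_per_group * gpus_per_pod ))
}

# Values taken from this page: group size 2, 8 GPUs per pod, TP=16.
TP_SIZE=16
WORLD=$(total_gpus 2 8)

if [ "$WORLD" -eq "$TP_SIZE" ]; then
  echo "ok: TP=$TP_SIZE matches $WORLD GPUs in the group"
else
  echo "mismatch: TP=$TP_SIZE but the group has $WORLD GPUs" >&2
fi
```

If the group size or the per-pod GPU limit is changed, the TP value would presumably need to change with it.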

Network Configuration:

RDMA/InfiniBand via mlx5_bond_0 through mlx5_bond_3
NVSHMEM with IB GID index 3 and NIC-to-PE mapping enabled
NCCL with 8 queue pairs (QPs) per connection, IB traffic class 136, and data split across QPs
Host networking enabled for high-performance inter-node communication
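
Because the pods use host networking and the host's RDMA devices, it can help to confirm that the four bonded devices named above exist on a node before scheduling onto it. The sketch below is a hypothetical preflight helper, not something shipped with the manifest; it assumes the usual Linux sysfs location /sys/class/infiniband when run on a real node.

```shell
# Hypothetical preflight: print the expected RDMA devices that are missing.
# $1 is the directory to scan; on a Linux host with RDMA drivers loaded
# this is normally /sys/class/infiniband.
missing_ib_devices() {
  local dir=$1 dev missing=""
  for dev in mlx5_bond_0 mlx5_bond_1 mlx5_bond_2 mlx5_bond_3; do
    [ -e "$dir/$dev" ] || missing="$missing $dev"
  done
  # Strip the leading space left by the accumulation above.
  echo "${missing# }"
}
```

On a node, missing_ib_devices /sys/class/infiniband prints nothing when all four devices are present.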

Memory and Compute:

8 NVIDIA GPUs per pod
Static memory fraction: 0.7
Chunked prefill size: 524288
Max prefill tokens: 32768
Page size: 64
Context length: 32768
Max running requests: 1024
Radix cache disabled

Volume Mounts:

Shared memory (tmpfs) at /dev/shm
Model weights at /work/models
InfiniBand devices at /dev/infiniband
Fused MoE Triton configs
SGLang cache directory

Usage

Apply this manifest to a Kubernetes cluster that has the LeaderWorkerSet controller installed. The target nodes must carry the pd=yes label, provide 8 NVIDIA GPUs each, have RDMA/InfiniBand networking, and hold the DeepSeek-R1-0528 model weights pre-staged on local disk.

Code Reference

Source Location

Repository: Sgl_project_Sglang
File: docs/references/multi_node_deployment/lws_pd/lws-examples/p.yaml
Lines: 1-305

Schema Structure

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseekr10528-prefill-main
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      spec:
        containers:
        - command: [python3, -m, sglang.launch_server, ...]
          env: [...]
          image: lmsysorg/sglang:latest
          resources:
            limits:
              nvidia.com/gpu: "8"
    restartPolicy: RecreateGroupOnPodRestart
    size: 2
    workerTemplate:
      spec:
        containers:
        - command: [python3, -m, sglang.launch_server, ...]

Import

N/A. This is a Kubernetes YAML manifest, not an importable module.

I/O Contract

Inputs

Name	Type	Required	Description
Model weights	Directory	Yes	DeepSeek-R1-0528 model files at /data1/maas_hosted_models/models/DeepSeek-R1-0528/
InfiniBand devices	Device files	Yes	RDMA devices at /dev/infiniband
Fused MoE configs	Directory	Yes	Triton kernel configs at /data1/maas_hosted_models/models/fused_moe_triton/configs
Node labels	K8s labels	Yes	Nodes must have pd=yes label

Outputs

Name	Type	Description
SGLang server	TCP service	Prefill server listening on port 30000 (leader) and 30001 (worker)
Readiness probe	TCP check	Health check on port 30000 every 30 seconds
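
The readiness probe is a plain TCP check, so it can be approximated by hand from any machine that can reach the leader. The helper below is a hypothetical sketch: it assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available, and it says nothing about whether the server is healthy beyond accepting a connection.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens within 2 s.
# Uses bash's /dev/tcp pseudo-device and coreutils `timeout`.
tcp_ready() {
  local host=$1 port=$2
  timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}
```

For example, tcp_ready "$LEADER_IP" 30000 && echo ready, where LEADER_IP is a placeholder for the leader pod's address.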

Usage Examples

Deploying the Prefill Node

kubectl apply -f p.yaml
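
In practice the apply step is usually surrounded by a few checks. The following sketch wraps them in a hypothetical deploy_prefill function with a dry-run switch; the function and the DRY_RUN variable are illustrative, while the CRD name, the pd=yes label, and the leaderworkerset.sigs.k8s.io/name pod label are the ones used by LeaderWorkerSet and this manifest.

```shell
# Print each command instead of running it when DRY_RUN is non-empty.
run() {
  if [ -n "${DRY_RUN:-}" ]; then echo "+ $*"; else "$@"; fi
}

deploy_prefill() {
  # The LeaderWorkerSet CRD must exist before the manifest can be applied.
  run kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io || return 1
  # List the nodes that carry the label required by the manifest.
  run kubectl get nodes -l pd=yes || return 1
  run kubectl apply -f p.yaml || return 1
  # LWS labels each pod with the name of its owning LeaderWorkerSet.
  run kubectl get pods \
    -l leaderworkerset.sigs.k8s.io/name=deepseekr10528-prefill-main -o wide
}
```

Running DRY_RUN=1 deploy_prefill prints the four kubectl commands without contacting a cluster.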

Key SGLang Launch Arguments

python3 -m sglang.launch_server \
    --port 30000 \
    --host 0.0.0.0 \
    --model-path /work/models \
    --disaggregation-mode prefill \
    --tp 16 \
    --dp-size 16 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --moe-a2a-backend deepep \
    --ep-dispatch-algorithm dynamic \
    --eplb-algorithm deepseek \
    --ep-num-redundant-experts 32 \
    --disable-radix-cache \
    --context-length 32768
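
The listing above shows only the key arguments and leaves out the multi-node rendezvous flags. As a hedged sketch of how those are typically derived in an LWS deployment: the LeaderWorkerSet controller injects LWS_LEADER_ADDRESS, LWS_GROUP_SIZE, and LWS_WORKER_INDEX into each pod, and SGLang accepts --dist-init-addr, --nnodes, and --node-rank. The helper name multinode_args, the DIST_INIT_PORT variable, and the default port 20000 are assumptions made for illustration and are not taken from p.yaml.

```shell
# Hypothetical helper: print the multi-node flags for one pod in the group.
# LWS_LEADER_ADDRESS, LWS_GROUP_SIZE and LWS_WORKER_INDEX are injected by the
# LeaderWorkerSet controller; the rendezvous port is an assumed placeholder.
multinode_args() {
  local port=${DIST_INIT_PORT:-20000}
  echo "--dist-init-addr ${LWS_LEADER_ADDRESS}:${port}" \
       "--nnodes ${LWS_GROUP_SIZE}" \
       "--node-rank ${LWS_WORKER_INDEX}"
}
```

The leader has worker index 0 and each worker has a higher index, so the same function can serve both the leader and worker templates.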

Related Pages

Environment:Sgl_project_Sglang_Kubernetes
