Implementation:Sgl project Sglang LWS Prefill Node Manifest
| Knowledge Sources | |
|---|---|
| Domains | Kubernetes, Deployment, Disaggregated Serving |
| Last Updated | 2026-02-10 00:00 GMT |
Overview
A Kubernetes LeaderWorkerSet (LWS) manifest that configures the prefill node for disaggregated serving of DeepSeek-R1-0528 using SGLang with multi-node tensor parallelism.
Description
p.yaml defines a LeaderWorkerSet resource for the prefill side of a prefill-decode disaggregated serving deployment. The manifest configures a 2-pod group (1 leader + 1 worker) running SGLang in prefill mode with the following key settings:
Model and Parallelism:
- Model: DeepSeek-R1-0528 (large MoE model)
- Tensor parallelism: TP=16 across 2 nodes (8 GPUs each)
- Data parallelism: DP=16 with DP attention enabled
- MoE backend: DeepEP with dynamic expert dispatch and the deepseek EPLB algorithm
- 32 redundant experts for load balancing
Network Configuration:
- RDMA/InfiniBand via mlx5_bond_0 through mlx5_bond_3
- NVSHMEM with IB GID index 3 and NIC-to-PE mapping enabled
- NCCL with 8 queue pairs (QPs) per connection, IB traffic class 136, and data split across QPs
- Host networking enabled for high-performance inter-node communication
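As a rough illustration, the network settings above could appear in a pod template as shown below. This is a sketch, not an excerpt from p.yaml. The environment variable names are the standard NCCL and NVSHMEM ones, the container name is hypothetical, and passing the RDMA devices through `--disaggregation-ib-device` is an assumption about how the device list reaches SGLang.

```yaml
# Sketch only: pod-template fields for the network settings listed above.
spec:
  hostNetwork: true                    # host networking for inter-node traffic
  containers:
    - name: sglang-prefill             # hypothetical container name
      command:
        - python3
        - -m
        - sglang.launch_server
        # Assumption: the RDMA devices are supplied through this flag.
        - --disaggregation-ib-device
        - mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3
        # ... remaining launch arguments omitted
      env:
        - name: NVSHMEM_IB_GID_INDEX
          value: "3"
        - name: NVSHMEM_ENABLE_NIC_PE_MAPPING
          value: "1"
        - name: NCCL_IB_QPS_PER_CONNECTION
          value: "8"
        - name: NCCL_IB_TC
          value: "136"
        - name: NCCL_IB_SPLIT_DATA_ON_QPS
          value: "1"
```

The device names are specific to the cluster the example was written for and would likely need adjusting to match the RDMA devices on other hosts.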
Memory and Compute:
- 8 NVIDIA GPUs per pod
- Memory fraction static: 0.7
- Chunked prefill size: 524288
- Max prefill tokens: 32768
- Page size: 64
- Context length: 32768
- Max running requests: 1024
- Radix cache disabled
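The memory and compute settings above map onto SGLang launch arguments and a GPU resource limit. The container-level fragment below is a sketch of that mapping under the assumption that each setting is passed as a separate command entry; the ordering and surrounding arguments in p.yaml may differ.

```yaml
# Sketch only: memory and compute settings as container fields.
command:
  - python3
  - -m
  - sglang.launch_server
  - --mem-fraction-static
  - "0.7"
  - --chunked-prefill-size
  - "524288"
  - --max-prefill-tokens
  - "32768"
  - --page-size
  - "64"
  - --context-length
  - "32768"
  - --max-running-requests
  - "1024"
  - --disable-radix-cache
  # ... other launch arguments omitted
resources:
  limits:
    nvidia.com/gpu: "8"                # 8 GPUs per pod
```

Numeric values are quoted because entries in a Kubernetes `command` list must be strings.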
Volume Mounts:
- Shared memory (tmpfs) at /dev/shm
- Model weights at /work/models
- InfiniBand devices at /dev/infiniband
- Fused MoE Triton configs
- SGLang cache directory
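A minimal sketch of the first three mounts follows, assuming hostPath volumes for the model weights and InfiniBand devices and a memory-backed emptyDir for shared memory. The volume and container names are hypothetical. The fused MoE config and SGLang cache mounts are left out because their container paths are not given here.

```yaml
# Sketch only: volume and container names are hypothetical.
spec:
  containers:
    - name: sglang-prefill
      volumeMounts:
        - name: dshm
          mountPath: /dev/shm
        - name: model
          mountPath: /work/models
        - name: ib
          mountPath: /dev/infiniband
  volumes:
    - name: dshm
      emptyDir:
        medium: Memory                 # tmpfs-backed shared memory
    - name: model
      hostPath:
        path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/
    - name: ib
      hostPath:
        path: /dev/infiniband
```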
Usage
Apply this manifest to a Kubernetes cluster that has the LeaderWorkerSet controller installed. The cluster needs nodes with 8 NVIDIA GPUs each, RDMA/InfiniBand networking, the pd=yes node label, and the DeepSeek-R1-0528 model weights pre-staged on each node.
Code Reference
Source Location
- Repository: Sgl_project_Sglang
- File: docs/references/multi_node_deployment/lws_pd/lws-examples/p.yaml
- Lines: 1-305
Schema Structure
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseekr10528-prefill-main
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      spec:
        containers:
          - command: [python3, -m, sglang.launch_server, ...]
            env: [...]
            image: lmsysorg/sglang:latest
            resources:
              limits:
                nvidia.com/gpu: "8"
    restartPolicy: RecreateGroupOnPodRestart
    size: 2
    workerTemplate:
      spec:
        containers:
          - command: [python3, -m, sglang.launch_server, ...]
Import
N/A -- This is a Kubernetes YAML manifest.
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| Model weights | Directory | Yes | DeepSeek-R1-0528 model files at /data1/maas_hosted_models/models/DeepSeek-R1-0528/ |
| InfiniBand devices | Device files | Yes | RDMA devices at /dev/infiniband |
| Fused MoE configs | Directory | Yes | Triton kernel configs at /data1/maas_hosted_models/models/fused_moe_triton/configs |
| Node labels | K8s labels | Yes | Nodes must have pd=yes label |
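The node label requirement would typically be expressed as a `nodeSelector` in the pod spec. The fragment below is a sketch of that form; whether p.yaml uses `nodeSelector` or node affinity is an assumption.

```yaml
# Sketch only: schedule pods onto nodes labeled pd=yes.
spec:
  nodeSelector:
    pd: "yes"                          # quoted so YAML does not read it as a boolean
```

A node can be given the label with `kubectl label node <node-name> pd=yes`.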
Outputs
| Name | Type | Description |
|---|---|---|
| SGLang server | TCP service | Prefill server listening on port 30000 (leader) and 30001 (worker) |
| Readiness probe | TCP check | Health check on port 30000 every 30 seconds |
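The readiness check described above corresponds to a TCP socket probe on the container. This sketch shows only the port and interval stated in the table; any other probe fields in p.yaml (initial delay, thresholds) are not reproduced.

```yaml
# Sketch only: TCP readiness probe on the leader port.
readinessProbe:
  tcpSocket:
    port: 30000
  periodSeconds: 30                    # probe every 30 seconds
```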
Usage Examples
Deploying the Prefill Node
kubectl apply -f p.yaml
Key SGLang Launch Arguments
python3 -m sglang.launch_server \
  --port 30000 \
  --host 0.0.0.0 \
  --model-path /work/models \
  --disaggregation-mode prefill \
  --tp 16 \
  --dp-size 16 \
  --enable-dp-attention \
  --enable-dp-lm-head \
  --moe-a2a-backend deepep \
  --ep-dispatch-algorithm dynamic \
  --eplb-algorithm deepseek \
  --ep-num-redundant-experts 32 \
  --disable-radix-cache \
  --context-length 32768