Implementation:Sgl project Sglang LWS Prefill Node Manifest

From Leeroopedia


Knowledge Sources
Domains: Kubernetes, Deployment, Disaggregated Serving
Last Updated: 2026-02-10 00:00 GMT

Overview

A Kubernetes LeaderWorkerSet (LWS) manifest that configures the prefill node for disaggregated serving of DeepSeek-R1-0528 using SGLang with multi-node tensor parallelism.

Description

p.yaml defines a LeaderWorkerSet resource for the prefill side of a prefill-decode disaggregated serving deployment. The manifest configures a 2-pod group (1 leader + 1 worker) running SGLang in prefill mode with the following key settings:

Model and Parallelism:

  • Model: DeepSeek-R1-0528 (large MoE model)
  • Tensor parallelism: TP=16 across 2 nodes (8 GPUs each)
  • Data parallelism: DP=16 with DP attention enabled
  • MoE backend: DeepEP with dynamic expert dispatch and the deepseek EPLB algorithm
  • 32 redundant experts for load balancing

Network Configuration:

  • RDMA/InfiniBand via mlx5_bond_0 through mlx5_bond_3
  • NVSHMEM with IB GID index 3 and NIC-to-PE mapping enabled
  • NCCL with 8 queue pairs (QPs) per connection, IB traffic class 136, and data split across QPs
  • Host networking enabled for high-performance inter-node communication
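
The settings above are typically expressed as a pod-level hostNetwork field plus container environment variables. The fragment below is a sketch, not a copy of p.yaml: the variable names are the standard NCCL and NVSHMEM ones that correspond to the listed settings, and it is an assumption that the manifest uses exactly these. The container name is hypothetical. The per-device mapping for mlx5_bond_0 through mlx5_bond_3 is left out because its port and PE counts are not stated on this page.

```yaml
# Sketch of the pod spec fields behind the network settings (assumed names).
spec:
  hostNetwork: true                        # host networking for inter-node traffic
  containers:
  - name: sglang-leader                    # hypothetical container name
    env:
    - name: NVSHMEM_IB_GID_INDEX           # IB GID index 3
      value: "3"
    - name: NVSHMEM_ENABLE_NIC_PE_MAPPING  # turn on NIC-to-PE mapping
      value: "1"
    - name: NCCL_IB_QPS_PER_CONNECTION     # 8 queue pairs per connection
      value: "8"
    - name: NCCL_IB_TC                     # InfiniBand traffic class 136
      value: "136"
    - name: NCCL_IB_SPLIT_DATA_ON_QPS      # split each message across the QPs
      value: "1"
```

Kubernetes requires env values to be strings, which is why the numbers are quoted.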

Memory and Compute:

  • 8 NVIDIA GPUs per pod
  • Memory fraction static: 0.7
  • Chunked prefill size: 524288
  • Max prefill tokens: 32768
  • Page size: 64
  • Context length: 32768
  • Max running requests: 1024
  • Radix cache disabled
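
Most of these settings correspond to sglang.launch_server arguments; the GPU count is set through the nvidia.com/gpu resource limit instead. As a rough sketch, assuming the manifest passes the arguments through the container command list (as the Schema Structure section suggests), the flags could look like this. The remaining arguments are omitted.

```yaml
# Sketch: memory and compute settings as launch_server arguments.
command:
- python3
- -m
- sglang.launch_server
- --mem-fraction-static
- "0.7"
- --chunked-prefill-size
- "524288"
- --max-prefill-tokens
- "32768"
- --page-size
- "64"
- --context-length
- "32768"
- --max-running-requests
- "1024"
- --disable-radix-cache
# model path, parallelism, and MoE arguments omitted here
```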

Volume Mounts:

  • Shared memory (tmpfs) at /dev/shm
  • Model weights at /work/models
  • InfiniBand devices at /dev/infiniband
  • Fused MoE Triton configs
  • SGLang cache directory
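
A minimal sketch of how three of these mounts might be declared follows. The volume and container names are hypothetical, and using hostPath for the model weights is an assumption based on the weights being pre-staged on the nodes; the host path itself is the one listed under I/O Contract. The fused MoE config and SGLang cache mounts are left out because their container paths are not given on this page.

```yaml
# Sketch of three of the five mounts; volume names are hypothetical.
spec:
  containers:
  - name: sglang-leader            # hypothetical container name
    volumeMounts:
    - name: dshm
      mountPath: /dev/shm
    - name: model
      mountPath: /work/models
    - name: ib
      mountPath: /dev/infiniband
  volumes:
  - name: dshm
    emptyDir:
      medium: Memory               # tmpfs-backed shared memory
  - name: model
    hostPath:                      # assumption: weights live on the node's disk
      path: /data1/maas_hosted_models/models/DeepSeek-R1-0528/
  - name: ib
    hostPath:
      path: /dev/infiniband
```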

Usage

Apply this manifest to a Kubernetes cluster with the LeaderWorkerSet controller installed. The cluster needs nodes with 8 NVIDIA GPUs each and RDMA/InfiniBand networking, with the DeepSeek-R1-0528 model weights pre-staged on those nodes.

Code Reference

Source Location

Schema Structure

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: deepseekr10528-prefill-main
spec:
  leaderWorkerTemplate:
    leaderTemplate:
      spec:
        containers:
        - command: [python3, -m, sglang.launch_server, ...]
          env: [...]
          image: lmsysorg/sglang:latest
          resources:
            limits:
              nvidia.com/gpu: "8"
    restartPolicy: RecreateGroupOnPodRestart
    size: 2
    workerTemplate:
      spec:
        containers:
        - command: [python3, -m, sglang.launch_server, ...]

Import

N/A -- This is a Kubernetes YAML manifest.

I/O Contract

Inputs

Name | Type | Required | Description
Model weights | Directory | Yes | DeepSeek-R1-0528 model files at /data1/maas_hosted_models/models/DeepSeek-R1-0528/
InfiniBand devices | Device files | Yes | RDMA devices at /dev/infiniband
Fused MoE configs | Directory | Yes | Triton kernel configs at /data1/maas_hosted_models/models/fused_moe_triton/configs
Node labels | K8s labels | Yes | Nodes must have the pd=yes label
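
The node label requirement is usually enforced with a nodeSelector in the pod spec of both the leader and worker templates. This is a sketch of that field only, on the assumption that the manifest uses nodeSelector rather than node affinity.

```yaml
# Sketch: schedule pods only onto nodes labeled pd=yes.
spec:
  nodeSelector:
    pd: "yes"    # quoted so YAML reads a string, not the boolean true
```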

Outputs

Name | Type | Description
SGLang server | TCP service | Prefill server listening on port 30000 (leader) and 30001 (worker)
Readiness probe | TCP check | Health check on port 30000 every 30 seconds
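
A TCP readiness check with this port and interval corresponds to the following container-level probe. This is a sketch; any other probe fields the manifest may set, such as an initial delay, are not stated on this page and are omitted.

```yaml
# Sketch: TCP readiness probe on the leader's serving port.
readinessProbe:
  tcpSocket:
    port: 30000
  periodSeconds: 30    # probe every 30 seconds
```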

Usage Examples

Deploying the Prefill Node

kubectl apply -f p.yaml

Key SGLang Launch Arguments

python3 -m sglang.launch_server \
    --port 30000 \
    --host 0.0.0.0 \
    --model-path /work/models \
    --disaggregation-mode prefill \
    --tp 16 \
    --dp-size 16 \
    --enable-dp-attention \
    --enable-dp-lm-head \
    --moe-a2a-backend deepep \
    --ep-dispatch-algorithm dynamic \
    --eplb-algorithm deepseek \
    --ep-num-redundant-experts 32 \
    --disable-radix-cache \
    --context-length 32768
