Environment:Kserve Kserve Leader Worker Set

From Leeroopedia
Knowledge Sources
Domains Infrastructure, Distributed_Computing
Last Updated 2026-02-13 14:00 GMT

Overview

LeaderWorkerSet (LWS) v0.7.0, used to manage multi-node distributed inference workloads with a leader-worker topology.

Description

LWS is a Kubernetes API that manages groups of pods with a leader-worker topology. In KServe, it enables multi-node distributed inference for large models that span multiple GPU nodes (e.g., DeepSeek-R1 with data and expert parallelism). The LWS controller coordinates pod placement and startup ordering, and provides leader address discovery through environment variables.

Usage

Use this environment for multi-node distributed inference where a single model is too large to fit on one node's GPUs. It is required for the data parallelism (DP) and expert parallelism (EP) deployment patterns.

System Requirements

| Category | Requirement | Notes |
|---|---|---|
| Kubernetes | >= 1.24 | Base requirement |
| LWS | v0.7.0 | From `kserve-deps.env` |
| Helm | v3.16.3+ | For LWS installation |

Dependencies

Helm Charts

  • `oci://registry.k8s.io/lws/charts/lws`

Credentials

No additional credentials required.

Quick Install

# LWS_VERSION is defined in kserve-deps.env (v0.7.0)
helm install lws oci://registry.k8s.io/lws/charts/lws \
    --version "${LWS_VERSION}" -n lws-system --create-namespace
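
After the install, it can be useful to block until the controller is actually available before creating any LeaderWorkerSet resources. The following is a minimal sketch, not part of the KServe install scripts: the function name `lws_wait_ready` and the default timeout of `300s` are assumptions, while the `lws-system` namespace is the one used by the install command.

```shell
#!/bin/sh
# Hypothetical helper: wait for every Deployment in lws-system to report
# the Available condition. The first argument overrides the timeout
# (the 300s default is an assumption for this sketch).
lws_wait_ready() {
    kubectl wait deployment --all \
        --for=condition=Available \
        --namespace lws-system \
        --timeout="${1:-300s}"
}

# Usage (requires cluster access):
#   lws_wait_ready 120s
```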

Code Evidence

LWS version from `kserve-deps.env:38`:

LWS_VERSION=v0.7.0

Go module dependency from `go.mod`:

sigs.k8s.io/lws v0.7.0

LWS environment variables used in worker pods:

LWS_LEADER_ADDRESS   # Address of the leader pod
LWS_WORKER_INDEX     # Index of the current worker
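
As a minimal sketch of how a container entrypoint might consume these variables, the helper below derives a role from `LWS_WORKER_INDEX`. It assumes the LWS convention that the leader pod has index 0; the function name `lws_role` and its error message are hypothetical and not part of KServe or LWS.

```shell
#!/bin/sh
# Hypothetical helper: decide the pod's role from the LWS-injected index.
# Assumes the leader pod is the one with LWS_WORKER_INDEX=0.
lws_role() {
    index="${LWS_WORKER_INDEX:-}"
    case "$index" in
        '' | *[!0-9]*)
            # Unset, empty, or non-numeric: not running under LWS as expected.
            echo "LWS_WORKER_INDEX is unset or not a number" >&2
            return 1
            ;;
    esac
    if [ "$index" -eq 0 ]; then
        echo leader
    else
        echo worker
    fi
}
```

A worker could then use the result to decide whether to start a serving process itself or to connect to the address in `LWS_LEADER_ADDRESS`.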

Common Errors

| Error Message | Cause | Solution |
|---|---|---|
| `LeaderWorkerSet CRD not found` | LWS not installed | Install via Helm chart |
| Worker pods stuck in Pending | Insufficient GPU resources across nodes | Ensure enough multi-GPU nodes available |
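
To tell the first error apart from other failures, one option is to check for the CRD directly before deploying. This is a sketch only: the function name `lws_crd_installed` is hypothetical, and the CRD name `leaderworkersets.leaderworkerset.x-k8s.io` is assumed from the upstream LWS API group, so it is worth confirming against the installed version.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeeds only if the LeaderWorkerSet CRD exists.
# CRD name assumed from the upstream LWS API group (leaderworkerset.x-k8s.io).
lws_crd_installed() {
    kubectl get crd leaderworkersets.leaderworkerset.x-k8s.io >/dev/null 2>&1
}

# Usage (requires cluster access):
#   lws_crd_installed || echo "LWS not installed; install the Helm chart first"
```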

Compatibility Notes

  • Namespace: LWS controller runs in `lws-system` namespace
  • LLMIsvc: Required for LLMInferenceService multi-node deployments
  • Replica calculation: Total replicas = `data` / `dataLocal` (e.g., 16/8 = 2 nodes)
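
The replica calculation in the last bullet can be sketched as integer arithmetic. The function name `lws_replicas` is hypothetical, and rejecting a `data` value that is not a multiple of `dataLocal` is an assumption made for this sketch rather than documented KServe behavior.

```shell
#!/bin/sh
# Hypothetical helper: total replicas (nodes) = data / dataLocal.
# Assumption for this sketch: both values are positive integers and
# data divides evenly by dataLocal.
lws_replicas() {
    data="$1"
    data_local="$2"
    if [ "$data" -le 0 ] || [ "$data_local" -le 0 ] \
        || [ $((data % data_local)) -ne 0 ]; then
        echo "data ($data) must be a positive multiple of dataLocal ($data_local)" >&2
        return 1
    fi
    echo $((data / data_local))
}

lws_replicas 16 8   # prints 2
```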
