Environment:Kserve Kserve Leader Worker Set
| Knowledge Sources | |
|---|---|
| Domains | Infrastructure, Distributed_Computing |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
LeaderWorkerSet (LWS) v0.7.0 environment for running multi-node distributed inference workloads with a leader-worker topology.
Description
LWS is a Kubernetes API that manages groups of pods with a leader-worker topology. In KServe, it enables multi-node distributed inference for large models that span multiple GPU nodes (e.g., DeepSeek-R1 with data and expert parallelism). The LWS controller coordinates pod placement and startup ordering, and it exposes the leader's address to the pods through environment variables.
Usage
Use this environment for multi-node distributed inference, where a single model is too large to fit on one node's GPUs. It is required for the data parallelism (DP) and expert parallelism (EP) deployment patterns.
System Requirements
| Category | Requirement | Notes |
|---|---|---|
| Kubernetes | >= 1.24 | Base requirement |
| LWS | v0.7.0 | From kserve-deps.env |
| Helm | v3.16.3+ | For LWS installation |
Dependencies
Helm Charts
- `oci://registry.k8s.io/lws/charts/lws`
Credentials
No additional credentials required.
Quick Install
helm install lws oci://registry.k8s.io/lws/charts/lws \
--version "${LWS_VERSION}" -n lws-system --create-namespace
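The command above expects `LWS_VERSION` to be set in the shell. The following is a minimal sketch of one way to populate it, assuming the pin is stored as a plain `LWS_VERSION=...` line in an env file such as `kserve-deps.env`. The helper name `read_lws_version` and its file-path argument are hypothetical and not part of KServe.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the LWS_VERSION=... line in an env file.
# Assumes a plain KEY=VALUE line with no quoting and no "export" prefix.
read_lws_version() {
  local env_file="$1"
  local version
  # Keep only the text after "LWS_VERSION=" and take the first match.
  version="$(sed -n 's/^LWS_VERSION=//p' "$env_file" | head -n 1)"
  if [ -z "$version" ]; then
    echo "LWS_VERSION not found in ${env_file}" >&2
    return 1
  fi
  echo "$version"
}
```

For example, running `LWS_VERSION="$(read_lws_version kserve-deps.env)"` from the directory that holds the file would set the variable before the Helm command is invoked.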
Code Evidence
LWS version from `kserve-deps.env:38`:
LWS_VERSION=v0.7.0
Go module dependency from `go.mod`:
sigs.k8s.io/lws v0.7.0
LWS environment variables used in worker pods:
LWS_LEADER_ADDRESS # Address of the leader pod
LWS_WORKER_INDEX # Index of the current worker
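How a container entrypoint might consume these variables can be sketched as follows. This is a minimal illustration, assuming the leader pod reports worker index 0. The function names `lws_role` and `lws_summary` are hypothetical and not part of LWS or KServe.

```shell
#!/usr/bin/env bash
# Sketch: derive this pod's role from the LWS-provided environment variables.
# Assumption: the leader pod carries worker index 0; any other index is a worker.
lws_role() {
  case "${LWS_WORKER_INDEX:-}" in
    "") echo "unknown" ;;  # variable not set, e.g. running outside an LWS group
    0)  echo "leader" ;;
    *)  echo "worker" ;;
  esac
}

# Print a one-line summary that is convenient for startup logs.
lws_summary() {
  echo "role=$(lws_role) index=${LWS_WORKER_INDEX:-unset} leader=${LWS_LEADER_ADDRESS:-unset}"
}
```

A startup script could call `lws_summary` once and then branch on `lws_role` to decide whether to start the serving process as the leader or to join the address in `LWS_LEADER_ADDRESS`.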
Common Errors
| Error Message | Cause | Solution |
|---|---|---|
| `LeaderWorkerSet CRD not found` | LWS not installed | Install via Helm chart |
| Worker pods stuck in `Pending` | Insufficient GPU resources across nodes | Ensure that enough multi-GPU nodes are available |
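Both rows of the table can be checked from the command line. The following is a minimal sketch, assuming `kubectl` is configured for the target cluster and that the CRD is registered as `leaderworkersets.leaderworkerset.x-k8s.io`. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Assumed CRD name: plural resource name plus the LWS API group.
LWS_CRD="leaderworkersets.leaderworkerset.x-k8s.io"

# Succeeds only if the LeaderWorkerSet CRD is registered in the cluster.
lws_crd_installed() {
  kubectl get crd "$LWS_CRD" >/dev/null 2>&1
}

# Lists pods in the given namespace that are still in the Pending phase.
pending_pods() {
  local namespace="$1"
  kubectl get pods -n "$namespace" --field-selector=status.phase=Pending
}
```

If `pending_pods` returns worker pods, `kubectl describe pod` on one of them shows the scheduler events, which typically state whether GPU resources were the limiting factor.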
Compatibility Notes
- Namespace: LWS controller runs in `lws-system` namespace
- LLMIsvc: Required for LLMInferenceService multi-node deployments
- Replica calculation: Total replicas = `data` / `dataLocal` (e.g., `data` = 16 and `dataLocal` = 8 give 16 / 8 = 2 nodes)
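The replica arithmetic in the last item can be sketched as a small helper. This is illustrative only: the function name `total_replicas` is hypothetical, and rejecting values that do not divide evenly is an assumption rather than documented KServe behavior.

```shell
#!/usr/bin/env bash
# Sketch of the replica arithmetic: total replicas = data / dataLocal.
# Assumption: both values are positive integers and data is a multiple of dataLocal.
total_replicas() {
  local data="$1" data_local="$2"
  # The zero/negative checks run first, so the modulo never divides by zero.
  if [ "$data" -le 0 ] || [ "$data_local" -le 0 ] || [ $((data % data_local)) -ne 0 ]; then
    echo "data must be a positive multiple of dataLocal" >&2
    return 1
  fi
  echo $((data / data_local))
}
```

Calling `total_replicas 16 8` reproduces the example above and prints `2`.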