Heuristic: KServe Multinode Replica Calculation
| Knowledge Sources | |
|---|---|
| Domains | Distributed_Computing, LLM_Serving |
| Last Updated | 2026-02-13 14:00 GMT |
Overview
Multi-node replica counts are calculated as `replicas = data / dataLocal`, and large model deployments need roughly an 80-minute initial startup window.
Description
When deploying large LLMs across multiple GPU nodes, the total replica count and per-node GPU allocation must be calculated from the data parallelism (DP) and data-local parallelism settings. The initial deployment includes a long startup time for model download, compilation, and weight loading that must be accounted for in health check configuration.
Usage
Use this heuristic when configuring multi-node LLMInferenceService deployments with data parallelism and/or expert parallelism.
The Insight (Rule of Thumb)
- Action: Calculate total replicas using the formula: `replicas = data / dataLocal`
- Value:
  - Example: `data=16, dataLocal=8` produces 2 replicas (2 nodes, 8 GPUs each)
  - Example: `data=32, dataLocal=8` produces 4 replicas (4 nodes, 8 GPUs each)
- Trade-off: More replicas increase throughput linearly but require proportionally more GPU nodes.
- Startup time: Set `initialDelaySeconds: 4800` (80 minutes) for large models like DeepSeek-R1 to allow for download, compilation, and initialization.
- All-to-all backends:
  - `deepep_high_throughput`: optimized for batch-processing throughput
  - `pplx`: optimized for low-latency decode operations
- Per-worker resources: 8 GPUs, 128 CPU cores (limit), 512Gi memory (limit), 800Gi ephemeral storage, 1 RDMA resource
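The replica formula above can be sketched as a small helper. `calc_replicas` is a hypothetical name for illustration, not part of the KServe API; the divisibility check reflects the assumption that `data` must split evenly across nodes.

```python
def calc_replicas(data: int, data_local: int) -> int:
    """Number of worker nodes (replicas) for a multi-node deployment.

    `data` is the total data-parallelism degree (GPUs across all nodes);
    `data_local` is the number of GPUs each node contributes.
    """
    if data_local <= 0:
        raise ValueError("dataLocal must be positive")
    if data % data_local != 0:
        raise ValueError("data must be evenly divisible by dataLocal")
    return data // data_local

print(calc_replicas(16, 8))  # 2 replicas: 2 nodes, 8 GPUs each
print(calc_replicas(32, 8))  # 4 replicas: 4 nodes, 8 GPUs each
```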
Reasoning
The `data` parameter defines the total data parallelism degree (total GPUs across all nodes), while `dataLocal` defines how many GPUs each node contributes. The division gives the number of nodes (replicas) needed.
The 80-minute initial delay accounts for:
- Model weights download from storage (large models are 100GB+)
- CUDA kernel compilation for the specific GPU architecture
- Model weight sharding and distribution across GPUs
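A back-of-the-envelope calculation shows why an 80-minute (4800 s) budget is plausible. All figures below are illustrative assumptions (the document only states that weights exceed 100 GB); the point is that download time dominates and the budget should comfortably exceed the estimated total.

```python
# Rough startup-budget sanity check; every constant here is an assumption.
MODEL_SIZE_GB = 700        # assumed weight size ("100GB+" per the doc)
DOWNLOAD_GB_PER_S = 0.5    # assumed effective download rate from storage
COMPILE_SHARD_S = 1800     # assumed CUDA compilation + weight-sharding time

download_s = MODEL_SIZE_GB / DOWNLOAD_GB_PER_S   # 1400 s
total_s = download_s + COMPILE_SHARD_S           # 3200 s
budget_s = 4800                                  # initialDelaySeconds

print(f"estimated startup: {total_s:.0f} s, budget: {budget_s} s")
assert total_s <= budget_s, "increase initialDelaySeconds"
```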
Evidence from `docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/README.md`:
```yaml
spec:
  workerSpec:
    data: 16
    dataLocal: 8
    # Total replicas = 16 / 8 = 2 nodes
    # Health check with extended startup time
    readinessProbe:
      initialDelaySeconds: 4800  # 80 minutes
```