
Heuristic:KServe Multinode Replica Calculation

From Leeroopedia
Knowledge Sources
Domains Distributed_Computing, LLM_Serving
Last Updated 2026-02-13 14:00 GMT

Overview

The multi-node replica count is calculated as `replicas = data / dataLocal`, and large model deployments need an 80-minute initial startup delay in their health checks.

Description

When deploying large LLMs across multiple GPU nodes, the total replica count and per-node GPU allocation must be calculated from the data parallelism (DP) and data-local parallelism settings. The initial deployment includes a long startup time for model download, compilation, and weight loading that must be accounted for in health check configuration.

Usage

Use this heuristic when configuring multi-node LLMInferenceService deployments with data parallelism and/or expert parallelism.

The Insight (Rule of Thumb)

  • Action: Calculate total replicas using the formula: `replicas = data / dataLocal`
  • Value:
    • Example: `data=16, dataLocal=8` produces 2 replicas (2 nodes, 8 GPUs each)
    • Example: `data=32, dataLocal=8` produces 4 replicas (4 nodes, 8 GPUs each)
  • Trade-off: More replicas increase throughput linearly but require proportionally more GPU nodes.
  • Startup time: Set `initialDelaySeconds: 4800` (80 minutes) for large models like DeepSeek-R1 to allow for download, compilation, and initialization.
  • All-to-all backends:
    • `deepep_high_throughput`: Optimized for batch processing throughput
    • `pplx`: Optimized for low-latency decode operations
  • Per-worker resources: 8 GPUs, 128 CPU cores (limit), 512Gi memory (limit), 800Gi ephemeral storage, 1 RDMA resource
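The replica formula above can be sketched as a small helper that also validates the layout, since `data` must be an exact multiple of `dataLocal` (a minimal sketch; the function name is illustrative, not part of the KServe API):

```python
def multinode_replicas(data: int, data_local: int) -> int:
    """Number of nodes (replicas) for a given parallelism layout.

    data       -- total data-parallelism degree (GPUs across all nodes)
    data_local -- GPUs contributed by each node
    """
    if data_local <= 0:
        raise ValueError("dataLocal must be positive")
    if data % data_local != 0:
        raise ValueError(f"data={data} is not divisible by dataLocal={data_local}")
    return data // data_local

# The examples from the list above:
print(multinode_replicas(16, 8))  # 2 nodes, 8 GPUs each
print(multinode_replicas(32, 8))  # 4 nodes, 8 GPUs each
```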

Reasoning

The `data` parameter defines the total data parallelism degree (total GPUs across all nodes), while `dataLocal` defines how many GPUs each node contributes. The division gives the number of nodes (replicas) needed.

The 80-minute initial delay accounts for:

  1. Model weights download from storage (large models are 100GB+)
  2. CUDA kernel compilation for the specific GPU architecture
  3. Model weight sharding and distribution across GPUs
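As a rough sanity check on the 80-minute budget, the phases above can be estimated individually. This is a back-of-the-envelope sketch; the model size, bandwidth, and per-phase durations below are illustrative assumptions, not measured values:

```python
# Illustrative startup-time estimate for a large multi-node deployment.
# All input values are assumptions for the sketch, not measurements.
model_size_gb = 650     # e.g. a DeepSeek-R1-class checkpoint (doc says 100GB+)
download_gb_per_s = 0.5 # assumed effective throughput from object storage
compile_minutes = 15    # assumed CUDA kernel compilation time, first run
shard_minutes = 10      # assumed weight sharding/distribution across GPUs

download_minutes = model_size_gb / download_gb_per_s / 60
total_minutes = download_minutes + compile_minutes + shard_minutes
print(f"estimated startup: {total_minutes:.0f} min")  # comfortably under 80 min

# initialDelaySeconds should exceed the estimate with headroom:
initial_delay_seconds = 4800  # 80 minutes
assert total_minutes * 60 < initial_delay_seconds
```

Under these assumptions the estimate lands well inside the 4800-second window, which is why the sample manifest can afford a single long `initialDelaySeconds` rather than tuning per phase.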

Evidence from `docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/README.md`:

spec:
  workerSpec:
    data: 16
    dataLocal: 8
    # Total replicas = 16 / 8 = 2 nodes

  # Health check with extended startup time
  readinessProbe:
    initialDelaySeconds: 4800  # 80 minutes
