
Heuristic:KServe Multinode Replica Calculation

From Leeroopedia
Knowledge Sources
Domains Distributed_Computing, LLM_Serving
Last Updated 2026-02-13 14:00 GMT

Overview

The multi-node replica count is calculated as `replicas = data / dataLocal`, and large model deployments need an 80-minute initial startup delay in their health checks.

Description

When deploying large LLMs across multiple GPU nodes, the total replica count and per-node GPU allocation must be calculated from the data parallelism (DP) and data-local parallelism settings. The initial deployment includes a long startup time for model download, compilation, and weight loading that must be accounted for in health check configuration.

Usage

Use this heuristic when configuring multi-node LLMInferenceService deployments with data parallelism and/or expert parallelism.

The Insight (Rule of Thumb)

  • Action: Calculate total replicas using the formula: `replicas = data / dataLocal`
  • Value:
    • Example: `data=16, dataLocal=8` produces 2 replicas (2 nodes, 8 GPUs each)
    • Example: `data=32, dataLocal=8` produces 4 replicas (4 nodes, 8 GPUs each)
  • Trade-off: More replicas increase throughput linearly but require proportionally more GPU nodes.
  • Startup time: Set `initialDelaySeconds: 4800` (80 minutes) for large models like DeepSeek-R1 to allow for download, compilation, and initialization.
  • All-to-all backends:
    • `deepep_high_throughput`: Optimized for batch processing throughput
    • `pplx`: Optimized for low-latency decode operations
  • Per-worker resources: 8 GPUs, 128 CPU cores (limit), 512Gi memory (limit), 800Gi ephemeral storage, 1 RDMA resource
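The replica formula above can be sketched as a small helper that also validates the layout, since `data` must be an exact multiple of `dataLocal` (a minimal sketch; the function name is illustrative, not part of the KServe API):

```python
def multinode_replicas(data: int, data_local: int) -> int:
    """Number of nodes (replicas) for a given parallelism layout.

    data       -- total data-parallelism degree (GPUs across all nodes)
    data_local -- GPUs contributed by each node
    """
    if data_local <= 0:
        raise ValueError("dataLocal must be positive")
    if data % data_local != 0:
        raise ValueError(f"data={data} is not divisible by dataLocal={data_local}")
    return data // data_local

# The examples from the list above:
print(multinode_replicas(16, 8))  # 2 nodes, 8 GPUs each
print(multinode_replicas(32, 8))  # 4 nodes, 8 GPUs each
```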

Reasoning

The `data` parameter defines the total data parallelism degree (total GPUs across all nodes), while `dataLocal` defines how many GPUs each node contributes. The division gives the number of nodes (replicas) needed.

The 80-minute initial delay accounts for:

  1. Model weights download from storage (large models are 100GB+)
  2. CUDA kernel compilation for the specific GPU architecture
  3. Model weight sharding and distribution across GPUs
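As a rough sanity check on the 80-minute budget, the phases above can be estimated individually. This is a back-of-the-envelope sketch; the model size, bandwidth, and per-phase durations below are illustrative assumptions, not measured values:

```python
# Illustrative startup-time estimate for a large multi-node deployment.
# All input values are assumptions for the sketch, not measurements.
model_size_gb = 650     # e.g. a DeepSeek-R1-class checkpoint (doc says 100GB+)
download_gb_per_s = 0.5 # assumed effective throughput from object storage
compile_minutes = 15    # assumed CUDA kernel compilation time, first run
shard_minutes = 10      # assumed weight sharding/distribution across GPUs

download_minutes = model_size_gb / download_gb_per_s / 60
total_minutes = download_minutes + compile_minutes + shard_minutes
print(f"estimated startup: {total_minutes:.0f} min")  # comfortably under 80 min

# initialDelaySeconds should exceed the estimate with headroom:
initial_delay_seconds = 4800  # 80 minutes
assert total_minutes * 60 < initial_delay_seconds
```

Under these assumptions the estimate lands well inside the 4800-second window, which is why the sample manifest can afford a single long `initialDelaySeconds` rather than tuning per phase.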

Evidence from `docs/samples/llmisvc/dp-ep/deepseek-r1-gpu-rdma-roce/README.md`:

spec:
  workerSpec:
    data: 16
    dataLocal: 8
    # Total replicas = 16 / 8 = 2 nodes

  # Health check with extended startup time
  readinessProbe:
    initialDelaySeconds: 4800  # 80 minutes
