

Principle:Allenai Open instruct vLLM Weight Sync

From Leeroopedia


Knowledge Sources
Domains: Distributed Computing, Model Serving
Last Updated: 2026-02-07 00:00 GMT

Overview

vLLM weight synchronization is the process of transferring updated model parameters from training workers to inference engines without reinitializing the inference engine, enabling efficient on-policy generation in RL training.

Description

In GRPO training, the policy model exists in two places simultaneously:

  1. Training replicas managed by DeepSpeed across learner GPUs.
  2. Inference engines managed by vLLM across generation GPUs.

After each training step updates the policy weights, these updates must be propagated to the vLLM engines so that subsequent generations use the latest policy. Naive approaches (saving to disk and reloading, or reinitializing the engine) are prohibitively slow. Instead, the system performs in-place weight updates via direct GPU-to-GPU communication.

The synchronization process:

  1. The learner (rank 0) establishes a torch distributed process group that includes itself and all vLLM worker processes.
  2. For each named parameter in the model, rank 0 broadcasts the parameter tensor to all vLLM workers.
  3. Each vLLM worker receives the tensor and calls model.load_weights() to update the corresponding parameter in place.
  4. When DeepSpeed stage 3 is used, parameters must first be gathered (they are sharded across learner ranks) before broadcasting.
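The four steps above can be sketched as a pure-Python simulation, with plain dicts standing in for parameter tensors and a direct copy standing in for an NCCL broadcast. All class and function names here are illustrative, not open-instruct's actual API:

```python
# Pure-Python simulation of the sync loop: dicts stand in for GPU
# tensors, a direct copy stands in for an NCCL broadcast.
# All names are illustrative, not open-instruct's real API.

class VLLMWorkerSim:
    """Stand-in for a vLLM worker process holding model weights."""
    def __init__(self, params):
        self.params = dict(params)  # name -> weight (a float here)

    def load_weights(self, name, tensor):
        # vLLM's model.load_weights() updates the parameter in place;
        # here we simply overwrite the dict entry.
        self.params[name] = tensor

def sync_weights(learner_params, workers):
    """Rank 0 'broadcasts' each named parameter to every worker."""
    for name, tensor in learner_params.items():
        for worker in workers:  # stands in for one NCCL broadcast
            worker.load_weights(name, tensor)

# Usage: after a training step, the learner pushes fresh weights
# so the next generation round runs with the latest policy.
learner = {"layer.0.weight": 1.5, "layer.1.weight": -0.3}
workers = [VLLMWorkerSim({"layer.0.weight": 0.0, "layer.1.weight": 0.0})
           for _ in range(2)]
sync_weights(learner, workers)
```

In the real system, the inner copy is a single collective broadcast to all workers at once, not a per-worker loop; the simulation only preserves the data flow.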

Two synchronization backends are supported:

  • NCCL broadcast: Standard cross-GPU communication via NCCL collective operations. Works across nodes.
  • CUDA IPC: Shared-memory approach for same-node communication. Faster but limited to GPUs on the same machine.
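The backend choice described above reduces to a locality test; the function name and node labels below are hypothetical, for illustration only:

```python
def pick_sync_backend(learner_node, engine_node):
    """Prefer CUDA IPC when the learner and the vLLM engine share a
    machine; fall back to NCCL broadcast across nodes.
    (Illustrative helper, not part of open-instruct.)"""
    return "cuda_ipc" if learner_node == engine_node else "nccl"
```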

Usage

Weight synchronization occurs after every training step (or periodically, when asynchronous training is used). It is a critical bottleneck in the GRPO pipeline: its latency adds directly to per-step wall-clock time. Two options reduce the overhead: gather_whole_model, which gathers all parameters at once for a faster transfer at the cost of higher peak memory, and in-flight updates, which allow generation to continue while synchronization is underway.
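Where synchronization sits in the step loop can be sketched as follows. The loop body is a skeleton (event tuples stand in for real work), and `sync_every` models the periodic schedule used in asynchronous training:

```python
def run_training(num_steps, sync_every=1):
    """Skeleton GRPO loop: generation must see weights no staler
    than `sync_every` steps. Event tuples stand in for real work."""
    events = []
    for step in range(num_steps):
        events.append(("generate", step))  # vLLM engines produce rollouts
        events.append(("train", step))     # DeepSpeed updates the policy
        if (step + 1) % sync_every == 0:
            events.append(("sync", step))  # push weights to vLLM engines
    return events

# With sync_every=1 (fully on-policy), every step ends with a sync.
```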

Theoretical Basis

Weight synchronization forms a barrier in the training pipeline. Its cost is:

T_sync = T_gather + T_broadcast

T_gather:
  - Stage 0: 0 (parameters are already full)
  - Stage 3: O(P / W) where P = total params, W = number of learner GPUs
    (all-gather across learner group)

T_broadcast:
  - O(P) per engine (NCCL broadcast from rank 0 to each vLLM worker)
  - With tensor parallelism TP, each engine has TP workers
  - Total = O(P * num_engines * TP)
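The cost model above can be written out numerically. The unit ("parameters moved") is an illustrative normalization, not a measured latency:

```python
def sync_cost(P, W, num_engines, tp, zero_stage):
    """Relative cost of one weight sync under the model above.
    P: total parameter count, W: learner GPUs, tp: tensor-parallel
    degree per engine. Returns (T_gather, T_broadcast) in units of
    'parameters moved' (an illustrative normalization)."""
    t_gather = 0 if zero_stage == 0 else P / W  # all-gather across learners
    t_broadcast = P * num_engines * tp          # one broadcast per worker
    return t_gather, t_broadcast
```

For example, a 7B-parameter model on 8 learner GPUs under stage 3, feeding 4 engines at TP=2, pays a gather of P/8 and a broadcast of 8P.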

The gather_whole_model=True option trades memory for speed: it gathers all parameters into a single contiguous buffer before broadcasting, avoiding the overhead of per-parameter gather/broadcast cycles.
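The trade-off can be made concrete by counting collectives and peak buffer size under each mode (a toy model; per-collective launch latency is the quantity being amortized):

```python
def gather_plan(param_sizes, whole_model):
    """Toy model of gather_whole_model: returns (num_collectives,
    peak_buffer_elems). Per-parameter mode pays one gather/broadcast
    cycle per tensor; whole-model mode pays a single cycle but must
    hold every parameter in memory at once."""
    if whole_model:
        return 1, sum(param_sizes)             # one big contiguous buffer
    return len(param_sizes), max(param_sizes)  # one cycle per parameter
```

A transformer has hundreds of named parameters, so per-parameter mode pays hundreds of gather/broadcast launch overheads, while whole-model mode pays one at a peak memory cost of the full parameter set.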

The monkey-patching approach (WorkerWrap) is necessary because vLLM's internal worker processes do not natively support weight updates from external sources. The WorkerWrap class injects init_process_group() and update_weight() methods into vLLM's worker class, enabling the process group communication without modifying vLLM's source code.
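The monkey-patching pattern itself is ordinary Python: assign new methods onto an existing class object. The sketch below uses a stand-in Worker class rather than vLLM's real one, and the injected method bodies are simplified placeholders:

```python
class Worker:
    """Stand-in for vLLM's internal worker class (no native sync support)."""
    def __init__(self):
        self.params = {}

# Methods defined outside the class, then injected onto it, the way
# WorkerWrap extends vLLM's worker without modifying its source.
def init_process_group(self, rank, world_size):
    # Real code would call torch.distributed.init_process_group here.
    self.sync_rank, self.sync_world = rank, world_size

def update_weight(self, name, tensor):
    # Real code would receive a broadcast, then call model.load_weights.
    self.params[name] = tensor

Worker.init_process_group = init_process_group
Worker.update_weight = update_weight

# Usage: existing instances and future instances both gain the methods.
w = Worker()
w.init_process_group(rank=1, world_size=9)
w.update_weight("layer.0.weight", 0.25)
```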

Related Pages

Implemented By
