Principle: BigScience Workshop Petals Optimizer Setup
| Knowledge Sources | Details |
|---|---|
| Domains | Deep_Learning, Optimization, Training |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Configuring the optimizer and learning rate scheduler for training only the locally-trainable parameters (prompt embeddings, classification head) in a distributed Petals model.
Description
Optimizer Setup configures the gradient descent optimization for prompt tuning with distributed models. The key constraint is that only local parameters are optimized — the remote transformer blocks are frozen.
Trainable parameters:
- prompt_embeddings.weight (if ptune/deep_ptune enabled)
- intermediate_prompt_embeddings.weight (if deep_ptune)
- score.weight and score.bias (classification head, for classification tasks)
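To make the parameter split above concrete, here is a minimal sketch of inspecting which parameters remain trainable after the remote blocks are frozen. `TinyPromptModel` is a hypothetical stand-in for a Petals model with prompt tuning enabled; the real model exposes analogous `prompt_embeddings` and `score` modules while its transformer blocks live on remote servers.

```python
# Sketch: listing trainable parameters after freezing the "remote" blocks.
# TinyPromptModel is a stand-in, not the real Petals model class.
import torch.nn as nn

class TinyPromptModel(nn.Module):
    def __init__(self, pre_seq_len=8, hidden=16, num_labels=2):
        super().__init__()
        self.prompt_embeddings = nn.Embedding(pre_seq_len, hidden)  # trainable
        self.score = nn.Linear(hidden, num_labels)                  # trainable head
        self.frozen_block = nn.Linear(hidden, hidden)               # stands in for remote blocks
        self.frozen_block.requires_grad_(False)                     # remote weights are frozen

model = TinyPromptModel()
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # prompt_embeddings.weight, score.weight, score.bias
```

Filtering by `requires_grad` this way is exactly how the optimizer's parameter list is built below.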
Optimizer choice: AdamW is used as it provides decoupled weight decay, which is important for the small number of trainable parameters in prompt tuning.
Learning rate: Typically higher than for full fine-tuning (1e-3 to 1e-2), since only a few parameters are being optimized and the gradient signal must pass through many frozen layers.
Scheduler: A linear warmup-then-decay schedule helps stabilize early training when gradients may be noisy from the distributed forward/backward.
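The warmup-then-decay shape can be written as a simple function of the step count. This is a sketch of the schedule only (the multiplier applied to the base learning rate), with the warmup and total step counts chosen to match the snippet later in this document:

```python
# Sketch of the linear warmup-then-decay schedule: the LR ramps from 0 to
# base_lr over `warmup` steps, then decays linearly to 0 at `total` steps.
def linear_schedule(step, warmup=100, total=1000, base_lr=1e-3):
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))

print(linear_schedule(50))   # mid-warmup: 5e-4
print(linear_schedule(100))  # peak: 1e-3
print(linear_schedule(550))  # mid-decay: 5e-4
```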
Usage
Use this principle after loading a distributed model with prompt tuning enabled and before starting the training loop. Only include parameters where requires_grad=True.
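The step ordering in the training loop matters: the scheduler advances once per optimizer step. The sketch below shows that ordering with a tiny stand-in model and a synthetic batch; in actual use the model would be the distributed Petals model and the batches would come from a real dataloader.

```python
# Sketch of the per-step order: zero_grad -> backward -> optimizer.step -> scheduler.step.
# The Linear model and random batch are stand-ins for a Petals model and real data.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(4, 2)
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
warmup, total = 10, 100
scheduler = LambdaLR(
    optimizer,
    lambda s: min(s / warmup, (total - s) / (total - warmup)) if s < total else 0.0,
)

x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
for step in range(3):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()      # with Petals, gradients reach only the local (unfrozen) params
    optimizer.step()
    scheduler.step()     # advance the warmup/decay schedule once per optimizer step
print(scheduler.get_last_lr())
```

Calling `scheduler.step()` after every `optimizer.step()` (rather than once per epoch) is what makes the warmup counts in `num_warmup_steps` refer to optimizer steps.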
Theoretical Basis
AdamW update rule:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat m_t &= m_t / (1-\beta_1^t), \qquad \hat v_t = v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
$$

where $\lambda$ is the weight decay coefficient, applied separately from the adaptive step.
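To see the decoupled decay term in isolation, a single AdamW step at $t = 1$ can be computed by hand. This is a numerical sketch of the update rule above, with arbitrary illustrative values for the parameter, gradient, and hyperparameters:

```python
# Sketch: one AdamW step at t=1, showing the decoupled decay term
# eta * lambda * theta added outside the adaptive (Adam) step.
import math

theta, g = 1.0, 0.5      # parameter and its gradient (illustrative values)
eta, lam = 1e-3, 0.01    # learning rate and weight decay coefficient
b1, b2, eps = 0.9, 0.999, 1e-8

m = (1 - b1) * g                 # first-moment estimate (m_0 = 0)
v = (1 - b2) * g * g             # second-moment estimate (v_0 = 0)
m_hat = m / (1 - b1)             # bias correction at t = 1
v_hat = v / (1 - b2)
theta_new = theta - eta * (m_hat / (math.sqrt(v_hat) + eps) + lam * theta)
print(theta_new)  # ~0.99899: adaptive step ~1e-3 plus decay step 1e-5
```

With L2 regularization folded into the gradient instead, the decay term would be rescaled by the adaptive denominator; keeping it decoupled is what AdamW contributes.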
```python
# Optimizer setup for prompt tuning: optimize only the locally trainable parameters
from torch.optim import AdamW
from transformers import get_scheduler

trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable_params, lr=1e-3, weight_decay=0.0)
scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=100, num_training_steps=1000
)
```