Principle: BigScience Workshop Petals Optimizer Setup
| Knowledge Sources | Details |
|---|---|
| Domains | Deep_Learning, Optimization, Training |
| Last Updated | 2026-02-09 14:00 GMT |
Overview
Configuring the optimizer and learning rate scheduler for training only the locally-trainable parameters (prompt embeddings, classification head) in a distributed Petals model.
Description
Optimizer Setup configures the gradient descent optimization for prompt tuning with distributed models. The key constraint is that only local parameters are optimized — the remote transformer blocks are frozen.
Trainable parameters:
- prompt_embeddings.weight (if ptune/deep_ptune enabled)
- intermediate_prompt_embeddings.weight (if deep_ptune)
- score.weight and score.bias (classification head, for classification tasks)
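To make the parameter split above concrete, here is a minimal sketch of inspecting which parameters remain trainable after the remote blocks are frozen. `TinyPromptModel` is a hypothetical stand-in for a Petals model with prompt tuning enabled; the real model exposes analogous `prompt_embeddings` and `score` modules while its transformer blocks live on remote servers.

```python
# Sketch: listing trainable parameters after freezing the "remote" blocks.
# TinyPromptModel is a stand-in, not the real Petals model class.
import torch.nn as nn

class TinyPromptModel(nn.Module):
    def __init__(self, pre_seq_len=8, hidden=16, num_labels=2):
        super().__init__()
        self.prompt_embeddings = nn.Embedding(pre_seq_len, hidden)  # trainable
        self.score = nn.Linear(hidden, num_labels)                  # trainable head
        self.frozen_block = nn.Linear(hidden, hidden)               # stands in for remote blocks
        self.frozen_block.requires_grad_(False)                     # remote weights are frozen

model = TinyPromptModel()
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # prompt_embeddings.weight, score.weight, score.bias
```

Filtering by `requires_grad` this way is exactly how the optimizer's parameter list is built below.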
Optimizer choice: AdamW is used as it provides decoupled weight decay, which is important for the small number of trainable parameters in prompt tuning.
Learning rate: Typically higher than for full fine-tuning (1e-3 to 1e-2), since only a few parameters are being optimized and the gradient signal must pass through many frozen layers.
Scheduler: A linear warmup-then-decay schedule helps stabilize early training when gradients may be noisy from the distributed forward/backward.
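The warmup-then-decay shape can be written as a simple function of the step count. This is a sketch of the schedule only (the multiplier applied to the base learning rate), with the warmup and total step counts chosen to match the snippet later in this document:

```python
# Sketch of the linear warmup-then-decay schedule: the LR ramps from 0 to
# base_lr over `warmup` steps, then decays linearly to 0 at `total` steps.
def linear_schedule(step, warmup=100, total=1000, base_lr=1e-3):
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * max(0.0, (total - step) / (total - warmup))

print(linear_schedule(50))   # mid-warmup: 5e-4
print(linear_schedule(100))  # peak: 1e-3
print(linear_schedule(550))  # mid-decay: 5e-4
```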
Usage
Use this principle after loading a distributed model with prompt tuning enabled and before starting the training loop. Only include parameters where requires_grad=True.
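The step ordering in the training loop matters: the scheduler advances once per optimizer step. The sketch below shows that ordering with a tiny stand-in model and a synthetic batch; in actual use the model would be the distributed Petals model and the batches would come from a real dataloader.

```python
# Sketch of the per-step order: zero_grad -> backward -> optimizer.step -> scheduler.step.
# The Linear model and random batch are stand-ins for a Petals model and real data.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(4, 2)
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.0)
warmup, total = 10, 100
scheduler = LambdaLR(
    optimizer,
    lambda s: min(s / warmup, (total - s) / (total - warmup)) if s < total else 0.0,
)

x, y = torch.randn(8, 4), torch.randint(0, 2, (8,))
for step in range(3):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()      # with Petals, gradients reach only the local (unfrozen) params
    optimizer.step()
    scheduler.step()     # advance the warmup/decay schedule once per optimizer step
print(scheduler.get_last_lr())
```

Calling `scheduler.step()` after every `optimizer.step()` (rather than once per epoch) is what makes the warmup counts in `num_warmup_steps` refer to optimizer steps.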
Theoretical Basis
AdamW update rule:

$$
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat m_t &= m_t / (1-\beta_1^t), \qquad \hat v_t = v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda\, \theta_{t-1} \right)
\end{aligned}
$$

where $\lambda$ is the weight decay coefficient, applied separately from the adaptive step.
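To see the decoupled decay term in isolation, a single AdamW step at $t = 1$ can be computed by hand. This is a numerical sketch of the update rule above, with arbitrary illustrative values for the parameter, gradient, and hyperparameters:

```python
# Sketch: one AdamW step at t=1, showing the decoupled decay term
# eta * lambda * theta added outside the adaptive (Adam) step.
import math

theta, g = 1.0, 0.5      # parameter and its gradient (illustrative values)
eta, lam = 1e-3, 0.01    # learning rate and weight decay coefficient
b1, b2, eps = 0.9, 0.999, 1e-8

m = (1 - b1) * g                 # first-moment estimate (m_0 = 0)
v = (1 - b2) * g * g             # second-moment estimate (v_0 = 0)
m_hat = m / (1 - b1)             # bias correction at t = 1
v_hat = v / (1 - b2)
theta_new = theta - eta * (m_hat / (math.sqrt(v_hat) + eps) + lam * theta)
print(theta_new)  # ~0.99899: adaptive step ~1e-3 plus decay step 1e-5
```

With L2 regularization folded into the gradient instead, the decay term would be rescaled by the adaptive denominator; keeping it decoupled is what AdamW contributes.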
```python
# Optimizer setup for prompt tuning: optimize only the locally trainable parameters
from torch.optim import AdamW
from transformers import get_scheduler

trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = AdamW(trainable_params, lr=1e-3, weight_decay=0.0)
scheduler = get_scheduler(
    "linear", optimizer=optimizer, num_warmup_steps=100, num_training_steps=1000
)
```