Implementation: allenai/open-instruct save_with_accelerate
| Knowledge Sources | |
|---|---|
| Domains | Machine Learning, Distributed Systems, MLOps |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
A concrete utility from the Open Instruct library for saving model checkpoints in distributed training environments.
Description
The save_with_accelerate() function handles the complexities of saving model checkpoints when training with HuggingFace Accelerate and DeepSpeed. It performs the following steps:
- Generation config setup: Sets the model's generation config to safe defaults. For OLMo models (detected by chat template name), it uses a specialized generation config with two EOS tokens (`<|im_end|>` and `<|endoftext|>`).
- Model unwrapping: Unwraps the model from Accelerate/DeepSpeed wrappers to get the underlying `PreTrainedModel`.
- Model attribute extraction: If `model_attribute_to_save` is specified, extracts a specific sub-model (e.g., the policy model from a PPO wrapper).
- State dict gathering: Uses `accelerator.get_state_dict(model)` to gather the full state dict from all distributed processes. The wrapped model must be passed (not the unwrapped one).
- State dict filtering: If saving a model attribute, filters the state dict to only include keys starting with the attribute name and strips the prefix.
- Saving: For LoRA models, uses PEFT's `save_pretrained()` on the main process only. For full models, uses HuggingFace's `save_pretrained()` with Accelerate's save function and non-safetensors format.
- Tokenizer saving: Saves the tokenizer to the same directory on the main process.
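The attribute-filtering step above can be sketched as follows. This is a hypothetical helper written for illustration, not the library's actual code; the real implementation lives in `open_instruct/model_utils.py`.

```python
def filter_state_dict_for_attribute(state_dict: dict, attribute: str) -> dict:
    """Keep only entries belonging to the named sub-model (e.g. "policy")
    and strip the prefix, so the sub-model can be saved as a standalone
    checkpoint whose keys match a plain PreTrainedModel."""
    prefix = attribute + "."
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }

# A PPO wrapper's state dict mixes policy and value-head weights;
# filtering by "policy" drops the value head and re-roots the keys.
full = {
    "policy.embed.weight": "w0",
    "policy.lm_head.weight": "w1",
    "value_head.weight": "w2",
}
policy_only = filter_state_dict_for_attribute(full, "policy")
# → {"embed.weight": "w0", "lm_head.weight": "w1"}
```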
Usage
Call this function at checkpoint intervals and at the end of training. It is called from the training loop in finetune.py and other training scripts (DPO, GRPO).
Code Reference
Source Location
- Repository: Open Instruct
- File: `open_instruct/model_utils.py`
- Lines: 475-535
Signature
def save_with_accelerate(
accelerator: Accelerator,
model: torch.nn.Module,
tokenizer: transformers.PreTrainedTokenizer,
output_dir: str,
use_lora: bool = False,
model_attribute_to_save: str | None = None,
chat_template_name: str = "tulu",
) -> None:
Import
from open_instruct.model_utils import save_with_accelerate
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| accelerator | Accelerator | Yes | The HuggingFace Accelerate instance managing distributed training state. |
| model | torch.nn.Module | Yes | The wrapped model (must be the Accelerate-wrapped version for correct state dict gathering). |
| tokenizer | PreTrainedTokenizer | Yes | The tokenizer to save alongside the model. |
| output_dir | str | Yes | Directory path where the checkpoint will be saved. |
| use_lora | bool | No | Whether the model uses LoRA adapters (changes save behavior). Defaults to False. |
| model_attribute_to_save | str or None | No | If set, saves only a specific attribute/sub-model (e.g., "policy" for PPO). Defaults to None (save full model). |
| chat_template_name | str | No | Chat template name used to detect OLMo models for generation config. Defaults to "tulu". |
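The OLMo detection implied by `chat_template_name` can be sketched as below. This is an assumed heuristic for illustration; the actual check in the library may match template names differently.

```python
def uses_olmo_generation_config(chat_template_name: str) -> bool:
    """Assumed heuristic: any chat template whose name mentions "olmo"
    gets the specialized generation config with two EOS tokens
    (<|im_end|> and <|endoftext|>); everything else gets the default."""
    return "olmo" in chat_template_name.lower()

print(uses_olmo_generation_config("olmo"))  # → True
print(uses_olmo_generation_config("tulu"))  # → False
```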
Outputs
| Name | Type | Description |
|---|---|---|
| (side effects) | None | Saves model weights, tokenizer files, and generation config to output_dir. Only the main process writes to disk. |
Saved Files
The function creates the following files in output_dir:
| File | Condition | Description |
|---|---|---|
| pytorch_model.bin (or sharded) | Full model (non-LoRA) | Full model weights in PyTorch format. |
| adapter_model.bin | LoRA model | Only the LoRA adapter weights. |
| adapter_config.json | LoRA model | LoRA configuration (rank, alpha, target modules). |
| config.json | Always | Model architecture configuration. |
| generation_config.json | Always | Generation parameters (temperature, top_p, EOS tokens). |
| tokenizer.json | Always | Tokenizer vocabulary and configuration. |
| tokenizer_config.json | Always | Tokenizer settings and chat template. |
| special_tokens_map.json | Always | Special token definitions. |
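A quick sanity check against the table above can be useful after a save completes. This is a hypothetical helper, not part of open-instruct; it only checks a conservative subset of files that should exist regardless of sharding.

```python
import os

# Files expected in every checkpoint, per the table above. pytorch_model.bin
# may be sharded and tokenizer.json depends on the tokenizer type, so only
# the unconditional JSON files are checked here.
ALWAYS_EXPECTED = {"config.json", "tokenizer_config.json", "special_tokens_map.json"}

def missing_checkpoint_files(output_dir: str, use_lora: bool = False) -> set:
    """Return the set of expected checkpoint files absent from output_dir."""
    expected = set(ALWAYS_EXPECTED)
    if use_lora:
        expected.add("adapter_config.json")
    return expected - set(os.listdir(output_dir))
```

Running this on the rank-0 process after `save_with_accelerate` returns can catch a silently failed write before the training job moves on.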
Usage Examples
Basic Usage
from open_instruct.model_utils import save_with_accelerate
# During training loop, at checkpoint step:
save_with_accelerate(
accelerator=accelerator,
model=model,
tokenizer=tokenizer,
output_dir="output/checkpoint-1000",
use_lora=False,
chat_template_name="tulu",
)
Saving LoRA Adapter
save_with_accelerate(
accelerator=accelerator,
model=model,
tokenizer=tokenizer,
output_dir="output/lora-checkpoint",
use_lora=True,
)
Saving a Sub-Model (PPO Policy)
save_with_accelerate(
accelerator=accelerator,
model=ppo_model,
tokenizer=tokenizer,
output_dir="output/policy-checkpoint",
model_attribute_to_save="policy",
)
Related Pages
Implements Principle