
Implementation:Allenai Open instruct Save With Accelerate

From Leeroopedia


Knowledge Sources
Domains Machine Learning, Distributed Systems, MLOps
Last Updated 2026-02-07 00:00 GMT

Overview

A concrete utility provided by the Open Instruct library for saving model checkpoints in distributed training environments.

Description

The save_with_accelerate() function handles the complexities of saving model checkpoints when training with HuggingFace Accelerate and DeepSpeed. It performs the following steps:

  1. Generation config setup: Sets the model's generation config to safe defaults. For OLMo models (detected by chat template name), it uses a specialized generation config with two EOS tokens (<|im_end|> and <|endoftext|>).
  2. Model unwrapping: Unwraps the model from Accelerate/DeepSpeed wrappers to get the underlying PreTrainedModel.
  3. Model attribute extraction: If model_attribute_to_save is specified, extracts a specific sub-model (e.g., the policy model from a PPO wrapper).
  4. State dict gathering: Uses accelerator.get_state_dict(model) to gather the full state dict from all distributed processes. The wrapped model must be passed (not the unwrapped one).
  5. State dict filtering: If saving a model attribute, filters the state dict to only include keys starting with the attribute name and strips the prefix.
  6. Saving: For LoRA models, uses PEFT's save_pretrained() on the main process only. For full models, uses HuggingFace's save_pretrained() with Accelerate's save function and non-safetensors format.
  7. Tokenizer saving: Saves the tokenizer to the same directory on the main process.
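
The filtering in steps 5 and 6 can be illustrated with a minimal, framework-free sketch: only keys under the chosen attribute are kept, and the prefix is stripped. The helper name and example keys below are illustrative assumptions, not part of the Open Instruct API.

```python
# Sketch of the state-dict filtering step (steps 5-6 above).
# Plain dicts stand in for a real PyTorch state dict; the helper name
# and example keys are illustrative, not part of Open Instruct.

def filter_state_dict(state_dict: dict, attribute: str) -> dict:
    """Keep only keys under `attribute.` and strip that prefix."""
    prefix = attribute + "."
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }

full_state = {
    "policy.lm_head.weight": "w0",
    "policy.model.embed_tokens.weight": "w1",
    "value_head.weight": "w2",  # dropped: not under the "policy" prefix
}
print(filter_state_dict(full_state, "policy"))
# {'lm_head.weight': 'w0', 'model.embed_tokens.weight': 'w1'}
```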

Usage

Call this function at checkpoint intervals and at the end of training. It is called from the training loop in finetune.py and other training scripts (DPO, GRPO).
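
A training loop typically gates the call on a step counter. The sketch below shows one hypothetical way to decide when to checkpoint; the `should_checkpoint` helper and its arguments are illustrative assumptions, not part of finetune.py.

```python
# Hypothetical checkpoint-interval guard for a training loop.
# `should_checkpoint` is an illustrative helper, not library API.

def should_checkpoint(step: int, save_interval: int, total_steps: int) -> bool:
    """True at every `save_interval`-th step and at the final step."""
    return step % save_interval == 0 or step == total_steps

# Steps at which save_with_accelerate would be called in a 10-step run:
steps_to_save = [s for s in range(1, 11) if should_checkpoint(s, 4, 10)]
print(steps_to_save)
# [4, 8, 10]
```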

Code Reference

Source Location

  • Repository: Open Instruct
  • File: open_instruct/model_utils.py
  • Lines: L475-535

Signature

def save_with_accelerate(
    accelerator: Accelerator,
    model: torch.nn.Module,
    tokenizer: transformers.PreTrainedTokenizer,
    output_dir: str,
    use_lora: bool = False,
    model_attribute_to_save: str | None = None,
    chat_template_name: str = "tulu",
) -> None:

Import

from open_instruct.model_utils import save_with_accelerate

I/O Contract

Inputs

  • accelerator (Accelerator, required): The HuggingFace Accelerate instance managing distributed training state.
  • model (torch.nn.Module, required): The wrapped model (must be the Accelerate-wrapped version for correct state dict gathering).
  • tokenizer (PreTrainedTokenizer, required): The tokenizer to save alongside the model.
  • output_dir (str, required): Directory path where the checkpoint will be saved.
  • use_lora (bool, optional, default False): Whether the model uses LoRA adapters (changes save behavior).
  • model_attribute_to_save (str or None, optional, default None): If set, saves only a specific attribute/sub-model (e.g., "policy" for PPO); None saves the full model.
  • chat_template_name (str, optional, default "tulu"): Chat template name used to detect OLMo models for the generation config.

Outputs

  • Return value (None): The function returns nothing; as a side effect it saves model weights, tokenizer files, and generation config to output_dir. Only the main process writes to disk.

Saved Files

The function creates the following files in output_dir:

  • pytorch_model.bin, possibly sharded (full model, non-LoRA): Full model weights in PyTorch format.
  • adapter_model.bin (LoRA model): Only the LoRA adapter weights.
  • adapter_config.json (LoRA model): LoRA configuration (rank, alpha, target modules).
  • config.json (always): Model architecture configuration.
  • generation_config.json (always): Generation parameters (temperature, top_p, EOS tokens).
  • tokenizer.json (always): Tokenizer vocabulary and configuration.
  • tokenizer_config.json (always): Tokenizer settings and chat template.
  • special_tokens_map.json (always): Special token definitions.
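
A small post-save sanity check can confirm that a full (non-LoRA) checkpoint directory contains the files listed above. The helper name and exact file list below are assumptions drawn from this table, not an Open Instruct API.

```python
# Illustrative post-save sanity check for a checkpoint directory.
# The helper and file list are assumptions based on the table above,
# not part of Open Instruct.
import os
import tempfile

EXPECTED_FILES = [
    "config.json",
    "generation_config.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
]

def missing_checkpoint_files(output_dir: str) -> list:
    """Return the expected files that are absent from output_dir."""
    return [f for f in EXPECTED_FILES
            if not os.path.exists(os.path.join(output_dir, f))]

# Demo against a throwaway directory with one file deliberately missing.
with tempfile.TemporaryDirectory() as d:
    for name in EXPECTED_FILES[:-1]:
        open(os.path.join(d, name), "w").close()
    print(missing_checkpoint_files(d))
# ['special_tokens_map.json']
```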

Usage Examples

Basic Usage

from open_instruct.model_utils import save_with_accelerate

# During training loop, at checkpoint step:
save_with_accelerate(
    accelerator=accelerator,
    model=model,
    tokenizer=tokenizer,
    output_dir="output/checkpoint-1000",
    use_lora=False,
    chat_template_name="tulu",
)

Saving LoRA Adapter

save_with_accelerate(
    accelerator=accelerator,
    model=model,
    tokenizer=tokenizer,
    output_dir="output/lora-checkpoint",
    use_lora=True,
)

Saving a Sub-Model (PPO Policy)

save_with_accelerate(
    accelerator=accelerator,
    model=ppo_model,
    tokenizer=tokenizer,
    output_dir="output/policy-checkpoint",
    model_attribute_to_save="policy",
)
