# Heuristic: facebookresearch/habitat-lab Resume State Config Override
| Knowledge Sources | |
|---|---|
| Domains | Debugging, Reinforcement_Learning |
| Last Updated | 2026-02-15 00:00 GMT |
## Overview
When resuming training, Habitat-Lab silently ignores new configuration overrides and uses the original training config; misunderstanding this causes confusing debugging sessions.
## Description
The training resume system has two potentially surprising behaviors. First, when `load_resume_state_config=True` (the default), resuming training with a different configuration silently uses the original config and ignores the new one. Second, when `load_resume_state_config=False`, resuming training raises a `FileExistsError` if the checkpoint folder already contains resume state. Both behaviors are safety mechanisms against accidental hyperparameter corruption, but they frequently catch new users off guard.
## Usage
Apply this knowledge whenever resuming a training run or troubleshooting unexpected training behavior after a restart. Common symptoms: changed hyperparameters (e.g. learning rate or batch size) having no effect after a resume, or an unexplained `FileExistsError` at training start.
## The Insight (Rule of Thumb)
- Action 1: When changing hyperparameters mid-training, use a new checkpoint folder instead of resuming with a different config in the old one.
- Action 2: When resuming identical training (after preemption), leave `load_resume_state_config=True` (default).
- Action 3: When you get `FileExistsError`, either delete the checkpoint folder or set `load_resume_state_config=True`.
- Trade-off: Safety vs flexibility. The defaults protect against accidental config corruption at the cost of requiring explicit actions for intentional changes.
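The actions above can be condensed into a small decision helper. This is a hypothetical sketch, not part of Habitat-Lab; it only models the two code paths quoted in the evidence below.

```python
def resume_behavior(resume_state_exists: bool,
                    load_resume_state_config: bool,
                    new_config_differs: bool) -> str:
    """Simplified model of Habitat-Lab's resume-state logic (sketch only)."""
    if not resume_state_exists:
        # Fresh checkpoint folder: the new config is used as-is.
        return "fresh run"
    if not load_resume_state_config:
        # Safety mechanism: refuses to clobber an existing experiment.
        return "FileExistsError"
    if new_config_differs:
        # Safety mechanism: original config wins, a warning is logged.
        return "original config used, new config ignored"
    return "seamless resume"

# The safe way to change hyperparameters is a new checkpoint folder,
# which makes resume_state_exists False for that folder.
```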
## Reasoning
Distributed training jobs on SLURM clusters are frequently preempted and restarted, so the resume system must let training continue seamlessly without human intervention. Honoring config changes during resume would be dangerous in automated requeue scenarios, where the original config should always be used. The `FileExistsError` guards against a separate scenario: a user accidentally starting a new training run that would overwrite an existing experiment.
Code evidence from `habitat-baselines/habitat_baselines/common/base_trainer.py:49-58`:

```python
if self.config.habitat_baselines.load_resume_state_config:
    if self.config != resume_state_config:
        logger.warning(
            "\n##################\n"
            "You are attempting to resume training with a different "
            "configuration than the one used for the original training run. "
            "Since load_resume_state_config=True, the ORIGINAL configuration "
            "will be used and the new configuration will be IGNORED."
            "##################\n"
        )
```
Code evidence from `habitat-baselines/habitat_baselines/rl/ppo/ppo_trainer.py:177-180`:

```python
if resume_state is not None:
    if not self.config.habitat_baselines.load_resume_state_config:
        raise FileExistsError(
            f"The configuration provided has "
            f"habitat_baselines.load_resume_state_config=False but a "
            f"previous training run exists. You can either delete the "
            f"checkpoint folder {self.config.habitat_baselines.checkpoint_folder}, "
            f"or change the configuration key "
            f"habitat_baselines.checkpoint_folder in your new run."
        )
```
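In practice, Action 1 usually comes down to a Hydra command-line override pointing at a fresh folder. This is a sketch assuming the standard `habitat_baselines.run` entry point; the config name, folder paths, and the `lr` key are illustrative and may differ across Habitat-Lab versions.

```shell
# Resume after preemption: same config, same folder (defaults apply,
# load_resume_state_config=True, original config is used).
python -m habitat_baselines.run --config-name=pointnav/ppo_pointnav \
    habitat_baselines.checkpoint_folder=data/ckpt/run_a

# Change hyperparameters: point at a NEW folder so no resume state is
# found there and the new config takes effect.
python -m habitat_baselines.run --config-name=pointnav/ppo_pointnav \
    habitat_baselines.checkpoint_folder=data/ckpt/run_b \
    habitat_baselines.rl.ppo.lr=1e-4
```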