Principle:NVIDIA NeMo Aligner Hydra Training Configuration
| Principle: Hydra Training Configuration | |
|---|---|
| Type | Principle |
| Project | NVIDIA NeMo Aligner |
| Domains | Configuration_Management, MLOps |
| Related Implementations | Implementation:NVIDIA_NeMo_Aligner_Hydra_Config_Loading |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Pattern for declaratively specifying all training hyperparameters, model architecture settings, and data paths through hierarchical YAML configuration files.
Description
NeMo Aligner uses Hydra and OmegaConf for configuration management. Every training script is decorated with `@hydra_runner`, which loads a YAML config file and allows CLI overrides. The configuration hierarchy covers:
- Trainer settings -- number of GPUs, precision mode, max epochs/steps
- Model architecture -- tensor/pipeline parallelism degrees, micro/global batch sizes, hidden dimensions
- Optimizer -- learning rate, weight decay, scheduler type and warmup
- Data -- file paths, sequence length, data formats, number of workers
- Algorithm-specific parameters -- KL penalty coefficient, loss type, reward scaling
This pattern decouples hyperparameter specification from code, enabling reproducible experiments and easy parameter sweeps without modifying source files.
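A minimal sketch of the entrypoint pattern (the script and config names here are illustrative; `hydra_runner` is imported from `nemo.core.config`, as in NeMo's example scripts):

```python
from omegaconf import DictConfig, OmegaConf

# NeMo's Hydra wrapper: the decorator loads conf/sft_config.yaml
# and merges any CLI overrides before calling main().
from nemo.core.config import hydra_runner


@hydra_runner(config_path="conf", config_name="sft_config")
def main(cfg: DictConfig) -> None:
    # cfg is the fully merged config: YAML defaults plus CLI overrides.
    # Interpolations such as ${model.encoder_seq_length} resolve on access.
    print(OmegaConf.to_yaml(cfg, resolve=True))

    # Hyperparameters are read from cfg rather than hard-coded:
    print(f"lr={cfg.model.optim.lr}, max_steps={cfg.trainer.max_steps}")


if __name__ == "__main__":
    main()
```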
Usage
Use this pattern in every training script. The workflow is:
- Define sensible defaults in a YAML configuration file
- Override specific values via CLI arguments for individual experiments
- Use Hydra interpolation (`${...}`) for derived values that depend on other config entries
This is critical for managing the complexity of distributed training configurations, where tensor parallelism (TP), pipeline parallelism (PP), and data parallelism (DP) sizes must be coordinated along with mixed precision settings and gradient accumulation.
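As a concrete illustration of that coordination, a small worked example in plain Python (values match the sample config shown in the next section): the data-parallel size is the world size divided by TP * PP, and the gradient-accumulation factor follows from the batch sizes.

```python
# Consistency check for Megatron-style parallelism settings
# (a sketch; these invariants mirror what Megatron-style trainers enforce).
world_size = 8          # trainer.devices (per node) * number of nodes
tp = 2                  # model.tensor_model_parallel_size
pp = 1                  # model.pipeline_model_parallel_size
micro_batch = 4         # model.micro_batch_size
global_batch = 32       # model.global_batch_size

assert world_size % (tp * pp) == 0
dp = world_size // (tp * pp)          # data-parallel size: 8 // 2 = 4

assert global_batch % (micro_batch * dp) == 0
grad_accum = global_batch // (micro_batch * dp)   # 32 // 16 = 2 accumulation steps
print(f"DP={dp}, gradient accumulation steps={grad_accum}")
```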
Theoretical Basis
The principle is grounded in hierarchical configuration management. Hydra resolves config groups, interpolations, and CLI overrides into a single unified DictConfig object.
The resolution pattern follows this order:
1. YAML file defines default values for all parameters
2. The `@hydra_runner` decorator loads and parses the YAML
3. CLI overrides are merged on top of YAML defaults
4. OmegaConf resolves interpolations (e.g., `${model.hidden_size}`)
5. The final resolved DictConfig drives all training parameters
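This resolution order can be demonstrated with OmegaConf alone (a minimal sketch outside of Hydra; the keys are illustrative):

```python
from omegaconf import OmegaConf

# Step 1: YAML defaults, inlined here for brevity.
defaults = OmegaConf.create(
    {"model": {"encoder_seq_length": 4096,
               "data": {"seq_length": "${model.encoder_seq_length}"}}}
)

# Step 3: a CLI override merged on top of the defaults.
overrides = OmegaConf.from_dotlist(["model.encoder_seq_length=2048"])
cfg = OmegaConf.merge(defaults, overrides)

# Steps 4-5: the interpolation resolves against the merged config,
# so the derived value tracks the override automatically.
assert cfg.model.data.seq_length == 2048
```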
A typical configuration structure:
```yaml
trainer:
  devices: 8
  precision: bf16
  max_steps: 1000
model:
  micro_batch_size: 4
  global_batch_size: 32
  tensor_model_parallel_size: 2
  pipeline_model_parallel_size: 1
  encoder_seq_length: 4096
  data:
    data_path: /data/sft_train.jsonl
    seq_length: ${model.encoder_seq_length}
  optim:
    name: fused_adam
    lr: 1e-5
    weight_decay: 0.01
exp_manager:
  checkpoint_callback_params:
    save_top_k: 3
```
Practical Guide
To use Hydra configuration in a training workflow:
- Define a YAML file with sections for `trainer`, `exp_manager`, and `model` (including nested `data`, `optim`, and algorithm-specific parameters)
- Use Hydra interpolation (`${...}`) for values derived from other config entries to avoid duplication
- Override at the CLI for experiment variations:
```bash
python train_sft.py \
  model.optim.lr=1e-5 \
  trainer.max_steps=1000 \
  model.data.data_path=/new/data/path.jsonl
```
- Leverage config groups to swap entire subsections (e.g., different optimizer configs) without rewriting the full YAML
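A sketch of the config-group idea using Hydra's standard compose API (the directory layout and group names below are hypothetical):

```python
# Hypothetical layout:
#   conf/config.yaml            defaults list contains "- optim: fused_adam"
#   conf/optim/fused_adam.yaml
#   conf/optim/adamw.yaml
from hydra import compose, initialize

with initialize(config_path="conf", version_base=None):
    # Swap the whole optimizer subsection by naming a different group
    # option; no YAML files are edited.
    cfg = compose(config_name="config", overrides=["optim=adamw"])
    print(cfg.optim)
```

At the command line, the same swap is just `optim=adamw` appended to the training command.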