Principle: AllenAI open-instruct Model Configuration
| Knowledge Sources | Details |
|---|---|
| Domains | Machine Learning, Software Engineering, MLOps |
| Last Updated | 2026-02-07 00:00 GMT |
Overview
Model configuration is the practice of encapsulating all model-related hyperparameters and settings into a structured, serializable object to enable reproducible and configurable training.
Description
Training a large language model requires specifying numerous configuration choices: the model checkpoint, attention implementation, precision, gradient checkpointing, and parameter-efficient fine-tuning settings. Scattering these across different configuration files, command-line arguments, and code paths makes experiments hard to reproduce and modify. A structured model configuration solves this by consolidating all settings into a single typed dataclass.
Key configuration dimensions include:
Model identity: The model_name_or_path and model_revision together identify a unique model checkpoint. Using a specific revision (commit hash) ensures exact reproducibility.
Precision and attention: The dtype field controls the numerical precision (e.g., bfloat16, float32). The attn_implementation field selects between Flash Attention 2 (memory-efficient, requires flash-attn package) and SDPA (PyTorch native, no extra dependencies).
Memory optimization: gradient_checkpointing trades compute for memory by recomputing intermediate activations during the backward pass instead of storing them. This is essential for training large models on limited GPU memory. Note that gradient checkpointing is incompatible with use_cache=True.
Parameter-efficient fine-tuning (PEFT): LoRA configuration includes the rank (lora_r), scaling factor (lora_alpha), dropout rate (lora_dropout), target modules (which layers to apply LoRA to), and task type. These settings control the trade-off between training efficiency and model capacity.
Quantization: For QLoRA workflows, settings for 4-bit or 8-bit loading, quantization type (NF4 or FP4), and nested quantization are included.
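The dimensions above can be collected into a single typed dataclass. The following is an illustrative sketch: the field names follow the conventions used in this article (`model_name_or_path`, `attn_implementation`, `lora_r`, etc.), but the exact fields and defaults in a real training codebase may differ.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ModelConfig:
    # Model identity: name plus a pinned revision identify an exact checkpoint.
    model_name_or_path: str = "meta-llama/Llama-2-7b-hf"  # hypothetical default
    model_revision: str = "main"
    # Precision and attention implementation.
    dtype: str = "bfloat16"
    attn_implementation: str = "flash_attention_2"  # or "sdpa"
    # Memory optimization: recompute activations during the backward pass.
    gradient_checkpointing: bool = True
    # Parameter-efficient fine-tuning (LoRA).
    use_lora: bool = False
    lora_r: int = 16
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    lora_target_modules: List[str] = field(
        default_factory=lambda: ["q_proj", "k_proj", "v_proj", "o_proj"]
    )
    # Quantization (QLoRA workflows).
    load_in_4bit: bool = False
    bnb_4bit_quant_type: str = "nf4"  # or "fp4"
    use_nested_quant: bool = False
```

Because every field is a plain scalar or list, the whole object can be serialized to JSON or YAML and attached to a training run.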
Usage
Use a structured model configuration whenever setting up a training experiment. It should be the single source of truth for all model-related settings and should be serialized alongside training logs for reproducibility.
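A minimal round-trip sketch of serializing such a config next to the training logs (the trimmed `ModelConfig` and the helper names here are illustrative, not a specific project's API):

```python
import dataclasses
import json
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Trimmed illustrative config; a real one carries all the fields above.
    model_name_or_path: str = "meta-llama/Llama-2-7b-hf"
    model_revision: str = "main"
    dtype: str = "bfloat16"
    gradient_checkpointing: bool = True

def save_config(cfg: ModelConfig, path: str) -> None:
    # asdict() works because every field is a plain JSON-serializable type.
    with open(path, "w") as f:
        json.dump(dataclasses.asdict(cfg), f, indent=2)

def load_config(path: str) -> ModelConfig:
    with open(path) as f:
        return ModelConfig(**json.load(f))
```

Saving the config at the start of every run means any experiment can be reconstructed from its log directory alone.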
Theoretical Basis
Gradient checkpointing trades computation for memory:
Standard: Memory ~ O(N * L) where N = batch_size, L = num_layers
Each layer stores activations for the backward pass.
Checkpointed: Memory ~ O(N * sqrt(L))
Only sqrt(L) checkpoint layers store activations.
Other activations are recomputed during backward pass.
Computation increases by ~33%.
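The memory trade-off above can be checked with simple arithmetic. This sketch counts activation memory in abstract "one layer's activations" units, ignoring hidden size and sequence length, which scale both cases equally:

```python
import math

def activation_memory(batch_size: int, num_layers: int, checkpointed: bool) -> float:
    """Relative activation-memory cost, in units of one layer's activations."""
    if checkpointed:
        # Only ~sqrt(L) checkpoint layers keep activations resident;
        # the rest are recomputed during the backward pass.
        return batch_size * math.sqrt(num_layers)
    # Standard backprop stores activations for every layer.
    return batch_size * num_layers

# 32-layer model, batch of 8: 256 units standard vs ~45 units checkpointed.
standard = activation_memory(8, 32, checkpointed=False)
checkpointed = activation_memory(8, 32, checkpointed=True)
```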
LoRA parameter count:
Full model parameters: P_full = total number of weights across all layers
LoRA parameters: P_lora = 2 * r * d * n_target_modules * n_layers (assuming d x d target matrices, each adapted by an r x d matrix A and a d x r matrix B)
For a 7B model with r=16, targeting 7 modules per layer, 32 layers:
P_lora = 2 * 16 * 4096 * 7 * 32 = ~29M (0.4% of 7B)
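The worked example above as executable arithmetic (this assumes square d x d target modules for simplicity; real attention and MLP projections vary in shape):

```python
def lora_param_count(r: int, d: int, n_target_modules: int, n_layers: int) -> int:
    # Each adapted d x d weight adds A (r x d) and B (d x r): 2 * r * d params.
    return 2 * r * d * n_target_modules * n_layers

# 7B-class model: hidden size 4096, 7 target modules per layer, 32 layers.
p_lora = lora_param_count(r=16, d=4096, n_target_modules=7, n_layers=32)
fraction = p_lora / 7e9  # share of the full 7B parameter count
```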
Effective LoRA weight: The LoRA scaling factor determines how much the adaptation affects the output:
delta_W = (alpha / r) * B @ A
A higher alpha/r ratio means larger updates from the LoRA adapter.
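The scaling formula can be demonstrated with toy matrices. The standard LoRA initialization (A random, B zero, so the adapter starts as a no-op) is assumed here; the sizes are deliberately tiny:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 8, 2, 4  # toy sizes; real models use d in the thousands

A = rng.normal(size=(r, d))  # A is initialized from a Gaussian
B = np.zeros((d, r))         # B starts at zero, so delta_W starts at zero

# delta_W = (alpha / r) * B @ A, the effective update added to the frozen weight.
delta_W = (alpha / r) * (B @ A)
```

Because B is zero at initialization, the adapted model initially behaves exactly like the base model; training moves B away from zero, and the alpha/r factor scales how strongly those updates affect the output.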