Heuristic:Avhz RustQuant Learning Rate Tuning
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Calibration |
| Last Updated | 2026-02-07 19:00 GMT |
Overview
Tuning the learning rate and convergence tolerance for RustQuant's gradient descent optimizer, with `sqrt(f64::EPSILON)` as the default tolerance.
Description
RustQuant's `GradientDescent` optimizer uses a fixed learning rate with convergence checked via the Euclidean norm of the gradient against a tolerance threshold. The default tolerance is `f64::EPSILON.sqrt()` (approximately `1.49e-8`), which balances machine precision limits with practical convergence. The test suite reveals that learning rate selection depends heavily on the objective function landscape: well-behaved convex functions tolerate `lr = 0.1`, while ill-conditioned functions like Rosenbrock require `lr = 0.001` with 10x more iterations.
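The behavior described above can be condensed into a stand-alone sketch. This is not RustQuant's implementation, only the control flow the description attributes to it: a fixed step, a Euclidean-norm stationarity test, and `f64::EPSILON.sqrt()` as the fallback tolerance.

```rust
// Stand-alone sketch (not RustQuant's code) of the behaviour described
// above: fixed learning rate, convergence tested on the Euclidean norm
// of the gradient, sqrt(EPSILON) as the default tolerance.
fn gradient_descent(
    grad: impl Fn(&[f64]) -> Vec<f64>, // analytic or autodiff gradient
    x0: &[f64],
    learning_rate: f64,
    max_iterations: usize,
    tolerance: Option<f64>,
) -> Vec<f64> {
    let tol = tolerance.unwrap_or(f64::EPSILON.sqrt()); // ~1.49e-8
    let mut x = x0.to_vec();
    for _ in 0..max_iterations {
        let g = grad(&x);
        let norm = g.iter().map(|&gi| gi * gi).sum::<f64>().sqrt();
        if norm < tol {
            break; // stationary within tolerance
        }
        for (xi, gi) in x.iter_mut().zip(&g) {
            *xi -= learning_rate * gi;
        }
    }
    x
}

fn main() {
    // f(x) = x^2 with gradient 2x: well-conditioned, lr = 0.1 suffices.
    let min = gradient_descent(|x| vec![2.0 * x[0]], &[10.0], 0.1, 1000, None);
    println!("minimum near {:.6}", min[0]); // prints "minimum near 0.000000"
}
```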
Usage
Apply this heuristic when using `GradientDescent::new()` for model calibration workflows. Start with the recommended defaults and adjust based on observed convergence behavior.
The Insight (Rule of Thumb)
- Default Tolerance: Leave tolerance as `None` to use `f64::EPSILON.sqrt()` (~1.49e-8). This is near-optimal for double-precision floating point.
- Well-conditioned functions (x^2, Booth): Use `learning_rate = 0.1`, `max_iterations = 1000`.
- Ill-conditioned functions (Rosenbrock): Use `learning_rate = 0.001`, `max_iterations = 10000`.
- Trade-off: Larger learning rate converges faster but risks oscillation or divergence. Smaller learning rate is more stable but may require many more iterations.
- Diagnostic: Enable `verbose = true` to monitor gradient norm and function value per iteration. If oscillations occur, reduce the learning rate by 10x.
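The trade-off and the oscillation diagnostic are easiest to see on the simplest quadratic. For `f(x) = x^2` the update is `x <- x - lr * 2x = (1 - 2*lr) * x`, so any `lr > 1` makes `|1 - 2*lr| > 1` and the iterates flip sign with growing magnitude. A minimal stand-alone sketch (not RustQuant code):

```rust
// For f(x) = x^2 the descent step is x <- (1 - 2*lr) * x.
// lr = 0.1 contracts toward 0; lr = 1.1 oscillates and diverges.
fn main() {
    for &lr in &[0.1_f64, 1.1] {
        let mut x = 10.0_f64;
        for _ in 0..5 {
            x -= lr * 2.0 * x; // gradient of x^2 is 2x
        }
        println!("lr = {lr}: x after 5 steps = {x:.3}");
        // lr = 0.1: 3.277 (10 * 0.8^5)
        // lr = 1.1: -24.883 (10 * (-1.2)^5)
    }
}
```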
Reasoning
The convergence tolerance `f64::EPSILON.sqrt()` is the standard choice for gradient-based methods because:
- Gradient computations via autodiff introduce rounding errors of order `f64::EPSILON`
- The gradient norm threshold should be above the noise floor but below meaningful signal
- `sqrt(EPSILON) ~ 1.49e-8` is the geometric mean of machine precision (`f64::EPSILON ~ 2.22e-16`) and unity, placing the threshold midway on a log scale between the rounding-noise floor and order-one gradient values
The Rosenbrock function is a classic test for optimizers because its minimum lies in a narrow curved valley where the gradient is nearly perpendicular to the valley direction. The 100x ratio between the `0.1` and `0.001` learning rates directly reflects the condition number difference between well-conditioned and ill-conditioned problems.
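To make the conditioning argument concrete, here is a hedged stand-alone sketch (again not RustQuant's code) pairing the Rosenbrock function with its analytic gradient: at `lr = 0.1` the iterates overshoot the steep valley walls and blow up, while `lr = 0.001` tracks the valley floor toward the minimum at `(1, 1)`.

```rust
// The Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2
// and its analytic gradient, run through a plain fixed-step loop.
fn rosenbrock(x: f64, y: f64) -> f64 {
    (1.0 - x).powi(2) + 100.0 * (y - x * x).powi(2)
}

fn grad(x: f64, y: f64) -> (f64, f64) {
    let gx = -2.0 * (1.0 - x) - 400.0 * x * (y - x * x);
    let gy = 200.0 * (y - x * x);
    (gx, gy)
}

fn descend(mut x: f64, mut y: f64, lr: f64, iters: usize) -> (f64, f64) {
    for _ in 0..iters {
        let (gx, gy) = grad(x, y);
        x -= lr * gx;
        y -= lr * gy;
    }
    (x, y)
}

fn main() {
    // lr = 0.1 overshoots the steep valley walls and diverges ...
    let (x, y) = descend(0.0, 5.0, 0.1, 50);
    println!("lr=0.1:   finite = {}", x.is_finite() && y.is_finite());

    // ... while lr = 0.001 crawls along the valley toward (1, 1).
    let (x, y) = descend(0.0, 5.0, 0.001, 10_000);
    println!("lr=0.001: ({x:.4}, {y:.4}), f = {:.2e}", rosenbrock(x, y));
}
```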
Code Evidence
Default tolerance fallback from `gradient_descent.rs:160`:
```rust
let tolerance = self.tolerance.unwrap_or(f64::EPSILON.sqrt());
```
Stationarity check from `gradient_descent.rs:137-139`:
```rust
fn is_stationary(gradient: &[f64], tol: f64) -> bool {
    gradient.iter().map(|&x| x * x).sum::<f64>().sqrt() < tol
}
```
Well-conditioned test (lr=0.1, 1000 iters) from `gradient_descent.rs:262`:
```rust
let gd = GradientDescent::new(0.1, 1000, Some(0.000_001));
let result = gd.optimize(f, &[10.0], false);
```
Ill-conditioned test (lr=0.001, 10000 iters) from `gradient_descent.rs:309`:
```rust
let gd = GradientDescent::new(0.001, 10000, Some(0.000_001));
let result = gd.optimize(f, &[0.0, 5.0], false);
```