Principle:Intel Ipex llm GaLore Gradient Projection
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Memory_Efficient_Training |
| Last Updated | 2026-02-09 04:00 GMT |
Overview
GaLore is an optimization technique that reduces training memory by projecting gradients into a low-rank subspace, so that all parameters are still trained while the optimizer stores state only for the projected gradient.
Description
GaLore (Gradient Low-Rank Projection) addresses the memory bottleneck of optimizer states (e.g., Adam's first and second moments) by projecting the gradient matrix into a low-rank subspace before applying the optimizer. Unlike LoRA, which restricts the weight update itself to a low-rank form, GaLore projects only the gradient: optimizer states are stored only for the projected dimensions, and the optimizer's output is projected back to the full parameter shape before it is applied. The projection basis is periodically recomputed to track the changing gradient distribution, so the updates accumulated across different subspaces can add up to a full-rank weight change.
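The claim that rank-limited steps can accumulate into a higher-rank weight change can be checked numerically. The following is a minimal sketch, not GaLore itself: the layer shape, the rank, the helper `random_basis`, and the use of random matrices as stand-ins for optimizer outputs are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 32, 24, 4  # hypothetical layer shape (m, n) and projection rank r


def random_basis(n, r, rng):
    """Return an (n, r) matrix with orthonormal columns (hypothetical helper)."""
    q, _ = np.linalg.qr(rng.standard_normal((n, r)))
    return q


total_update = np.zeros((m, n))
ranks = []
for period in range(3):                          # three projection refreshes
    P = random_basis(n, r, rng)                  # new subspace for this period
    low_rank_step = rng.standard_normal((m, r))  # stand-in for an optimizer output
    total_update += low_rank_step @ P.T          # project back to full size
    ranks.append(int(np.linalg.matrix_rank(total_update)))

print(ranks)  # rank of the accumulated update after each refresh
```

Each individual step has rank at most `r`, but because the basis changes between periods, the rank of the accumulated update can grow with every refresh. A LoRA-style update with a fixed rank-`r` factorization would stay at rank `r`.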
Usage
Use this principle when memory is the primary constraint and LoRA's restriction to low-rank weight updates is limiting. GaLore is complementary to quantization and can be combined with it for further memory savings. It is particularly effective for pre-training and full-parameter fine-tuning scenarios.
Theoretical Basis
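Given gradient $G \in \mathbb{R}^{m \times n}$ and projection matrix $P \in \mathbb{R}^{n \times r}$ (where $r \ll \min(m, n)$):

$$G_{\text{proj}} = G P \in \mathbb{R}^{m \times r}$$

The optimizer operates on $G_{\text{proj}}$, reducing state memory from $O(mn)$ to $O(mr)$.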
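To make the memory reduction concrete, here is a small worked calculation. It assumes a gradient of shape `(m, n)` projected on the right by a basis of shape `(n, r)`, matching the `G @ P` convention in the pseudo-code, with Adam keeping two float32 buffers per tracked value. The layer shape and rank are hypothetical, and the sketch also counts the storage for `P` itself, which the optimizer-state figure alone leaves out.

```python
def adam_state_bytes(rows, cols, bytes_per_value=4):
    """Bytes for Adam's two moment buffers over a rows x cols matrix."""
    return 2 * rows * cols * bytes_per_value


m, n, r = 4096, 11008, 128  # hypothetical layer shape and projection rank

full = adam_state_bytes(m, n)       # moments kept for the full gradient
projected = adam_state_bytes(m, r)  # moments kept for the projected gradient
basis = n * r * 4                   # float32 storage for the projection matrix P

print(f"full Adam states:      {full / 2**20:.1f} MiB")
print(f"projected Adam states: {projected / 2**20:.1f} MiB")
print(f"projection matrix P:   {basis / 2**20:.1f} MiB")
print(f"reduction factor:      {full / (projected + basis):.1f}x")
```

The saving applies only to optimizer state for the matrices that are projected. Weights, activations, and the full gradient of each layer still have to be materialized at some point during the backward pass.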
Pseudo-code Logic:
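# Abstract GaLore algorithm (pseudo-code, not runnable as-is)
P = None
for step in training:
    G = compute_gradient(model)                # full gradient, shape (m, n)
    if P is None or step % update_proj_gap == 0:
        P = compute_svd_projection(G, rank=r)  # refresh the basis, shape (n, r)
    G_proj = G @ P                             # project to low rank, shape (m, r)
    update = optimizer.step(G_proj)            # optimizer states live in the reduced space
    model.weight -= lr * (update @ P.T)        # project back to a full-size weight update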
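The pseudo-code can be turned into a small runnable sketch with NumPy. This is an illustration under stated assumptions, not the reference implementation: it uses a toy linear least-squares problem, a hand-written Adam update, a right-side projection to match `G @ P`, and a hypothetical helper `svd_projection`. All hyperparameters are arbitrary choices for the demo.

```python
import numpy as np


def svd_projection(G, rank):
    """Top-`rank` right singular vectors of G as an (n, rank) basis."""
    _, _, vt = np.linalg.svd(G, full_matrices=False)
    return vt[:rank].T


rng = np.random.default_rng(0)
m, n, r = 20, 30, 4                     # weight shape (m, n), projection rank r
X = rng.standard_normal((64, m))        # toy inputs
Y = X @ rng.standard_normal((m, n))     # toy targets from a random linear map
W = np.zeros((m, n))                    # full-size trainable weight

lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8
update_proj_gap = 20                    # steps between basis refreshes
m1 = np.zeros((m, r))                   # Adam first moment, reduced shape
m2 = np.zeros((m, r))                   # Adam second moment, reduced shape
losses = []

for step in range(1, 301):
    residual = X @ W - Y
    losses.append(0.5 * float(np.sum(residual ** 2)) / len(X))
    G = X.T @ residual / len(X)         # gradient of the loss above, shape (m, n)

    if step == 1 or step % update_proj_gap == 0:
        P = svd_projection(G, r)        # refresh the low-rank basis

    G_proj = G @ P                      # projected gradient, shape (m, r)
    m1 = beta1 * m1 + (1 - beta1) * G_proj
    m2 = beta2 * m2 + (1 - beta2) * G_proj ** 2
    m1_hat = m1 / (1 - beta1 ** step)   # bias-corrected moments
    m2_hat = m2 / (1 - beta2 ** step)
    update = m1_hat / (np.sqrt(m2_hat) + eps)

    W -= lr * (update @ P.T)            # project back and apply to the full weight

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("optimizer state shape:", m1.shape, "weight shape:", W.shape)
```

Note that the moment buffers are carried over unchanged when the basis is refreshed, which is a simplification chosen here for brevity. A production setup would normally rely on an existing optimizer implementation rather than a hand-written loop like this one.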