Principle:Intel Ipex llm GaLore Gradient Projection

Knowledge Sources	GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection
Domains	Optimization, Memory_Efficient_Training
Last Updated	2026-02-09 04:00 GMT

Overview

Optimization technique that reduces training memory by projecting gradients into a low-rank subspace, enabling full-parameter learning with reduced optimizer state memory.

Description

GaLore (Gradient Low-Rank Projection) addresses the memory bottleneck of optimizer states (e.g., Adam's first and second moments) by projecting each gradient matrix into a low-rank subspace before it reaches the optimizer, then projecting the optimizer's output back to the full parameter shape for the weight update. Unlike LoRA, which restricts the weight update itself to a low-rank form, GaLore projects only the gradient. Because the projection basis is periodically recomputed to track the changing gradient distribution, the accumulated weight update can be full-rank even though optimizer states are stored only for the projected dimensions.
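The claim that the effective update can be full-rank can be checked numerically. The following is a minimal sketch, assuming NumPy, random Gaussian matrices standing in for real gradients, and a hypothetical helper `top_r_basis`. Each individual projected update has rank at most `r`, but the sum of updates taken in different subspaces generally has a higher rank.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 8, 6, 2  # toy sizes, chosen only for illustration


def top_r_basis(grad, rank):
    """Return the top-`rank` right singular vectors of grad, shape (n, rank)."""
    _, _, vt = np.linalg.svd(grad, full_matrices=False)
    return vt[:rank].T


total_update = np.zeros((m, n))
ranks = []
for _ in range(3):
    grad = rng.standard_normal((m, n))   # stand-in for a real gradient
    basis = top_r_basis(grad, r)         # basis refreshed for every step here
    total_update += grad @ basis @ basis.T  # a single rank-<=r update
    ranks.append(int(np.linalg.matrix_rank(total_update)))

print(ranks)  # rank of the accumulated update after each step
```

With a fixed basis the accumulated rank would stay at `r`; refreshing the basis is what lets it grow.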

Usage

Use this principle when memory is the primary constraint and LoRA's restriction to low-rank weight updates is limiting. GaLore is complementary to quantization and can be combined with it for further memory savings. It is particularly effective for pre-training and full-parameter fine-tuning scenarios.

Theoretical Basis

Given gradient $G \in \mathbb{R}^{m \times n}$ and projection matrix $P \in \mathbb{R}^{n \times r}$ (where $r \ll n$):

$G_{\text{proj}} = G P$

The optimizer operates on $G_{\text{proj}} \in \mathbb{R}^{m \times r}$, reducing state memory from $O(mn)$ to $O(mr)$.
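To make the $O(mn)$ versus $O(mr)$ difference concrete, here is a small back-of-the-envelope calculation for a single weight matrix. It is a sketch under stated assumptions: Adam keeps two moment buffers, values are stored in 4-byte floats, and the projection matrix $P$ is counted as extra storage. The function name `adam_state_bytes` and the example sizes are hypothetical.

```python
def adam_state_bytes(m, n, rank=None, bytes_per_value=4):
    """Approximate optimizer-state bytes for one m x n weight matrix.

    Without `rank`: two full m x n moment buffers.
    With `rank`: two m x rank moment buffers plus the n x rank projection matrix.
    """
    if rank is None:
        return 2 * m * n * bytes_per_value
    moments = 2 * m * rank * bytes_per_value
    projection = n * rank * bytes_per_value
    return moments + projection


full = adam_state_bytes(4096, 4096)               # 134217728 bytes (128 MiB)
low_rank = adam_state_bytes(4096, 4096, rank=128)  # 6291456 bytes (6 MiB)
print(full, low_rank, round(full / low_rank, 2))
```

For these assumed sizes the optimizer state for the matrix shrinks by roughly a factor of 21.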

Pseudo-code Logic:

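# Abstract GaLore algorithm (right projection, P has shape n x r)
P = None
for step in range(num_steps):
    G = compute_gradient(model)                # full gradient, shape m x n
    if step % update_proj_gap == 0:
        P = compute_svd_projection(G, rank=r)  # refresh basis from the current gradient
    G_proj = G @ P                             # project to low rank, shape m x r
    N_proj = optimizer.update(G_proj)          # optimizer state lives in the reduced space
    W = W - lr * (N_proj @ P.T)                # project the update back to shape m x n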
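The pseudo-code above can be made concrete with a small NumPy sketch. This is an illustration under stated assumptions, not the IPEX-LLM or reference GaLore implementation: the function names `compute_projection` and `galore_adam`, the `grad_fn` callback, the hyperparameter defaults, and the toy quadratic objective are all hypothetical. It applies a plain Adam update in the projected space and projects the result back before updating the weight.

```python
import numpy as np


def compute_projection(grad, rank):
    """Top-`rank` right singular vectors of grad, shape (n, rank)."""
    _, _, vt = np.linalg.svd(grad, full_matrices=False)
    return vt[:rank].T


def galore_adam(weight, grad_fn, rank, steps, update_proj_gap=25,
                lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with moments stored only for the projected (m x rank) gradient."""
    m, _ = weight.shape
    exp_avg = np.zeros((m, rank))     # first moment, reduced space
    exp_avg_sq = np.zeros((m, rank))  # second moment, reduced space
    proj = None
    for step in range(1, steps + 1):
        grad = grad_fn(weight)                       # full m x n gradient
        if proj is None or (step - 1) % update_proj_gap == 0:
            proj = compute_projection(grad, rank)    # refresh the basis
        grad_proj = grad @ proj                      # shape (m, rank)
        exp_avg = beta1 * exp_avg + (1 - beta1) * grad_proj
        exp_avg_sq = beta2 * exp_avg_sq + (1 - beta2) * grad_proj ** 2
        m_hat = exp_avg / (1 - beta1 ** step)        # bias correction
        v_hat = exp_avg_sq / (1 - beta2 ** step)
        update_proj = m_hat / (np.sqrt(v_hat) + eps)
        weight = weight - lr * (update_proj @ proj.T)  # back to m x n
    return weight, exp_avg, exp_avg_sq


# Toy objective (an assumption): 0.5 * ||W - target||_F^2, gradient W - target.
rng = np.random.default_rng(0)
target = rng.standard_normal((16, 12))
w0 = np.zeros((16, 12))


def loss(w):
    return 0.5 * float(np.sum((w - target) ** 2))


w_final, exp_avg, exp_avg_sq = galore_adam(
    w0, lambda w: w - target, rank=4, steps=100)
print(loss(w0), loss(w_final), exp_avg.shape)
```

The moment buffers keep shape `(m, rank)` throughout, which is where the memory saving comes from. This sketch carries the old moments across a basis refresh unchanged, which is one possible design choice rather than a requirement of the technique.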

Related Pages

Implementation:Intel_Ipex_llm_GaLore_Finetuning
