Heuristic:Unslothai Unsloth Merge Memory Management

From Leeroopedia




Knowledge Sources
Domains: Optimization, Memory_Management, Model_Export
Last Updated: 2026-02-07 09:00 GMT

Overview

During a LoRA merge, Unsloth spreads the merged weights across GPU VRAM and disk. The `maximum_memory_usage` parameter sets the fraction of available memory the merge may use: it defaults to 0.75 for merge operations and 0.85-0.9 for GGUF operations.

Description

LoRA merging dequantizes 4-bit weights to 16-bit, applies the LoRA delta, and writes the merged result. This temporarily doubles or triples the model's memory footprint. Unsloth uses a tiered storage strategy: merged weights are kept in GPU VRAM first, and once VRAM fills up they spill to disk (the intermediate RAM tier is disabled because of a suspected memory leak). The `maximum_memory_usage` parameter controls how aggressively memory is used. On machines with 2 or fewer physical CPU cores, `safe_serialization` is automatically disabled in favor of pickle-based saving, because SafeTensors serialization is roughly 10x slower on such machines.

Usage

Use the default `maximum_memory_usage=0.75` for standard merge operations (SafeTensors save). Increase it to `0.85`-`0.9` for GGUF export, where the intermediate format is temporary. Never exceed `0.95`, which is a hard cap. On low-VRAM GPUs (8-12 GB), expect disk spilling, accompanied by the warning "We will save to Disk and not RAM now." This is normal and does not affect output quality.
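
The guidance above can be captured in a small validation helper. This is a minimal sketch: `pick_memory_usage`, `DEFAULT_MEMORY_USAGE`, and `HARD_CAP` are hypothetical names introduced for illustration and are not part of Unsloth. Only the numeric values (0.75, 0.85, 0.95) come from this page.

```python
from typing import Optional

# Hypothetical helper, not Unsloth's API. Values are the ones quoted on this page.
DEFAULT_MEMORY_USAGE = {"merge": 0.75, "gguf": 0.85}
HARD_CAP = 0.95


def pick_memory_usage(target: str, override: Optional[float] = None) -> float:
    """Return the memory fraction for an export target, rejecting unsafe values."""
    if target not in DEFAULT_MEMORY_USAGE:
        raise ValueError(f"unknown export target: {target!r}")
    value = DEFAULT_MEMORY_USAGE[target] if override is None else override
    if not 0.0 < value <= HARD_CAP:
        raise ValueError(
            f"maximum_memory_usage must be in (0, {HARD_CAP}], got {value}"
        )
    return value


print(pick_memory_usage("merge"))      # 0.75
print(pick_memory_usage("gguf", 0.9))  # 0.9
```

Rejecting out-of-range values up front, rather than silently clamping them, makes a mistyped fraction fail before a long merge starts.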

The Insight (Rule of Thumb)

  • Action: Use `maximum_memory_usage=0.75` for merge, `0.85` for GGUF.
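  • Value: VRAM budget = `total_VRAM * maximum_memory_usage`; RAM budget = `max(0, available_RAM - shard_reserve) * maximum_memory_usage`, where `shard_reserve` is 5 GiB with `safe_serialization` and a quarter of that without it.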
  • Trade-off: Higher values risk OOM crashes; lower values cause more disk I/O. The 0.75 default leaves 25% VRAM headroom for PyTorch overhead.
  • Compatibility: On machines with 2 or fewer physical CPU cores, `safe_serialization` is auto-disabled, because SafeTensors saving is about 10x slower there.

Reasoning

The merge process creates 16-bit (half-precision) copies of each layer. For a 7B model, this requires ~14GB of temporary storage (7 billion parameters at 2 bytes each). The tiered VRAM-to-disk strategy avoids OOM while keeping as much data in fast GPU memory as possible. RAM storage was disabled because of a suspected memory leak (`[TODO] Saving to RAM seems to leak memory???` in save.py:646). The different defaults (0.75 for merge, 0.85 for GGUF) reflect that GGUF conversion is a pipeline where intermediate files are consumed and deleted, making higher memory usage safe.
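
The ~14GB figure follows from simple arithmetic. The sketch below is a back-of-the-envelope estimate only: `merged_size_bytes` is a hypothetical helper, and it counts parameter storage alone, ignoring buffers and allocator overhead.

```python
def merged_size_bytes(n_params: int, bytes_per_param: int = 2) -> int:
    """Storage needed for n_params weights at the given width (2 bytes = 16-bit)."""
    return n_params * bytes_per_param


size = merged_size_bytes(7_000_000_000)
print(f"{size / 1e9:.1f} GB")     # 14.0 GB
print(f"{size / 2**30:.1f} GiB")  # 13.0 GiB
```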

Memory budget calculation from `save.py:540-600`:

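import psutil
import torch

# `safe_serialization` and `maximum_memory_usage` are arguments of the enclosing save call.
max_ram = psutil.virtual_memory().available
sharded_ram_usage = 5 * 1024 * 1024 * 1024  # 5 GiB shard allowance
if safe_serialization:
    max_ram -= sharded_ram_usage
else:
    max_ram -= sharded_ram_usage * 0.25  # a quarter of the allowance otherwise
# Clamp at zero, then scale both budgets by the requested fraction.
max_ram = int(max(0, max_ram) * maximum_memory_usage)
max_vram = int(
    torch.cuda.get_device_properties(0).total_memory * maximum_memory_usage
)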
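
To see what the calculation above yields for concrete hardware, here is a sketch of the same arithmetic as a pure function. `memory_budgets` and the example machine (16 GiB of free RAM, a 12 GiB GPU) are assumptions made for illustration. The function takes byte counts as inputs instead of querying `psutil` or `torch`, so it runs anywhere.

```python
GIB = 1024 ** 3
SHARD_ALLOWANCE = 5 * GIB  # same constant as `sharded_ram_usage` in the excerpt


def memory_budgets(available_ram, total_vram,
                   maximum_memory_usage=0.75, safe_serialization=True):
    """Return (max_ram, max_vram) in bytes, following the excerpt's arithmetic."""
    reserve = SHARD_ALLOWANCE if safe_serialization else SHARD_ALLOWANCE * 0.25
    max_ram = int(max(0, available_ram - reserve) * maximum_memory_usage)
    max_vram = int(total_vram * maximum_memory_usage)
    return max_ram, max_vram


max_ram, max_vram = memory_budgets(16 * GIB, 12 * GIB)
print(max_ram / GIB, max_vram / GIB)  # 8.25 9.0
```

Note that the shard allowance is subtracted before scaling, so a machine with less than 5 GiB of free RAM ends up with a RAM budget of zero.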

Tiered storage with RAM disabled from `save.py:640-660`:

if (torch.cuda.memory_allocated() + W.nbytes) < max_vram:
    # Save to GPU memory
    state_dict[name] = W
# [TODO] Saving to RAM seems to leak memory???
# elif (max_ram - W.nbytes) > 0:
#     state_dict[name] = W.to("cpu", non_blocking = True, copy = True)
else:
    # Save to Disk
    logger.warning_once("\nWe will save to Disk and not RAM now.")
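
The placement rule above can be simulated without a GPU. The following is a simplified model, not Unsloth's code: `plan_placement` is a hypothetical function, tensor sizes are plain integers, and a running counter stands in for `torch.cuda.memory_allocated()`. It assumes a tensor kept on the GPU keeps occupying VRAM and a tensor written to disk does not.

```python
def plan_placement(tensor_sizes, max_vram, already_allocated=0):
    """Assign each (name, nbytes) pair to 'gpu' or 'disk', in order."""
    allocated = already_allocated  # stand-in for torch.cuda.memory_allocated()
    plan = []
    for name, nbytes in tensor_sizes:
        if allocated + nbytes < max_vram:
            plan.append((name, "gpu"))
            allocated += nbytes  # kept tensors continue to occupy VRAM
        else:
            plan.append((name, "disk"))  # the RAM tier is skipped, as in the excerpt
    return plan


layers = [("layer0", 4), ("layer1", 4), ("layer2", 4), ("norm", 1)]
print(plan_placement(layers, max_vram=10))
# [('layer0', 'gpu'), ('layer1', 'gpu'), ('layer2', 'disk'), ('norm', 'gpu')]
```

The last entry shows that spilling is decided per tensor: a small tensor can still land in VRAM after a larger one has gone to disk.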

Low-core CPU optimization from `save.py:560-575`:

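import psutil

n_cpus = psutil.cpu_count(logical = False)  # physical cores only
# In save.py this test is an `elif`; the preceding branch is not shown in this excerpt.
if safe_serialization and (n_cpus <= 2):
    logger.warning_once(
        f"Unsloth: You have {n_cpus} CPUs. Using `safe_serialization` is 10x slower.\n"
        f"We shall switch to Pytorch saving, which might take 3 minutes and not 30 minutes."
    )
    # Fall back to pickle-based saving on low-core machines.
    safe_serialization = False
    save_function = fast_save_pickle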
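
The low-core decision above can be isolated into a testable function. This is a sketch under stated assumptions: `choose_save_format` is a hypothetical name, and the `None` check is an addition for this example, since `psutil.cpu_count(logical=False)` can return `None` when the core count cannot be determined. The core count is passed in as an argument so the example does not depend on the host machine.

```python
from typing import Optional, Tuple


def choose_save_format(n_physical_cpus: Optional[int],
                       safe_serialization: bool = True) -> Tuple[bool, str]:
    """Return (safe_serialization, format_name) after the low-core check."""
    low_core = n_physical_cpus is not None and n_physical_cpus <= 2
    if safe_serialization and low_core:
        # SafeTensors is reported as ~10x slower here, so fall back to pickle.
        return False, "pickle"
    fmt = "safetensors" if safe_serialization else "pickle"
    return safe_serialization, fmt


print(choose_save_format(2))  # (False, 'pickle')
print(choose_save_format(8))  # (True, 'safetensors')
```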
