# Heuristic: ggml-org/ggml Memory Allocation Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-10 07:40 GMT |
## Overview
GGML's memory allocator combines first-fit allocation with reuse scoring, tracks up to 256 free blocks, and uses a virtual buffer system capped at 16 chunks for efficient graph memory management.
## Description
GGML's memory allocator (`ggml-alloc.c`) provides two layers of memory management. The tensor allocator manages individual tensor allocations within a buffer using a free-list approach with up to 256 tracked free blocks. The graph allocator (`ggml_gallocr`) pre-computes a complete memory layout for a computation graph, enabling memory reuse across tensors that have non-overlapping lifetimes. A virtual buffer system tracks allocations across up to 16 chunks, with the last chunk given effectively infinite size (SIZE_MAX/2) to prevent fragmentation-related failures.
## Usage
Use this heuristic when sizing memory contexts for GGML operations. Pre-allocate sufficient context memory for tensor metadata (use `1024 * ggml_tensor_overhead()` as a baseline). For computation graphs, allocate space for `GGML_DEFAULT_GRAPH_SIZE * ggml_tensor_overhead() + 3 * ggml_graph_overhead()` to cover forward, backward, and optimizer graphs.
## The Insight (Rule of Thumb)
- Action: Pre-compute memory requirements using graph allocation, then allocate once. Avoid dynamic allocation during inference.
- Values:
  - Free block tracking limit: 256 blocks (`MAX_FREE_BLOCKS`)
  - Virtual buffer chunk limit: 16 chunks (`GGML_VBUFFER_MAX_CHUNKS`)
  - Default graph size: 2048 nodes (`GGML_DEFAULT_GRAPH_SIZE`)
  - Static context: ~1024 tensor metadata slots
  - Compute context: `GGML_DEFAULT_GRAPH_SIZE` tensor slots + 3 graph overheads
- Trade-off: Graph-based pre-allocation eliminates runtime allocation overhead but requires knowing the full computation graph upfront. Virtual buffers add a level of indirection but prevent allocation failures from fragmentation.
## Reasoning
The allocator design reflects several key insights:
- First-fit with scoring: The allocator uses a reuse factor to prefer allocations that minimize wasted space. A negative reuse factor means extra allocation is needed; zero means a perfect fit; positive means wasted space. This approach avoids both under-allocation and excessive fragmentation.
- 256-block limit: Tracking more free blocks increases allocation time without significant benefit, as well-structured computation graphs tend to reuse memory efficiently.
- Graph pre-computation: By analyzing tensor lifetimes in the computation graph, the allocator can reuse memory for tensors that do not coexist, dramatically reducing peak memory usage compared to naive allocation.
- 3 graph overheads: Training requires forward graph, backward graph, and optimizer step graph, each needing separate graph metadata allocation.
## Code Evidence
Free block limit from `src/ggml-alloc.c:13`:

```c
#define MAX_FREE_BLOCKS 256
```
Reuse factor scoring from `src/ggml-alloc.c:238-240`:

```c
// reuse_factor < 0 : amount of extra memory that needs to be allocated
// reuse_factor = 0 : allocated free space exactly matches tensor size
// reuse_factor > 0 : superfluous memory that will remain unused
```
Context sizing for training from `examples/mnist/mnist-common.cpp:122-133`:

```cpp
const size_t size_meta = 1024*ggml_tensor_overhead();
// ... static context for model weights

const size_t size_meta = GGML_DEFAULT_GRAPH_SIZE*ggml_tensor_overhead()
    + 3*ggml_graph_overhead();
// ... compute context for graphs (forward + backward + optimizer)
```
Default graph size from `include/ggml.h:233`:

```c
#define GGML_DEFAULT_GRAPH_SIZE 2048
```