# Heuristic: ggml-org/ggml Memory Allocation Strategy
| Knowledge Sources | |
|---|---|
| Domains | Optimization, Infrastructure |
| Last Updated | 2026-02-10 07:40 GMT |
## Overview
GGML's memory allocator combines first-fit allocation with reuse scoring, tracks up to 256 free blocks, and uses a virtual buffer system capped at 16 chunks for efficient graph memory management.
## Description
GGML's memory allocator (`ggml-alloc.c`) provides two layers of memory management. The tensor allocator manages individual tensor allocations within a buffer using a free-list approach with up to 256 tracked free blocks. The graph allocator (`ggml_gallocr`) pre-computes a complete memory layout for a computation graph, enabling memory reuse across tensors that have non-overlapping lifetimes. A virtual buffer system tracks allocations across up to 16 chunks, with the last chunk given effectively infinite size (SIZE_MAX/2) to prevent fragmentation-related failures.
## Usage
Use this heuristic when sizing memory contexts for GGML operations. Pre-allocate sufficient context memory for tensor metadata (use `1024 * ggml_tensor_overhead()` as a baseline). For computation graphs, allocate space for `GGML_DEFAULT_GRAPH_SIZE * ggml_tensor_overhead() + 3 * ggml_graph_overhead()` to cover forward, backward, and optimizer graphs.
## The Insight (Rule of Thumb)
- Action: Pre-compute memory requirements using graph allocation, then allocate once. Avoid dynamic allocation during inference.
- Values:
  - Free block tracking limit: 256 blocks (`MAX_FREE_BLOCKS`)
  - Virtual buffer chunk limit: 16 chunks (`GGML_VBUFFER_MAX_CHUNKS`)
  - Default graph size: 2048 nodes (`GGML_DEFAULT_GRAPH_SIZE`)
  - Static context: ~1024 tensor metadata slots
  - Compute context: `GGML_DEFAULT_GRAPH_SIZE` tensor slots + 3 graph overheads
- Trade-off: Graph-based pre-allocation eliminates runtime allocation overhead but requires knowing the full computation graph upfront. Virtual buffers add a level of indirection but prevent allocation failures from fragmentation.
## Reasoning
The allocator design reflects several key insights:
- First-fit with scoring: The allocator uses a reuse factor to prefer allocations that minimize wasted space. A negative reuse factor means extra allocation is needed; zero means a perfect fit; positive means wasted space. This approach avoids both under-allocation and excessive fragmentation.
- 256-block limit: Tracking more free blocks increases allocation time without significant benefit, as well-structured computation graphs tend to reuse memory efficiently.
- Graph pre-computation: By analyzing tensor lifetimes in the computation graph, the allocator can reuse memory for tensors that do not coexist, dramatically reducing peak memory usage compared to naive allocation.
- 3 graph overheads: Training requires forward graph, backward graph, and optimizer step graph, each needing separate graph metadata allocation.
## Code Evidence
Free block limit from `src/ggml-alloc.c:13`:

```c
#define MAX_FREE_BLOCKS 256
```
Reuse factor scoring from `src/ggml-alloc.c:238-240`:

```c
// reuse_factor < 0 : amount of extra memory that needs to be allocated
// reuse_factor = 0 : allocated free space exactly matches tensor size
// reuse_factor > 0 : superfluous memory that will remain unused
```
Context sizing for training from `examples/mnist/mnist-common.cpp:122-133`:

```cpp
const size_t size_meta = 1024*ggml_tensor_overhead();
// ... static context for model weights

const size_t size_meta = GGML_DEFAULT_GRAPH_SIZE*ggml_tensor_overhead()
    + 3*ggml_graph_overhead();
// ... compute context for graphs (forward + backward + optimizer)
```
Default graph size from `include/ggml.h:233`:

```c
#define GGML_DEFAULT_GRAPH_SIZE 2048
```