Implementation:Ggml org Llama cpp Save Load State
| Knowledge Sources | |
|---|---|
| Domains | State_Management, Example |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Demonstrates saving and loading llama.cpp context state (KV cache, logits, and embeddings) to verify that generation resumes deterministically.
Description
Generates tokens from a prompt (default: "The quick brown fox") and saves the full context state to a byte buffer, which is also written to dump_state.bin. It then creates a new context, loads the saved state, and continues generation; the output must match the original run. The example repeats the check with per-sequence state, saving and loading it across multiple sequences, and compares the generated text of every run to confirm deterministic behavior after restoration.
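The save, restore, and compare flow described above can be sketched without a model. This is a toy illustration only: `ToyContext` and its `state_size`, `save`, `load`, and `next_token` members are hypothetical stand-ins whose entire state is a single counter, not llama.cpp functions. The sketch assumes only that a context can be serialized to bytes and that generation depends on nothing outside that state.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an inference context: its whole state is one
// 64-bit counter advanced by a fixed LCG step. This is NOT the llama.cpp API.
struct ToyContext {
    uint64_t state = 1234;

    // Maximum number of bytes a snapshot can occupy.
    size_t state_size() const { return sizeof(state); }

    // Copy the state into dst; returns bytes written, or 0 if dst is too small.
    size_t save(uint8_t * dst, size_t size) const {
        if (size < sizeof(state)) return 0;
        std::memcpy(dst, &state, sizeof(state));
        return sizeof(state);
    }

    // Restore the state from src; returns bytes read, or 0 if src is too short.
    size_t load(const uint8_t * src, size_t size) {
        if (size < sizeof(state)) return 0;
        std::memcpy(&state, src, sizeof(state));
        return sizeof(state);
    }

    // Produce the next "token" deterministically from the current state.
    int next_token() {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        return static_cast<int>(state >> 33) % 32000;
    }
};

// Generate n tokens from ctx.
std::vector<int> generate(ToyContext & ctx, int n) {
    std::vector<int> out;
    for (int i = 0; i < n; i++) out.push_back(ctx.next_token());
    return out;
}

// Snapshot ctx, run it, restore the snapshot into a fresh context, run that,
// and report whether both runs produced the same tokens.
bool resumes_deterministically(ToyContext & ctx, int n) {
    std::vector<uint8_t> snapshot(ctx.state_size());
    const size_t written = ctx.save(snapshot.data(), snapshot.size());

    const std::vector<int> first = generate(ctx, n);

    ToyContext ctx2;
    ctx2.state = 0; // deliberately different before the restore
    if (ctx2.load(snapshot.data(), written) != written) return false;

    return generate(ctx2, n) == first;
}
```

The snapshot is taken before the first run so that both runs start from the same point. Any state that influences generation but is missing from the snapshot would cause the comparison to fail.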
Usage
Use this as a reference implementation for state serialization in applications that need to checkpoint and resume inference (e.g., long-running sessions, server state persistence, or speculative decoding rollback).
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: examples/save-load-state/save-load-state.cpp
- Lines: 1-258
Signature
int main(int argc, char ** argv);
Import
#include "arg.h"
#include "common.h"
#include "llama.h"
#include <vector>
#include <cstdio>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m | string | Yes | Path to the GGUF model file |
| -p | string | No | Input prompt (default: "The quick brown fox") |
| -n | int | No | Number of tokens to generate (default: 16) |
| --seed | int | No | Random seed for reproducibility (default: 1234) |
Outputs
| Name | Type | Description |
|---|---|---|
| dump_state.bin | file | Serialized full context state (KV cache, logits, embeddings) |
| stdout | text | Generated text from original run and resumed runs, with comparison results |
| return | int | Exit code: 0 on success (deterministic match), 1 on failure (mismatch) |
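The `dump_state.bin` output implies a file round trip for the serialized buffer. A minimal sketch of that round trip using only C standard I/O follows; `write_state_file` and `read_state_file` are hypothetical helper names chosen for illustration, not functions from llama.cpp or from the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Write the first n bytes of buf to path; returns true if all bytes were written.
bool write_state_file(const char * path, const std::vector<uint8_t> & buf, size_t n) {
    if (n > buf.size()) return false;
    std::FILE * fp = std::fopen(path, "wb");
    if (!fp) return false;
    const size_t put = std::fwrite(buf.data(), 1, n, fp);
    const bool closed = std::fclose(fp) == 0;
    return put == n && closed;
}

// Read the whole file at path into out; returns true on success.
bool read_state_file(const char * path, std::vector<uint8_t> & out) {
    std::FILE * fp = std::fopen(path, "rb");
    if (!fp) return false;
    std::fseek(fp, 0, SEEK_END);
    const long size = std::ftell(fp);
    std::fseek(fp, 0, SEEK_SET);
    if (size < 0) { std::fclose(fp); return false; }
    out.resize(static_cast<size_t>(size));
    const size_t got = std::fread(out.data(), 1, out.size(), fp);
    std::fclose(fp);
    return got == out.size();
}
```

In a real flow, the buffer passed to the writer would be the one filled by `llama_state_get_data`, with `n` set to its return value, and the bytes read back would be handed to `llama_state_set_data`.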
Usage Examples
# Run save-load-state example
./build/bin/llama-save-load-state \
-m model.gguf \
-p "The quick brown fox" \
-n 16 \
--seed 1234
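// Core state save/load pattern from the example
// Save full context state: size the buffer to the reported maximum,
// then keep the number of bytes actually written
std::vector<uint8_t> state_mem(llama_state_get_size(ctx));
const size_t written = llama_state_get_data(ctx, state_mem.data(), state_mem.size());
// Load state into a new context; the return value is the number of bytes read
if (llama_state_set_data(ctx2, state_mem.data(), written) != written) {
    fprintf(stderr, "failed to read state\n");
    return 1;
}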