Implementation:Ggml org Llama cpp Save Load State

Knowledge Sources	Ggml_org_Llama_cpp
Domains	State_Management, Example
Last Updated	2026-02-15 00:00 GMT

Overview

Demonstrates saving and loading llama.cpp inference state (KV cache and sampler state) to verify deterministic resumption.

Description

Generates tokens from a prompt ("The quick brown fox"), then saves the full context state to a byte buffer and the per-sequence state separately. Creates a new context, loads the saved state, and continues generation to verify the output matches the original. Also tests per-sequence state save/load across multiple sequences. Compares generated text across runs to confirm deterministic behavior after state restoration.
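The save, restore, and compare flow described above can be sketched without a model. The block below is a minimal illustration, not code from the example: ToyContext, save_state, load_state, generate, and resume_matches are hypothetical names, and a 64-bit linear congruential state stands in for the real context state. It only shows why a byte-exact snapshot leads to an identical continuation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an inference context: its entire state is one
// 64-bit value, advanced by a fixed linear congruential step per "token".
struct ToyContext {
    uint64_t state = 1234; // plays the role of seed plus cached context

    // Produce the next "token" deterministically from the current state.
    uint32_t next_token() {
        state = state * 6364136223846793005ULL + 1442695040888963407ULL;
        return static_cast<uint32_t>(state >> 33);
    }
};

// Serialize the context into a byte buffer (the snapshot).
std::vector<uint8_t> save_state(const ToyContext & ctx) {
    std::vector<uint8_t> buf(sizeof(ctx.state));
    std::memcpy(buf.data(), &ctx.state, buf.size());
    return buf;
}

// Restore a context from a snapshot; returns the number of bytes consumed.
size_t load_state(ToyContext & ctx, const std::vector<uint8_t> & buf) {
    assert(buf.size() >= sizeof(ctx.state));
    std::memcpy(&ctx.state, buf.data(), sizeof(ctx.state));
    return sizeof(ctx.state);
}

// Generate n tokens from ctx, advancing its state.
std::vector<uint32_t> generate(ToyContext & ctx, int n) {
    std::vector<uint32_t> out;
    for (int i = 0; i < n; ++i) {
        out.push_back(ctx.next_token());
    }
    return out;
}

// Returns true when a fresh context resumed from the snapshot reproduces
// the continuation produced by the original context.
bool resume_matches(int n_prompt, int n_predict) {
    ToyContext ctx;
    generate(ctx, n_prompt);                           // "process the prompt"
    const std::vector<uint8_t> snap = save_state(ctx); // checkpoint
    const std::vector<uint32_t> expected = generate(ctx, n_predict);

    ToyContext ctx2;       // new context
    ctx2.state = 0;        // deliberately different before restoring
    load_state(ctx2, snap);
    return generate(ctx2, n_predict) == expected;
}
```

Calling resume_matches(4, 16) mirrors the shape of the example's check: process a prompt, snapshot, continue, then restore into a second context and compare the two continuations.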

Usage

Use this as a reference implementation for state serialization. State serialization matters for applications that need to checkpoint and resume inference (e.g., long-running sessions, server state persistence, or speculative decoding rollback).

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: examples/save-load-state/save-load-state.cpp
Lines: 1-258

Signature

int main(int argc, char ** argv);

Import

#include "arg.h"
#include "common.h"
#include "llama.h"

#include <vector>
#include <cstdio>

I/O Contract

Inputs

Name	Type	Required	Description
-m	string	Yes	Path to the GGUF model file
-p	string	No	Input prompt (default: "The quick brown fox")
-n	int	No	Number of tokens to generate (default: 16)
--seed	int	No	Random seed for reproducibility (default: 1234)

Outputs

Name	Type	Description
dump_state.bin	file	Serialized full context state (KV cache, logits, embeddings)
stdout	text	Generated text from original run and resumed runs, with comparison results
return	int	Exit code: 0 on success (deterministic match), 1 on failure (mismatch)

Usage Examples

# Run save-load-state example
./build/bin/llama-save-load-state \
  -m model.gguf \
  -p "The quick brown fox" \
  -n 16 \
  --seed 1234

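// Core state save/load pattern from the example
// Save full context state into a buffer sized by llama_state_get_size()
std::vector<uint8_t> state_mem(llama_state_get_size(ctx));
const size_t written = llama_state_get_data(ctx, state_mem.data(), state_mem.size());

// Load state into a new context; the return value is the number of bytes read
const size_t n_read = llama_state_set_data(ctx2, state_mem.data(), state_mem.size());
if (n_read != written) {
    fprintf(stderr, "failed to restore state\n");
}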
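Persisting the byte buffer to a file such as dump_state.bin is plain binary I/O. The following is a minimal sketch using only the C standard I/O functions; write_blob and read_blob are hypothetical helper names, not part of llama.cpp, and the error handling is an assumption about what a caller would want.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Write a state buffer to disk; returns true only if every byte was written
// and the file closed cleanly.
bool write_blob(const char * path, const std::vector<uint8_t> & data) {
    assert(path != nullptr);
    FILE * fp = std::fopen(path, "wb");
    if (!fp) {
        return false;
    }
    const size_t n = std::fwrite(data.data(), 1, data.size(), fp);
    const bool closed = (std::fclose(fp) == 0);
    return closed && n == data.size();
}

// Read a whole file back into a buffer; returns false on any I/O error.
bool read_blob(const char * path, std::vector<uint8_t> & out) {
    assert(path != nullptr);
    FILE * fp = std::fopen(path, "rb");
    if (!fp) {
        return false;
    }
    // Determine the file size by seeking to the end and back.
    std::fseek(fp, 0, SEEK_END);
    const long size = std::ftell(fp);
    std::fseek(fp, 0, SEEK_SET);
    if (size < 0) {
        std::fclose(fp);
        return false;
    }
    out.resize(static_cast<size_t>(size));
    const size_t n = std::fread(out.data(), 1, out.size(), fp);
    std::fclose(fp);
    return n == out.size();
}
```

Under these assumptions, a caller would pass the saved buffer to write_blob, later fill a fresh buffer with read_blob, and hand that buffer to the state-loading call in the new context.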

Related Pages

Principle:Ggml_org_Llama_cpp_State_Management
