Implementation: ggml-org/llama.cpp Ngram Cache Header
| Knowledge Sources | |
|---|---|
| Domains | Speculative Decoding, Caching |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares data structures and functions for n-gram based speculative decoding caches.
Description
Defines the `common_ngram` struct (a fixed-size token array up to LLAMA_NGRAM_MAX=4), custom hash functions using Fibonacci hashing, and type aliases for the cache structure: `common_ngram_cache_part` maps tokens to occurrence counts, and `common_ngram_cache` maps n-grams to their cache parts. Declares functions for cache update, draft generation, save/load, and merge operations.
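The hashing scheme mentioned above can be sketched as follows. Fibonacci hashing multiplies the key by 2^64 divided by the golden ratio, which spreads the small, dense token IDs typical of vocabularies across the hash space; the per-token hashes are then combined over the n-gram. The multiplier constant is the standard 64-bit Fibonacci hashing constant, and the XOR combiner is an illustrative assumption about how the per-token hashes are folded together, not a verbatim copy of the header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fibonacci hashing: multiply by 2^64 / golden ratio so that small,
// consecutive token IDs land far apart in the hash space.
constexpr uint64_t FIB_MULT = 11400714819323198485llu;

inline size_t token_hash(int32_t token) {
    return (size_t) ((uint64_t) token * FIB_MULT);
}

// Combine per-token hashes for an n-gram (illustrative combiner, assumed):
inline size_t ngram_hash(const int32_t * tokens, int n) {
    size_t h = token_hash(tokens[0]);
    for (int i = 1; i < n; ++i) {
        h ^= token_hash(tokens[i]);
    }
    return h;
}
```

Note that an XOR combiner is order-insensitive, so permutations of the same tokens collide; the `operator==` on `common_ngram` resolves such collisions inside the unordered map.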
Usage
Use this header when implementing n-gram based speculative decoding. Build and maintain n-gram caches from token histories, generate draft token predictions based on observed n-gram patterns, and persist caches to disk for reuse across sessions.
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: common/ngram-cache.h
- Lines: 1-101
Signature
struct common_ngram {
    llama_token tokens[LLAMA_NGRAM_MAX];

    common_ngram();
    common_ngram(const llama_token * input, const int ngram_size);

    bool operator==(const common_ngram & other) const;
};

struct common_token_hash_function {
    size_t operator()(const llama_token token) const;
};

struct common_ngram_hash_function {
    size_t operator()(const common_ngram & ngram) const;
};

typedef std::unordered_map<llama_token, int32_t> common_ngram_cache_part;
typedef std::unordered_map<common_ngram, common_ngram_cache_part, common_ngram_hash_function> common_ngram_cache;

void common_ngram_cache_update(
    common_ngram_cache & ngram_cache, int ngram_min, int ngram_max,
    std::vector<llama_token> & inp_data, int nnew, bool print_progress);

void common_ngram_cache_draft(
    std::vector<llama_token> & inp, std::vector<llama_token> & draft, int n_draft,
    int ngram_min, int ngram_max,
    common_ngram_cache & nc_context, common_ngram_cache & nc_dynamic, common_ngram_cache & nc_static);

void common_ngram_cache_save(common_ngram_cache & ngram_cache, const std::string & filename);

common_ngram_cache common_ngram_cache_load(const std::string & filename);

void common_ngram_cache_merge(common_ngram_cache & ngram_cache_target, common_ngram_cache & ngram_cache_add);
Import
#include "ngram-cache.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ngram_cache | common_ngram_cache & | Yes | The n-gram cache to update or query |
| ngram_min | int | Yes | Minimum n-gram size to extract (>= LLAMA_NGRAM_MIN) |
| ngram_max | int | Yes | Maximum n-gram size to extract (<= LLAMA_NGRAM_MAX) |
| inp_data | std::vector<llama_token> & | Yes | Token sequence to update from or draft against |
| nnew | int | Yes | Number of new tokens appended since last update |
| n_draft | int | Yes | Maximum number of draft tokens to generate |
| filename | const std::string & | Yes | File path for cache save/load operations |
Outputs
| Name | Type | Description |
|---|---|---|
| draft | std::vector<llama_token> & | Draft token sequence generated from cache lookup |
| loaded cache | common_ngram_cache | Cache loaded from disk via common_ngram_cache_load |
| (side effect) | void | Cache is updated in-place by update/merge operations |
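The core step behind the `draft` output can be illustrated in isolation: once the n-gram ending the current context is looked up, the most frequently observed successor token in its `common_ngram_cache_part` is a natural draft candidate. The real `common_ngram_cache_draft` also weighs the context, dynamic, and static caches against each other, so the selection below is a simplified, assumed sketch of the single-cache lookup only.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using token_t = int32_t;
using part_t  = std::unordered_map<token_t, int32_t>;  // successor token -> count

// Return the most frequently observed successor token, or `fallback`
// when the cache part is empty (i.e. no draft can be proposed).
token_t draft_most_frequent(const part_t & part, token_t fallback) {
    token_t best       = fallback;
    int32_t best_count = 0;
    for (const auto & kv : part) {
        if (kv.second > best_count) {
            best       = kv.first;
            best_count = kv.second;
        }
    }
    return best;
}
```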
Usage Examples
#include "ngram-cache.h"

// Create and update an n-gram cache from the full token history
common_ngram_cache cache;
std::vector<llama_token> tokens = {/* token history */};
common_ngram_cache_update(cache, LLAMA_NGRAM_MIN, LLAMA_NGRAM_MAX, tokens, (int) tokens.size(), false);

// Generate draft tokens; the draft vector is seeded with the last sampled token
std::vector<llama_token> draft = {tokens.back()};
common_ngram_cache nc_dynamic; // cache persisted and updated across sessions
common_ngram_cache nc_static;  // cache built once from a fixed corpus (may be empty)
common_ngram_cache_draft(tokens, draft, 8, LLAMA_NGRAM_MIN, LLAMA_NGRAM_MAX,
                         cache, nc_dynamic, nc_static);

// Save and reload a cache
common_ngram_cache_save(cache, "ngram_cache.bin");
common_ngram_cache loaded = common_ngram_cache_load("ngram_cache.bin");
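A reloaded cache is typically combined with a session's local cache via `common_ngram_cache_merge`. Semantically, merging amounts to adding the source cache's occurrence counts into the target, creating entries that do not yet exist. The sketch below shows that accumulation with plain vector keys for brevity (the header keys on hashed `common_ngram` structs); it is an illustrative assumption, not the library's code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

using token_t = int32_t;
using ngram_t = std::vector<token_t>;
using cache_t = std::map<ngram_t, std::unordered_map<token_t, int32_t>>;

// Add every count in `add` into `target`, creating missing entries.
void ngram_cache_merge_sketch(cache_t & target, const cache_t & add) {
    for (const auto & entry : add) {
        auto & part = target[entry.first];       // created if absent
        for (const auto & tc : entry.second) {
            part[tc.first] += tc.second;         // sum counts per successor token
        }
    }
}
```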