Implementation:Ggml org Llama cpp Ngram Cache Header

From Leeroopedia
Knowledge Sources
Domains Speculative_Decoding, Caching
Last Updated 2026-02-15 00:00 GMT

Overview

Declares data structures and functions for n-gram based speculative decoding caches.

Description

Defines the `common_ngram` struct (a fixed-size token array up to LLAMA_NGRAM_MAX=4), custom hash functions using Fibonacci hashing, and type aliases for the cache structure: `common_ngram_cache_part` maps tokens to occurrence counts, and `common_ngram_cache` maps n-grams to their cache parts. Declares functions for cache update, draft generation, save/load, and merge operations.
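The Fibonacci hashing mentioned above can be illustrated with a minimal standalone functor. This is a sketch, not a verbatim copy of the header: it assumes `llama_token` is a 32-bit integer and uses the classic 64-bit Fibonacci multiplier (2^64 / φ), which scrambles the bits of small sequential keys well.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: llama_token is a 32-bit integer, as in llama.cpp.
using llama_token = std::int32_t;

// Fibonacci hashing sketch: multiply by 2^64 / phi (golden ratio).
// Sequential token IDs map to well-spread 64-bit hash values.
struct token_hash {
    std::size_t operator()(const llama_token token) const {
        return static_cast<std::size_t>(token) * 11400714819323198485llu;
    }
};
```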

Usage

Use this header when implementing n-gram based speculative decoding. Build and maintain n-gram caches from token histories, generate draft token predictions based on observed n-gram patterns, and persist caches to disk for reuse across sessions.

Code Reference

Source Location

Signature

struct common_ngram {
    llama_token tokens[LLAMA_NGRAM_MAX];
    common_ngram();
    common_ngram(const llama_token * input, const int ngram_size);
    bool operator==(const common_ngram & other) const;
};

struct common_token_hash_function {
    size_t operator()(const llama_token token) const;
};

struct common_ngram_hash_function {
    size_t operator()(const common_ngram & ngram) const;
};

typedef std::unordered_map<llama_token, int32_t> common_ngram_cache_part;
typedef std::unordered_map<common_ngram, common_ngram_cache_part, common_ngram_hash_function> common_ngram_cache;

void common_ngram_cache_update(
    common_ngram_cache & ngram_cache, int ngram_min, int ngram_max,
    std::vector<llama_token> & inp_data, int nnew, bool print_progress);

void common_ngram_cache_draft(
    std::vector<llama_token> & inp, std::vector<llama_token> & draft, int n_draft,
    int ngram_min, int ngram_max,
    common_ngram_cache & nc_context, common_ngram_cache & nc_dynamic, common_ngram_cache & nc_static);

void common_ngram_cache_save(common_ngram_cache & ngram_cache, const std::string & filename);
common_ngram_cache common_ngram_cache_load(const std::string & filename);
void common_ngram_cache_merge(common_ngram_cache & ngram_cache_target, common_ngram_cache & ngram_cache_add);

Import

#include "ngram-cache.h"

I/O Contract

Inputs

Name | Type | Required | Description
ngram_cache | common_ngram_cache & | Yes | The n-gram cache to update or query
ngram_min | int | Yes | Minimum n-gram size to extract (>= LLAMA_NGRAM_MIN)
ngram_max | int | Yes | Maximum n-gram size to extract (<= LLAMA_NGRAM_MAX)
inp_data | std::vector<llama_token> & | Yes | Token sequence to update from or draft against
nnew | int | Yes | Number of new tokens appended since the last update
n_draft | int | Yes | Maximum number of draft tokens to generate
filename | const std::string & | Yes | File path for cache save/load operations

Outputs

Name | Type | Description
draft | std::vector<llama_token> & | Draft token sequence generated from cache lookup
loaded cache | common_ngram_cache | Cache loaded from disk via common_ngram_cache_load
(side effect) | void | Cache is updated in place by update/merge operations
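The draft output is built by ranking the successor counts stored in a cache part. A toy version of that selection step (an assumption about the heuristic: the real drafting logic cross-checks the context, dynamic, and static caches rather than greedily taking one maximum):

```cpp
#include <cstdint>
#include <unordered_map>

// Assumption: llama_token is a 32-bit integer.
using llama_token = std::int32_t;
using part_t = std::unordered_map<llama_token, std::int32_t>;

// Return the successor token with the highest observed count,
// or -1 if the part is empty.
llama_token most_frequent(const part_t & part) {
    llama_token best = -1;
    std::int32_t best_count = 0;
    for (const auto & [tok, count] : part) {
        if (count > best_count) {
            best_count = count;
            best = tok;
        }
    }
    return best;
}
```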

Usage Examples

#include "ngram-cache.h"

// Create and update an n-gram cache
common_ngram_cache cache;
std::vector<llama_token> tokens = {/* token history */};
common_ngram_cache_update(cache, LLAMA_NGRAM_MIN, LLAMA_NGRAM_MAX, tokens, tokens.size(), false);

// Generate draft tokens from the cache; the draft is seeded with the
// last sampled token
std::vector<llama_token> draft = {tokens.back()};
common_ngram_cache nc_dynamic;  // empty dynamic cache for this example
common_ngram_cache nc_static;   // empty static cache for this example
common_ngram_cache_draft(tokens, draft, 8, LLAMA_NGRAM_MIN, LLAMA_NGRAM_MAX,
                         cache, nc_dynamic, nc_static);

// Save and reload cache
common_ngram_cache_save(cache, "ngram_cache.bin");
common_ngram_cache loaded = common_ngram_cache_load("ngram_cache.bin");
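`common_ngram_cache_merge` is declared above but not exemplified. Its accumulate-counts semantics can be sketched self-contained; this is an assumption about the merge behavior (summing occurrence counts per n-gram and successor token), and a `std::string` key stands in for `common_ngram` so the sketch compiles without the real header:

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// std::string stands in for common_ngram in this sketch.
using part_t  = std::unordered_map<std::int32_t, std::int32_t>;
using cache_t = std::unordered_map<std::string, part_t>;

// Merge `add` into `target` by summing occurrence counts in place,
// mirroring the assumed semantics of common_ngram_cache_merge.
void merge_into(cache_t & target, const cache_t & add) {
    for (const auto & [ngram, part] : add) {
        for (const auto & [tok, count] : part) {
            target[ngram][tok] += count;
        }
    }
}
```

Merging like this lets a session cache loaded from disk be folded into a long-lived dynamic cache rather than replacing it.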
