Implementation: ggml-org/llama.cpp Ngram Cache Header
| Knowledge Sources | |
|---|---|
| Domains | Speculative Decoding, Caching |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Declares data structures and functions for n-gram based speculative decoding caches.
Description
Defines the `common_ngram` struct (a fixed-size token array up to LLAMA_NGRAM_MAX=4), custom hash functions using Fibonacci hashing, and type aliases for the cache structure: `common_ngram_cache_part` maps tokens to occurrence counts, and `common_ngram_cache` maps n-grams to their cache parts. Declares functions for cache update, draft generation, save/load, and merge operations.
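The hashing scheme mentioned above can be sketched as follows. Fibonacci hashing multiplies the key by 2^64 divided by the golden ratio, which spreads the small, dense token IDs typical of vocabularies across the hash space; the per-token hashes are then combined over the n-gram. The multiplier constant is the standard 64-bit Fibonacci hashing constant, and the XOR combiner is an illustrative assumption about how the per-token hashes are folded together, not a verbatim copy of the header.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fibonacci hashing: multiply by 2^64 / golden ratio so that small,
// consecutive token IDs land far apart in the hash space.
constexpr uint64_t FIB_MULT = 11400714819323198485llu;

inline size_t token_hash(int32_t token) {
    return (size_t) ((uint64_t) token * FIB_MULT);
}

// Combine per-token hashes for an n-gram (illustrative combiner, assumed):
inline size_t ngram_hash(const int32_t * tokens, int n) {
    size_t h = token_hash(tokens[0]);
    for (int i = 1; i < n; ++i) {
        h ^= token_hash(tokens[i]);
    }
    return h;
}
```

Note that an XOR combiner is order-insensitive, so permutations of the same tokens collide; the `operator==` on `common_ngram` resolves such collisions inside the unordered map.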
Usage
Use this header when implementing n-gram based speculative decoding. Build and maintain n-gram caches from token histories, generate draft token predictions based on observed n-gram patterns, and persist caches to disk for reuse across sessions.
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: common/ngram-cache.h
- Lines: 1-101
Signature
struct common_ngram {
    llama_token tokens[LLAMA_NGRAM_MAX];

    common_ngram();
    common_ngram(const llama_token * input, const int ngram_size);

    bool operator==(const common_ngram & other) const;
};

struct common_token_hash_function {
    size_t operator()(const llama_token token) const;
};

struct common_ngram_hash_function {
    size_t operator()(const common_ngram & ngram) const;
};

typedef std::unordered_map<llama_token, int32_t> common_ngram_cache_part;
typedef std::unordered_map<common_ngram, common_ngram_cache_part, common_ngram_hash_function> common_ngram_cache;

void common_ngram_cache_update(
    common_ngram_cache & ngram_cache, int ngram_min, int ngram_max,
    std::vector<llama_token> & inp_data, int nnew, bool print_progress);

void common_ngram_cache_draft(
    std::vector<llama_token> & inp, std::vector<llama_token> & draft, int n_draft,
    int ngram_min, int ngram_max,
    common_ngram_cache & nc_context, common_ngram_cache & nc_dynamic, common_ngram_cache & nc_static);

void common_ngram_cache_save(common_ngram_cache & ngram_cache, const std::string & filename);

common_ngram_cache common_ngram_cache_load(const std::string & filename);

void common_ngram_cache_merge(common_ngram_cache & ngram_cache_target, common_ngram_cache & ngram_cache_add);
Import
#include "ngram-cache.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| ngram_cache | common_ngram_cache & | Yes | The n-gram cache to update or query |
| ngram_min | int | Yes | Minimum n-gram size to extract (>= LLAMA_NGRAM_MIN) |
| ngram_max | int | Yes | Maximum n-gram size to extract (<= LLAMA_NGRAM_MAX) |
| inp_data | std::vector<llama_token> & | Yes | Token sequence to update from or draft against |
| nnew | int | Yes | Number of new tokens appended since last update |
| n_draft | int | Yes | Maximum number of draft tokens to generate |
| filename | const std::string & | Yes | File path for cache save/load operations |
Outputs
| Name | Type | Description |
|---|---|---|
| draft | std::vector<llama_token> & | Draft token sequence generated from cache lookup |
| loaded cache | common_ngram_cache | Cache loaded from disk via common_ngram_cache_load |
| (side effect) | void | Cache is updated in-place by update/merge operations |
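The core step behind the `draft` output can be illustrated in isolation: once the n-gram ending the current context is looked up, the most frequently observed successor token in its `common_ngram_cache_part` is a natural draft candidate. The real `common_ngram_cache_draft` also weighs the context, dynamic, and static caches against each other, so the selection below is a simplified, assumed sketch of the single-cache lookup only.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

using token_t = int32_t;
using part_t  = std::unordered_map<token_t, int32_t>;  // successor token -> count

// Return the most frequently observed successor token, or `fallback`
// when the cache part is empty (i.e. no draft can be proposed).
token_t draft_most_frequent(const part_t & part, token_t fallback) {
    token_t best       = fallback;
    int32_t best_count = 0;
    for (const auto & kv : part) {
        if (kv.second > best_count) {
            best       = kv.first;
            best_count = kv.second;
        }
    }
    return best;
}
```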
Usage Examples
#include "ngram-cache.h"

// Create and update an n-gram cache from the full token history
common_ngram_cache cache;
std::vector<llama_token> tokens = {/* token history */};
common_ngram_cache_update(cache, LLAMA_NGRAM_MIN, LLAMA_NGRAM_MAX, tokens, (int) tokens.size(), false);

// Generate draft tokens; the draft vector is seeded with the last sampled token
std::vector<llama_token> draft = {tokens.back()};
common_ngram_cache nc_dynamic; // cache persisted and updated across sessions
common_ngram_cache nc_static;  // cache built once from a fixed corpus (may be empty)
common_ngram_cache_draft(tokens, draft, 8, LLAMA_NGRAM_MIN, LLAMA_NGRAM_MAX,
                         cache, nc_dynamic, nc_static);

// Save and reload a cache
common_ngram_cache_save(cache, "ngram_cache.bin");
common_ngram_cache loaded = common_ngram_cache_load("ngram_cache.bin");
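A reloaded cache is typically combined with a session's local cache via `common_ngram_cache_merge`. Semantically, merging amounts to adding the source cache's occurrence counts into the target, creating entries that do not yet exist. The sketch below shows that accumulation with plain vector keys for brevity (the header keys on hashed `common_ngram` structs); it is an illustrative assumption, not the library's code.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

using token_t = int32_t;
using ngram_t = std::vector<token_t>;
using cache_t = std::map<ngram_t, std::unordered_map<token_t, int32_t>>;

// Add every count in `add` into `target`, creating missing entries.
void ngram_cache_merge_sketch(cache_t & target, const cache_t & add) {
    for (const auto & entry : add) {
        auto & part = target[entry.first];       // created if absent
        for (const auto & tc : entry.second) {
            part[tc.first] += tc.second;         // sum counts per successor token
        }
    }
}
```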