Implementation:Ggml org Llama cpp Ngram Map
| Knowledge Sources | |
|---|---|
| Domains | Speculative_Decoding, Caching |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements two n-gram based self-speculative decoding algorithms: a simple linear search and a hash-map based approach for faster n-gram lookup in token history.
Description
common_ngram_simple_draft performs a linear backward scan through token history to find an n-gram match, then copies the subsequent m-gram as draft tokens. The common_ngram_map_* family of functions maintains a hash map from n-gram hashes to indices in token history, with statistics about m-gram values following each key. The map uses LCG hashing for n-gram fingerprinting and tracks up to COMMON_NGRAM_MAX_VALUES (4) most frequent following sequences per n-gram, selecting the best draft based on occurrence counts and acceptance history.
Usage
Use this module for self-speculative decoding without a separate draft model, leveraging the model's own token history for pattern-based prediction. The map variant offers better performance on long contexts compared to the simple linear scan.
Code Reference
Source Location
- Repository: Ggml_org_Llama_cpp
- File: common/ngram-map.cpp
- Lines: 1-530
Signature
static uint32_t common_ngram_map_hash(
const llama_tokens & tokens, size_t start, size_t len);
llama_tokens common_ngram_simple_draft(
const common_ngram_simple_config & config,
const llama_tokens & tokens,
llama_token sampled);
// Hash-map based n-gram functions
void common_ngram_map_update(/* ... */);
llama_tokens common_ngram_map_draft(/* ... */);
Import
#include "common.h"
#include "log.h"
#include "ngram-map.h"
#include <cinttypes>
#include <cstdint>
#include <sstream>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| config | common_ngram_simple_config | Yes | Configuration specifying n-gram size and m-gram (draft) size |
| tokens | llama_tokens (vector<llama_token>) | Yes | Full token history to search for matching patterns |
| sampled | llama_token | Yes | The last sampled token, appended to form the search pattern |
| start | size_t | Yes | Start offset in token array for hash computation |
| len | size_t | Yes | Length of n-gram for hash computation |
Outputs
| Name | Type | Description |
|---|---|---|
| draft_tokens | llama_tokens (vector<llama_token>) | Vector of predicted draft tokens based on n-gram match, empty if no match found |
| hash | uint32_t | LCG hash value of an n-gram segment |
Usage Examples
#include "ngram-map.h"
// Simple linear search speculative decoding
common_ngram_simple_config config;
config.size_ngram = 4; // search for 4-gram matches
config.size_mgram = 8; // use up to 8 following tokens as draft
llama_tokens history = /* ... token history ... */;
llama_token last_token = /* ... last sampled token ... */;
llama_tokens draft = common_ngram_simple_draft(config, history, last_token);
if (!draft.empty()) {
// Use draft tokens for speculative verification
}