Implementation:Ggml org Llama cpp Ngram Map

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Speculative_Decoding, Caching
Last Updated	2026-02-15 00:00 GMT

Overview

Implements two n-gram based self-speculative decoding algorithms: a simple linear search and a hash-map based approach for faster n-gram lookup in token history.

Description

common_ngram_simple_draft performs a linear backward scan through token history to find an n-gram match, then copies the subsequent m-gram as draft tokens. The common_ngram_map_* family of functions maintains a hash map from n-gram hashes to indices in token history, with statistics about m-gram values following each key. The map uses LCG hashing for n-gram fingerprinting and tracks up to COMMON_NGRAM_MAX_VALUES (4) most frequent following sequences per n-gram, selecting the best draft based on occurrence counts and acceptance history.

Usage

Use this module for self-speculative decoding without a separate draft model, leveraging the model's own token history for pattern-based prediction. The map variant offers better performance on long contexts compared to the simple linear scan.

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: common/ngram-map.cpp
Lines: 1-530

Signature

static uint32_t common_ngram_map_hash(
    const llama_tokens & tokens, size_t start, size_t len);

llama_tokens common_ngram_simple_draft(
    const common_ngram_simple_config & config,
    const llama_tokens & tokens,
    llama_token sampled);

// Hash-map based n-gram functions
void common_ngram_map_update(/* ... */);
llama_tokens common_ngram_map_draft(/* ... */);

Import

#include "common.h"
#include "log.h"
#include "ngram-map.h"
#include <cinttypes>
#include <cstdint>
#include <sstream>

I/O Contract

Inputs

Name	Type	Required	Description
config	common_ngram_simple_config	Yes	Configuration specifying n-gram size and m-gram (draft) size
tokens	llama_tokens (vector<llama_token>)	Yes	Full token history to search for matching patterns
sampled	llama_token	Yes	The last sampled token, appended to form the search pattern
start	size_t	Yes	Start offset in token array for hash computation
len	size_t	Yes	Length of n-gram for hash computation

Outputs

Name	Type	Description
draft_tokens	llama_tokens (vector<llama_token>)	Vector of predicted draft tokens based on n-gram match, empty if no match found
hash	uint32_t	LCG hash value of an n-gram segment

Usage Examples

#include "ngram-map.h"

// Simple linear search speculative decoding
common_ngram_simple_config config;
config.size_ngram = 4;  // search for 4-gram matches
config.size_mgram = 8;  // use up to 8 following tokens as draft

llama_tokens history = /* ... token history ... */;
llama_token last_token = /* ... last sampled token ... */;

llama_tokens draft = common_ngram_simple_draft(config, history, last_token);
if (!draft.empty()) {
    // Use draft tokens for speculative verification
}

Related Pages

Principle:Ggml_org_Llama_cpp_SpeculativeDecoding

Page Connections

Double-click a node to navigate. Hold to expand connections.

Principle

Implementation

Heuristic

Environment