
Implementation: ggml-org/llama.cpp Lookahead Decoding

From Leeroopedia
Knowledge Sources
Domains: Speculative_Decoding
Last Updated: 2026-02-15 00:00 GMT

Overview

Implements lookahead decoding, a speculative decoding technique that uses n-gram patterns from the generation history to predict and verify multiple tokens in parallel.

Description

Uses a Jacobi iteration approach with a lookahead window (W=15), n-gram size (N=5), and verification count (G=15). Maintains an `ngram_container` that stores per-token ring buffers of observed n-gram continuations. During generation, it fills a batch with both the lookahead window tokens and candidate n-gram verification sequences, decodes them in parallel using W+G+1 sequences, then verifies which candidates match the model's actual predictions. Accepted tokens extend the output while updating the n-gram cache for future predictions.

Usage

Use this example to accelerate autoregressive generation without a separate draft model, by exploiting repetitive patterns in the model's own output history. Requires a unified KV cache so the coupled sequences can share context.

Code Reference

Source Location

Signature

struct ngram_data {
    bool active = false;              // whether this candidate is in use this step
    llama_seq_id seq_id = -1;         // sequence assigned to this candidate in the batch
    std::vector<int> i_batch;         // batch indices of the drafted tokens
    std::vector<llama_token> tokens;  // the drafted tokens to verify
};

struct ngram_container {
    ngram_container(int n_vocab, int N, int G);
    int n_total = 0;                  // total n-grams observed so far
    std::vector<int> cnt;             // per-token count of stored n-grams (saturates at G)
    std::vector<int> head;            // per-token ring-buffer head index
    std::vector<llama_token> tokens;  // [n_vocab][G][N - 1]
};

int main(int argc, char ** argv);
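The `[n_vocab][G][N - 1]` comment implies a flat ring-buffer layout: for each key token there are up to G stored continuations of N-1 tokens each, with `head` marking the next slot to overwrite and `cnt` saturating at G. The helper below is a hypothetical reconstruction of that indexing scheme, not the exact code from lookahead.cpp.

```cpp
#include <cassert>
#include <vector>

// Simplified ring-buffer cache keyed by token id, mirroring the layout
// suggested by tokens[n_vocab][G][N - 1]. Hypothetical helper, not the
// exact implementation in lookahead.cpp.
struct ngram_ring {
    int N, G;
    std::vector<int> cnt;    // continuations stored per key (<= G)
    std::vector<int> head;   // next ring slot to overwrite per key
    std::vector<int> tokens; // flat [n_vocab][G][N - 1]

    ngram_ring(int n_vocab, int N, int G)
        : N(N), G(G), cnt(n_vocab), head(n_vocab),
          tokens(n_vocab * G * (N - 1)) {}

    // Flat index for key token `key`, ring slot `slot`, position `j`.
    int idx(int key, int slot, int j) const {
        return (key * G + slot) * (N - 1) + j;
    }

    // Record an (N-1)-token continuation observed after `key`.
    void put(int key, const std::vector<int>& cont) {
        int slot = head[key];
        for (int j = 0; j < N - 1; ++j) tokens[idx(key, slot, j)] = cont[j];
        head[key] = (head[key] + 1) % G; // ring: oldest entry overwritten next
        if (cnt[key] < G) cnt[key]++;    // saturate at G entries
    }
};
```

With this layout, once G continuations have been stored for a token, each new observation overwrites the oldest one, keeping the cache bounded at `n_vocab * G * (N - 1)` tokens.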

Import

#include "arg.h"
#include "common.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>
#include <algorithm>

I/O Contract

Inputs

Name          | Type        | Required | Description
------------- | ----------- | -------- | -----------------------------------------------------
params.prompt | std::string | Yes      | Input prompt text to begin generation from
params.model  | std::string | Yes      | Path to the GGUF model file
W             | int         | No       | Lookahead window size (default: 15)
N             | int         | No       | N-gram size for pattern matching (default: 5)
G             | int         | No       | Maximum number of verification n-grams per step (default: 15)

Outputs

Name           | Type   | Description
-------------- | ------ | ------------------------------------------------------------------
generated_text | stdout | Generated text tokens printed to standard output
statistics     | stdout | Performance statistics including n_predict, n_accept, and token generation speed
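The statistics can be interpreted with a little arithmetic, assuming `n_accept` counts the extra drafted tokens accepted on top of the one token each step always yields (an assumption about the counter's meaning, not confirmed by the source; the numbers in the test are made up for illustration):

```cpp
#include <cassert>

// Fraction of generated tokens that came from accepted drafts rather
// than single-token decoding. Assumes 0 < n_accept < n_predict.
static double acceptance_rate(int n_accept, int n_predict) {
    return (double) n_accept / n_predict;
}

// Each decode step yields 1 guaranteed token plus any accepted draft
// tokens, so steps = n_predict - n_accept, and average tokens per step
// approximates the speedup over plain autoregressive decoding (ignoring
// the extra per-step cost of decoding W+G+1 sequences).
static double tokens_per_step(int n_accept, int n_predict) {
    return (double) n_predict / (n_predict - n_accept);
}
```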

Usage Examples

# Run lookahead decoding with a model
./llama-lookahead -m model.gguf -p "Once upon a time" -n 256

# The program automatically configures:
# - W+G+1 = 31 parallel sequences
# - Unified KV cache for coupled sequence handling
# - N-gram pattern collection from generation history
