
Implementation: ggml-org/llama.cpp Lookahead Decoding

From Leeroopedia
Knowledge Sources
Domains: Speculative_Decoding
Last Updated: 2026-02-15 00:00 GMT

Overview

Implements lookahead decoding, a speculative decoding technique that uses n-gram patterns from the generation history to predict and verify multiple tokens in parallel.

Description

Uses a Jacobi iteration approach with a lookahead window (W=15), n-gram size (N=5), and verification count (G=15). Maintains an `ngram_container` that stores per-token ring buffers of observed n-gram continuations. During generation, it fills a batch with both the lookahead window tokens and candidate n-gram verification sequences, decodes them in parallel using W+G+1 sequences, then verifies which candidates match the model's actual predictions. Accepted tokens extend the output while updating the n-gram cache for future predictions.

Usage

Use this example to accelerate autoregressive generation without a separate draft model, by exploiting repetitive patterns in the model's own output history. Requires a unified KV cache so the coupled sequences can share context.

Code Reference

Source Location

Signature

struct ngram_data {
    bool active = false;              // whether this candidate is in use this step
    llama_seq_id seq_id = -1;         // sequence assigned to this candidate in the batch
    std::vector<int> i_batch;         // batch indices of the drafted tokens
    std::vector<llama_token> tokens;  // the drafted tokens to verify
};

struct ngram_container {
    ngram_container(int n_vocab, int N, int G);
    int n_total = 0;                  // total n-grams observed so far
    std::vector<int> cnt;             // per-token count of stored n-grams (saturates at G)
    std::vector<int> head;            // per-token ring-buffer head index
    std::vector<llama_token> tokens;  // [n_vocab][G][N - 1]
};

int main(int argc, char ** argv);
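The `[n_vocab][G][N - 1]` comment implies a flat ring-buffer layout: for each key token there are up to G stored continuations of N-1 tokens each, with `head` marking the next slot to overwrite and `cnt` saturating at G. The helper below is a hypothetical reconstruction of that indexing scheme, not the exact code from lookahead.cpp.

```cpp
#include <cassert>
#include <vector>

// Simplified ring-buffer cache keyed by token id, mirroring the layout
// suggested by tokens[n_vocab][G][N - 1]. Hypothetical helper, not the
// exact implementation in lookahead.cpp.
struct ngram_ring {
    int N, G;
    std::vector<int> cnt;    // continuations stored per key (<= G)
    std::vector<int> head;   // next ring slot to overwrite per key
    std::vector<int> tokens; // flat [n_vocab][G][N - 1]

    ngram_ring(int n_vocab, int N, int G)
        : N(N), G(G), cnt(n_vocab), head(n_vocab),
          tokens(n_vocab * G * (N - 1)) {}

    // Flat index for key token `key`, ring slot `slot`, position `j`.
    int idx(int key, int slot, int j) const {
        return (key * G + slot) * (N - 1) + j;
    }

    // Record an (N-1)-token continuation observed after `key`.
    void put(int key, const std::vector<int>& cont) {
        int slot = head[key];
        for (int j = 0; j < N - 1; ++j) tokens[idx(key, slot, j)] = cont[j];
        head[key] = (head[key] + 1) % G; // ring: oldest entry overwritten next
        if (cnt[key] < G) cnt[key]++;    // saturate at G entries
    }
};
```

With this layout, once G continuations have been stored for a token, each new observation overwrites the oldest one, keeping the cache bounded at `n_vocab * G * (N - 1)` tokens.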

Import

#include "arg.h"
#include "common.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"
#include <cstdio>
#include <string>
#include <vector>
#include <algorithm>

I/O Contract

Inputs

Name          | Type        | Required | Description
------------- | ----------- | -------- | -----------------------------------------------------
params.prompt | std::string | Yes      | Input prompt text to begin generation from
params.model  | std::string | Yes      | Path to the GGUF model file
W             | int         | No       | Lookahead window size (default: 15)
N             | int         | No       | N-gram size for pattern matching (default: 5)
G             | int         | No       | Maximum number of verification n-grams per step (default: 15)

Outputs

Name           | Type   | Description
-------------- | ------ | ------------------------------------------------------------------
generated_text | stdout | Generated text tokens printed to standard output
statistics     | stdout | Performance statistics including n_predict, n_accept, and token generation speed
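The statistics can be interpreted with a little arithmetic, assuming `n_accept` counts the extra drafted tokens accepted on top of the one token each step always yields (an assumption about the counter's meaning, not confirmed by the source; the numbers in the test are made up for illustration):

```cpp
#include <cassert>

// Fraction of generated tokens that came from accepted drafts rather
// than single-token decoding. Assumes 0 < n_accept < n_predict.
static double acceptance_rate(int n_accept, int n_predict) {
    return (double) n_accept / n_predict;
}

// Each decode step yields 1 guaranteed token plus any accepted draft
// tokens, so steps = n_predict - n_accept, and average tokens per step
// approximates the speedup over plain autoregressive decoding (ignoring
// the extra per-step cost of decoding W+G+1 sequences).
static double tokens_per_step(int n_accept, int n_predict) {
    return (double) n_predict / (n_predict - n_accept);
}
```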

Usage Examples

# Run lookahead decoding with a model
./llama-lookahead -m model.gguf -p "Once upon a time" -n 256

# The program automatically configures:
# - W+G+1 = 31 parallel sequences
# - Unified KV cache for coupled sequence handling
# - N-gram pattern collection from generation history
