Implementation: ggml-org/llama.cpp Lookahead Decoding
| Knowledge Sources | |
|---|---|
| Domains | Speculative_Decoding |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
Implements lookahead decoding, a speculative decoding technique that uses n-gram patterns from the generation history to predict and verify multiple tokens in parallel.
Description
Uses a Jacobi-iteration-style approach with a lookahead window (W=15), n-gram size (N=5), and verification count (G=15). Maintains an `ngram_container` that stores, for each head token, a ring buffer of observed n-gram continuations. During generation, it fills a single batch with both the lookahead-window tokens and candidate n-gram verification sequences, decodes them in parallel across W+G+1 sequences, then checks which candidates match the model's actual predictions. Accepted tokens extend the output, and the n-gram cache is updated from the new history to improve future predictions.
Usage
Use this example when you want to accelerate autoregressive generation without requiring a separate draft model, by exploiting repetitive patterns in the model's own output history. Requires a unified KV cache for coupled sequence handling.
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: examples/lookahead/lookahead.cpp
- Lines: 1-480
Signature
```cpp
struct ngram_data {
    bool active = false;
    llama_seq_id seq_id = -1;
    std::vector<int> i_batch;
    std::vector<llama_token> tokens;
};

struct ngram_container {
    ngram_container(int n_vocab, int N, int G);

    int n_total = 0;

    std::vector<int> cnt;
    std::vector<int> head;
    std::vector<llama_token> tokens; // [n_vocab][G][N - 1]
};

int main(int argc, char ** argv);
```
Import
```cpp
#include "arg.h"
#include "common.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"

#include <cstdio>
#include <string>
#include <vector>
#include <algorithm>
```
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| params.prompt | std::string | Yes | Input prompt text to begin generation from |
| params.model | std::string | Yes | Path to the GGUF model file |
| W | int | No | Lookahead window size (default: 15) |
| N | int | No | N-gram size for pattern matching (default: 5) |
| G | int | No | Maximum number of verification n-grams per step (default: 15) |
Outputs
| Name | Type | Description |
|---|---|---|
| generated_text | stdout | Generated text tokens printed to standard output |
| statistics | stdout | Performance statistics including n_predict, n_accept, token generation speed |
Usage Examples
```shell
# Run lookahead decoding with a model
./llama-lookahead -m model.gguf -p "Once upon a time" -n 256

# The program automatically configures:
# - W+G+1 = 31 parallel sequences
# - a unified KV cache for coupled sequence handling
# - n-gram pattern collection from the generation history
```