Implementation: ggml-org/llama.cpp Lookup Decoding
| Knowledge Sources | |
|---|---|
| Domains | Speculative_Decoding |
| Last Updated | 2026-02-15 00:00 GMT |
Overview
The main prompt-lookup decoding example: it uses n-gram caches to draft tokens speculatively and verify them against the model, accelerating text generation.
Description
Loads static and dynamic n-gram caches and maintains a third, context-based cache. At each generation step it updates the context cache with the tokens generated so far, drafts up to `n_draft` candidate tokens by looking up n-gram matches across all three caches, and decodes the draft batch through the model in a single pass. Drafted tokens are then verified against the model's own predictions: matching tokens are accepted, and on the first mismatch the example falls back to normal sampling. The dynamic cache is updated with newly generated tokens and optionally saved on exit.
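The draft-then-verify loop described above can be sketched in self-contained C++. This is a minimal illustration only: token IDs are plain ints, and the names `NGramCache`, `cache_update`, `draft_tokens`, and `count_accepted` are hypothetical stand-ins for llama.cpp's actual n-gram cache API in `ngram-cache.h`.

```cpp
#include <cstddef>
#include <map>
#include <vector>

using Token = int;
// n-gram -> (next token -> observed count). Illustrative structure only;
// the real example uses llama.cpp's common n-gram cache types.
using NGramCache = std::map<std::vector<Token>, std::map<Token, int>>;

// Record every (n-gram, next-token) pair observed in `tokens`.
void cache_update(NGramCache & cache, const std::vector<Token> & tokens, size_t n) {
    for (size_t i = 0; i + n < tokens.size(); ++i) {
        std::vector<Token> key(tokens.begin() + i, tokens.begin() + i + n);
        cache[key][tokens[i + n]]++;
    }
}

// Greedily extend the context with the most frequent continuation of its
// trailing n-gram, up to n_draft tokens; stop on a cache miss.
std::vector<Token> draft_tokens(const NGramCache & cache,
                                std::vector<Token> context,
                                size_t n, size_t n_draft) {
    std::vector<Token> draft;
    while (draft.size() < n_draft && context.size() >= n) {
        std::vector<Token> key(context.end() - n, context.end());
        auto it = cache.find(key);
        if (it == cache.end()) break;
        Token best = -1;
        int best_count = 0;
        for (const auto & [tok, cnt] : it->second) {
            if (cnt > best_count) { best = tok; best_count = cnt; }
        }
        draft.push_back(best);
        context.push_back(best);
    }
    return draft;
}

// Verification: accept the longest prefix of the draft that matches the
// tokens the model itself predicts at each position.
size_t count_accepted(const std::vector<Token> & draft,
                      const std::vector<Token> & model_out) {
    size_t n_accept = 0;
    while (n_accept < draft.size() && n_accept < model_out.size() &&
           draft[n_accept] == model_out[n_accept]) {
        ++n_accept;
    }
    return n_accept;
}
```

Because drafting is a pure table lookup, it costs almost nothing relative to a model forward pass; the speedup comes from verifying several drafted tokens in one batched decode instead of one token per pass.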
Usage
Use this as the primary prompt-lookup speculative decoding example; it demonstrates how n-gram statistics from the prompt and prior generations can accelerate inference without requiring a separate draft model.
Code Reference
Source Location
- Repository: ggml-org/llama.cpp
- File: examples/lookup/lookup.cpp
- Lines: 1-242
Signature
int main(int argc, char ** argv);
Import
#include "arg.h"
#include "ggml.h"
#include "common.h"
#include "ngram-cache.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| -m | string | Yes | Path to the GGUF model file |
| -p | string | Yes | Input prompt for text generation |
| --lookup-cache-static | string | No | Path to a pre-built static n-gram cache file |
| --lookup-cache-dynamic | string | No | Path to a dynamic n-gram cache file (loaded and saved) |
| -n | int | No | Maximum number of tokens to generate |
| --draft | int | No | Maximum number of draft tokens per speculative step |
Outputs
| Name | Type | Description |
|---|---|---|
| stdout | text | Generated text with speculative decoding acceleration |
| dynamic cache | file | Updated dynamic n-gram cache file (if path specified) |
| return | int | Exit code: 0 on success, 1 on failure |
Usage Examples
# Run lookup decoding with both static and dynamic caches
./build/bin/llama-lookup \
-m model.gguf \
-p "The meaning of life is" \
--lookup-cache-static static_cache.bin \
--lookup-cache-dynamic dynamic_cache.bin \
--draft 10 \
-n 200
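A static cache can be built ahead of time from a text corpus. The invocation below is a sketch that assumes the companion `llama-lookup-create` tool from the same `examples/lookup` directory; the file paths are placeholders, and flag names follow llama.cpp's common argument parser, so check `--help` on your build.

```shell
# Build a static n-gram cache from a training text (paths are placeholders).
# The tool tokenizes the file with the model's tokenizer and saves the
# resulting n-gram statistics to the given cache path.
./build/bin/llama-lookup-create \
    -m model.gguf \
    -f training_corpus.txt \
    --lookup-cache-static static_cache.bin
```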