
Implementation: ggml-org/llama.cpp Lookup Decoding

From Leeroopedia
Knowledge Sources
Domains Speculative_Decoding
Last Updated 2026-02-15 00:00 GMT

Overview

The main prompt-lookup decoding example in llama.cpp. It uses n-gram caches to speculatively draft tokens and verify them against the model, accelerating text generation.

Description

Loads static, dynamic, and context-based n-gram caches. At each generation step it builds a context n-gram cache from the tokens generated so far, drafts up to `n_draft` candidate tokens by looking up n-gram matches across all three caches, and decodes the draft batch through the model. Drafted tokens are verified against the model's predictions: matches are accepted, and the first mismatch falls back to normal sampling. The dynamic cache is updated with newly generated tokens and optionally saved on exit.
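The drafting step above can be sketched with a toy n-gram cache. This is a simplified, hypothetical stand-in for the layered caches in llama.cpp's `ngram-cache.h` (all names and structures here are illustrative, not the real API):

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Toy n-gram cache: maps an n-gram key to counts of the token that followed it.
using Token  = int;
using NGram  = std::vector<Token>;
using NGramCache = std::map<NGram, std::map<Token, int>>;

// Record every (n-gram -> next token) pair observed in `tokens`.
void cache_update(NGramCache & cache, const std::vector<Token> & tokens, int n) {
    for (size_t i = 0; i + n < tokens.size(); ++i) {
        NGram key(tokens.begin() + i, tokens.begin() + i + n);
        cache[key][tokens[i + n]]++;
    }
}

// Return the most frequent continuation of an n-gram, or -1 if unseen.
Token cache_lookup(const NGramCache & cache, const NGram & key) {
    auto it = cache.find(key);
    if (it == cache.end()) return -1;
    Token best = -1;
    int   best_count = 0;
    for (const auto & [tok, count] : it->second) {
        if (count > best_count) { best = tok; best_count = count; }
    }
    return best;
}

// Draft up to n_draft tokens by repeatedly matching the trailing n-gram,
// feeding each drafted token back into the context.
std::vector<Token> draft(const NGramCache & cache, std::vector<Token> context,
                         int n, int n_draft) {
    std::vector<Token> out;
    while ((int) out.size() < n_draft) {
        if ((int) context.size() < n) break;      // not enough context yet
        NGram key(context.end() - n, context.end());
        Token t = cache_lookup(cache, key);
        if (t < 0) break;                         // no n-gram match: stop drafting
        out.push_back(t);
        context.push_back(t);
    }
    return out;
}
```

On repetitive text the cache keeps producing matches, so a single model pass can verify several tokens at once; on novel text drafting stops early and generation proceeds one token at a time.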

Usage

Use this as the primary example of speculative decoding via prompt lookup; it demonstrates how n-gram statistics gathered from the prompt and prior generations can accelerate inference without requiring a separate draft model.

Code Reference

Source Location

Signature

int main(int argc, char ** argv);

Import

#include "arg.h"
#include "ggml.h"
#include "common.h"
#include "ngram-cache.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

I/O Contract

Inputs

Name Type Required Description
-m string Yes Path to the GGUF model file
-p string Yes Input prompt for text generation
--lookup-cache-static string No Path to a pre-built static n-gram cache file
--lookup-cache-dynamic string No Path to a dynamic n-gram cache file (loaded and saved)
-n int No Maximum number of tokens to generate
--draft int No Maximum number of draft tokens per speculative step

Outputs

Name Type Description
stdout text Generated text with speculative decoding acceleration
dynamic cache file Updated dynamic n-gram cache file (if path specified)
return int Exit code: 0 on success, 1 on failure
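The accept/reject step described in the overview can be illustrated with a self-contained sketch. In the real example the drafted batch is compared against tokens sampled from the model; here a caller-supplied `sample_next` stub stands in for the model, and all names are illustrative assumptions:

```cpp
#include <functional>
#include <vector>

using Token = int;

// Verify a drafted batch: accept each drafted token that matches what the
// model samples at that position. On the first mismatch, keep the model's
// token instead (normal sampling) and discard the rest of the draft.
std::vector<Token> verify_draft(
        std::vector<Token> context,
        const std::vector<Token> & draft,
        const std::function<Token(const std::vector<Token> &)> & sample_next) {
    std::vector<Token> accepted;
    for (Token t : draft) {
        Token model_tok = sample_next(context);
        if (model_tok != t) {
            accepted.push_back(model_tok);  // fall back to the model's choice
            break;
        }
        accepted.push_back(t);              // drafted token confirmed
        context.push_back(t);
    }
    return accepted;
}
```

Every accepted run of length k yields k tokens for one verification pass, which is where the speedup comes from; a rejected token still makes progress, since the model's own sample is kept.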

Usage Examples

# Run lookup decoding with both static and dynamic caches
./build/bin/llama-lookup \
  -m model.gguf \
  -p "The meaning of life is" \
  --lookup-cache-static static_cache.bin \
  --lookup-cache-dynamic dynamic_cache.bin \
  --draft 10 \
  -n 200
