
Implementation: ggml-org/llama.cpp Lookup Decoding

From Leeroopedia
Knowledge Sources
Domains Speculative_Decoding
Last Updated 2026-02-15 00:00 GMT

Overview

The main prompt-lookup decoding example in llama.cpp. It uses n-gram caches to speculatively draft tokens and verify them against the model, accelerating text generation.

Description

Loads static, dynamic, and context-based n-gram caches. At each generation step it builds a context n-gram cache from the tokens generated so far, drafts up to `n_draft` candidate tokens by looking up n-gram matches across all three caches, and decodes the draft batch through the model. Drafted tokens are verified against the model's predictions: matches are accepted, and the first mismatch falls back to normal sampling. The dynamic cache is updated with newly generated tokens and optionally saved on exit.
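The drafting step above can be sketched with a toy n-gram cache. This is a simplified, hypothetical stand-in for the layered caches in llama.cpp's `ngram-cache.h` (all names and structures here are illustrative, not the real API):

```cpp
#include <cstdio>
#include <map>
#include <vector>

// Toy n-gram cache: maps an n-gram key to counts of the token that followed it.
using Token  = int;
using NGram  = std::vector<Token>;
using NGramCache = std::map<NGram, std::map<Token, int>>;

// Record every (n-gram -> next token) pair observed in `tokens`.
void cache_update(NGramCache & cache, const std::vector<Token> & tokens, int n) {
    for (size_t i = 0; i + n < tokens.size(); ++i) {
        NGram key(tokens.begin() + i, tokens.begin() + i + n);
        cache[key][tokens[i + n]]++;
    }
}

// Return the most frequent continuation of an n-gram, or -1 if unseen.
Token cache_lookup(const NGramCache & cache, const NGram & key) {
    auto it = cache.find(key);
    if (it == cache.end()) return -1;
    Token best = -1;
    int   best_count = 0;
    for (const auto & [tok, count] : it->second) {
        if (count > best_count) { best = tok; best_count = count; }
    }
    return best;
}

// Draft up to n_draft tokens by repeatedly matching the trailing n-gram,
// feeding each drafted token back into the context.
std::vector<Token> draft(const NGramCache & cache, std::vector<Token> context,
                         int n, int n_draft) {
    std::vector<Token> out;
    while ((int) out.size() < n_draft) {
        if ((int) context.size() < n) break;      // not enough context yet
        NGram key(context.end() - n, context.end());
        Token t = cache_lookup(cache, key);
        if (t < 0) break;                         // no n-gram match: stop drafting
        out.push_back(t);
        context.push_back(t);
    }
    return out;
}
```

On repetitive text the cache keeps producing matches, so a single model pass can verify several tokens at once; on novel text drafting stops early and generation proceeds one token at a time.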

Usage

Use this as the primary example of speculative decoding via prompt lookup; it demonstrates how n-gram statistics gathered from the prompt and prior generations can accelerate inference without requiring a separate draft model.

Code Reference

Source Location

Signature

int main(int argc, char ** argv);

Import

#include "arg.h"
#include "ggml.h"
#include "common.h"
#include "ngram-cache.h"
#include "sampling.h"
#include "log.h"
#include "llama.h"

#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

I/O Contract

Inputs

Name Type Required Description
-m string Yes Path to the GGUF model file
-p string Yes Input prompt for text generation
--lookup-cache-static string No Path to a pre-built static n-gram cache file
--lookup-cache-dynamic string No Path to a dynamic n-gram cache file (loaded and saved)
-n int No Maximum number of tokens to generate
--draft int No Maximum number of draft tokens per speculative step

Outputs

Name Type Description
stdout text Generated text with speculative decoding acceleration
dynamic cache file Updated dynamic n-gram cache file (if path specified)
return int Exit code: 0 on success, 1 on failure
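The accept/reject step described in the overview can be illustrated with a self-contained sketch. In the real example the drafted batch is compared against tokens sampled from the model; here a caller-supplied `sample_next` stub stands in for the model, and all names are illustrative assumptions:

```cpp
#include <functional>
#include <vector>

using Token = int;

// Verify a drafted batch: accept each drafted token that matches what the
// model samples at that position. On the first mismatch, keep the model's
// token instead (normal sampling) and discard the rest of the draft.
std::vector<Token> verify_draft(
        std::vector<Token> context,
        const std::vector<Token> & draft,
        const std::function<Token(const std::vector<Token> &)> & sample_next) {
    std::vector<Token> accepted;
    for (Token t : draft) {
        Token model_tok = sample_next(context);
        if (model_tok != t) {
            accepted.push_back(model_tok);  // fall back to the model's choice
            break;
        }
        accepted.push_back(t);              // drafted token confirmed
        context.push_back(t);
    }
    return accepted;
}
```

Every accepted run of length k yields k tokens for one verification pass, which is where the speedup comes from; a rejected token still makes progress, since the model's own sample is kept.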

Usage Examples

# Run lookup decoding with both static and dynamic caches
./build/bin/llama-lookup \
  -m model.gguf \
  -p "The meaning of life is" \
  --lookup-cache-static static_cache.bin \
  --lookup-cache-dynamic dynamic_cache.bin \
  --draft 10 \
  -n 200
