Heuristic:Ggml org Ggml Sampling Parameter Defaults

Knowledge Sources	GGML GPT-2 example defaults
Domains	LLMs, Optimization
Last Updated	2026-02-10 07:40 GMT

Overview

The GGML GPT-2 example ships with default sampling parameters for text generation (top_k=40, top_p=0.9, temperature=0.9) that trade off diversity against coherence.

Description

GGML's GPT-2 example establishes default sampling parameters that balance generation quality with diversity. The defaults lean mildly conservative: a temperature just below 1.0 and a truncated candidate set favor coherent output over creative exploration. The sampling pipeline applies temperature scaling first (dividing the logits by the temperature), then top-k filtering (keeping the 40 highest-scoring tokens), then top-p nucleus sampling (keeping the smallest set of top tokens whose cumulative probability reaches 0.9). Repeat penalty is disabled by default (1.0); when enabled, it looks back over the last 64 tokens.
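The three-stage pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a copy of GGML's sampler: the names `filter_candidates` and `sample_token` are hypothetical, and details such as tie-breaking and renormalization are choices made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <random>
#include <utility>
#include <vector>

// Hypothetical helper (not a GGML function): returns (probability, token id)
// pairs that survive temperature scaling, top-k, and top-p, renormalized to sum to 1.
std::vector<std::pair<double, int>> filter_candidates(
        const std::vector<float> & logits, int top_k, double top_p, double temp) {
    assert(!logits.empty() && top_k > 0 && temp > 0.0);

    // 1. Temperature: divide every logit by temp.
    std::vector<std::pair<double, int>> cand;
    cand.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        cand.emplace_back(logits[i] / temp, static_cast<int>(i));
    }

    // 2. Top-k: keep the k highest scaled logits, sorted in descending order.
    const std::size_t k = std::min<std::size_t>(top_k, cand.size());
    std::partial_sort(cand.begin(), cand.begin() + static_cast<std::ptrdiff_t>(k),
                      cand.end(), std::greater<std::pair<double, int>>());
    cand.resize(k);

    // Softmax over the survivors (subtract the max for numerical stability).
    const double max_logit = cand.front().first;
    double sum = 0.0;
    for (auto & c : cand) { c.first = std::exp(c.first - max_logit); sum += c.first; }
    for (auto & c : cand) { c.first /= sum; }

    // 3. Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    double cumulative = 0.0;
    std::size_t keep = cand.size();
    for (std::size_t i = 0; i < cand.size(); ++i) {
        cumulative += cand[i].first;
        if (cumulative >= top_p) { keep = i + 1; break; }
    }
    cand.resize(keep);

    // Renormalize the kept probabilities.
    for (auto & c : cand) { c.first /= cumulative; }
    return cand;
}

// Hypothetical helper: draw one token id from the filtered distribution.
int sample_token(const std::vector<float> & logits, int top_k, double top_p,
                 double temp, std::mt19937 & rng) {
    const auto cand = filter_candidates(logits, top_k, top_p, temp);
    std::vector<double> probs;
    for (const auto & c : cand) { probs.push_back(c.first); }
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return cand[static_cast<std::size_t>(dist(rng))].second;
}
```

With the defaults from this page, a call might look like `sample_token(logits, 40, 0.9, 0.9, rng)`, where `logits` holds one score per vocabulary entry and `rng` is a seeded `std::mt19937`.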

Usage

Use these defaults as a starting point for text generation tasks. Lower the temperature (0.1-0.5) for factual or near-deterministic output. Raise it (1.0-1.5) for creative writing. Reduce top_k for more focused generation. Enable repeat penalty (1.1-1.3) if output shows excessive repetition.
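As a rough illustration of these adjustments, the sketch below starts from the defaults and overrides one or two fields per use case. The struct `SamplingParams` and the preset functions are hypothetical names introduced here, and the specific values are simply picked from inside the ranges given above.

```cpp
#include <cstdint>

// Hypothetical mirror of the sampling fields; defaults copied from this page.
struct SamplingParams {
    int32_t top_k          = 40;
    float   top_p          = 0.9f;
    float   temp           = 0.9f;
    int32_t repeat_last_n  = 64;
    float   repeat_penalty = 1.0f;  // 1.0 means disabled
};

// Factual output: temperature inside the 0.1-0.5 range.
SamplingParams factual_preset() {
    SamplingParams p;
    p.temp  = 0.3f;
    p.top_k = 20;  // assumption: a smaller candidate set for more focused output
    return p;
}

// Creative writing: temperature inside the 1.0-1.5 range.
SamplingParams creative_preset() {
    SamplingParams p;
    p.temp = 1.2f;
    return p;
}

// Repetitive output: penalty inside the 1.1-1.3 range.
SamplingParams anti_repetition_preset() {
    SamplingParams p;
    p.repeat_penalty = 1.2f;
    return p;
}
```

Changing one parameter at a time makes it easier to attribute a change in output quality to a specific setting.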

The Insight (Rule of Thumb)

Action: Use temperature=0.9, top_k=40, top_p=0.9 as starting defaults for general-purpose text generation.
Value: These values produce varied but coherent text for GPT-2 class models.
Trade-off: Lower temperature increases coherence but reduces creativity; higher top_k increases diversity but may produce off-topic tokens.
Context size: Default KV cache context is 2048 tokens, which covers most single-turn generation tasks.
Thread count: Default `n_threads = min(4, hardware_concurrency)` caps CPU usage to avoid starving other processes.
Batch size: Default prompt processing batch size is 32 tokens, suitable for consumer hardware.
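One detail of the thread-count default is worth illustrating: `std::thread::hardware_concurrency()` is permitted to return 0 when the core count cannot be determined, in which case `min(4, hardware_concurrency)` as written evaluates to 0. The sketch below shows one way to guard against that. The helper names `clamp_threads` and `default_thread_count` are hypothetical and are not part of GGML.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>

// Hypothetical helper: clamp a reported core count into the range [1, cap].
int32_t clamp_threads(unsigned int reported, int32_t cap = 4) {
    // A reported value of 0 means "unknown", so fall back to one thread.
    const int32_t n = static_cast<int32_t>(reported);
    return std::max<int32_t>(1, std::min(cap, n));
}

// Hypothetical helper: the capped default with the zero case handled.
int32_t default_thread_count() {
    return clamp_threads(std::thread::hardware_concurrency());
}
```

On a 16-core machine this yields 4, on a dual-core machine 2, and on a platform that reports 0 it yields 1 instead of 0.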

Reasoning

The default values align with common practices in the LLM community:

top_k=40: Originally from the OpenAI GPT-2 release blog post. Keeps enough tokens for diversity without including very-low-probability noise.
top_p=0.9: From the Nucleus Sampling paper (Holtzman et al., 2019). Dynamically adapts the candidate set size based on the probability distribution shape.
temp=0.9: Slightly below 1.0 to sharpen the distribution without making it deterministic. A temperature of 1.0 would use the model's raw probabilities.
repeat_penalty=1.0: Disabled by default because not all use cases need it, and incorrect values can degrade output quality.
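To make the repeat-penalty entry concrete, here is a minimal sketch of one commonly used formulation: logits of tokens seen in the last `repeat_last_n` positions are divided by the penalty when positive and multiplied by it when negative, so a penalty above 1.0 always makes a repeated token less likely. The function name `apply_repeat_penalty` is hypothetical, and this sketch is an assumption about the general technique rather than a transcription of GGML's code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: penalize tokens that occur in the last repeat_last_n
// entries of history. A penalty of exactly 1.0 leaves the logits untouched.
void apply_repeat_penalty(std::vector<float> & logits,
                          const std::vector<int> & history,
                          std::size_t repeat_last_n, float repeat_penalty) {
    if (repeat_penalty == 1.0f) {
        return;  // disabled, matching the default
    }

    // Mark the token ids that fall inside the lookback window.
    const std::size_t start =
        history.size() > repeat_last_n ? history.size() - repeat_last_n : 0;
    std::vector<bool> seen(logits.size(), false);
    for (std::size_t i = start; i < history.size(); ++i) {
        const int tok = history[i];
        if (tok >= 0 && static_cast<std::size_t>(tok) < logits.size()) {
            seen[static_cast<std::size_t>(tok)] = true;
        }
    }

    // Push the logit of each seen token toward "less likely".
    for (std::size_t i = 0; i < logits.size(); ++i) {
        if (!seen[i]) {
            continue;
        }
        if (logits[i] < 0.0f) {
            logits[i] *= repeat_penalty;
        } else {
            logits[i] /= repeat_penalty;
        }
    }
}
```

The sign check matters: dividing a negative logit by a penalty above 1.0 would raise it toward zero and make the repeated token more likely, which is one way an incorrect penalty can degrade output.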

Code Evidence

Default parameter definition from `examples/common.h:18-34`:

struct gpt_params {
    int32_t seed         = -1;   // RNG seed
    int32_t n_threads    = std::min(4, (int32_t) std::thread::hardware_concurrency());
    int32_t n_predict    = 200;  // new tokens to predict
    int32_t n_parallel   = 1;    // number of parallel streams
    int32_t n_batch      = 32;   // batch size for prompt processing
    int32_t n_ctx        = 2048; // context size (this is the KV cache max size)

    // sampling parameters
    int32_t top_k          = 40;
    float   top_p          = 0.9f;
    float   temp           = 0.9f;
    int32_t repeat_last_n  = 64;
    float   repeat_penalty = 1.00f;
};
