Heuristic:Ggml org Ggml Sampling Parameter Defaults
| Knowledge Sources | |
|---|---|
| Domains | LLMs, Optimization |
| Last Updated | 2026-02-10 07:40 GMT |
Overview
Default sampling parameters for text generation: top_k=40, top_p=0.9, temperature=0.9, providing a balanced trade-off between diversity and coherence.
Description
GGML's GPT-2 example establishes default sampling parameters that balance generation quality with diversity. The defaults lean slightly toward coherence: a temperature just below 1.0 mildly sharpens the distribution, and the top-k and top-p filters trim the low-probability tail. The sampling pipeline applies temperature scaling first, then top-k filtering (keeping the 40 highest-scoring tokens), then top-p nucleus sampling (keeping the smallest set of top tokens whose cumulative probability reaches 0.9). Repeat penalty is disabled by default (a value of 1.0 leaves logits unchanged); when it is enabled, it looks back over the last 64 tokens.
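The three-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the GGML implementation: the names `Candidate`, `filter_candidates`, and `sample_token` are hypothetical, and the exact tie-breaking and renormalization choices are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Hypothetical type: (token id, score). The score is a scaled logit first,
// then a probability after the softmax step.
using Candidate = std::pair<int32_t, double>;

// Hypothetical helper (not the ggml API): returns the candidates that survive
// temperature scaling, top-k filtering, and top-p truncation, sorted by
// descending probability and renormalized to sum to 1.
std::vector<Candidate> filter_candidates(const std::vector<float> & logits,
                                         int32_t top_k, double top_p, double temp) {
    assert(!logits.empty() && top_k > 0 && temp > 0.0);

    // 1. Temperature scaling: divide every logit by temp.
    std::vector<Candidate> cand;
    cand.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        cand.emplace_back((int32_t) i, logits[i] / temp);
    }

    // 2. Top-k: keep only the k highest scaled logits.
    const std::size_t k = std::min((std::size_t) top_k, cand.size());
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
        [](const Candidate & a, const Candidate & b) { return a.second > b.second; });
    cand.resize(k);

    // Softmax over the survivors (max subtracted for numerical stability).
    const double max_logit = cand.front().second;
    double sum = 0.0;
    for (Candidate & c : cand) { c.second = std::exp(c.second - max_logit); sum += c.second; }
    for (Candidate & c : cand) { c.second /= sum; }

    // 3. Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    if (top_p < 1.0) {
        double cum = 0.0;
        for (std::size_t i = 0; i < cand.size(); ++i) {
            cum += cand[i].second;
            if (cum >= top_p) {
                cand.resize(i + 1);
                for (Candidate & c : cand) { c.second /= cum; }
                break;
            }
        }
    }
    return cand;
}

// Draws one token id from the filtered, renormalized candidates.
int32_t sample_token(const std::vector<float> & logits, int32_t top_k,
                     double top_p, double temp, std::mt19937 & rng) {
    const std::vector<Candidate> cand = filter_candidates(logits, top_k, top_p, temp);
    std::vector<double> probs;
    for (const Candidate & c : cand) { probs.push_back(c.second); }
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return cand[dist(rng)].first;
}
```

With the defaults, a call would look like `sample_token(logits, 40, 0.9, 0.9, rng)`. Because top-p runs after top-k in this sketch, the nucleus is computed over at most 40 tokens.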
Usage
Use these defaults as a starting point for text generation tasks. Adjust temperature downward (0.1-0.5) for factual or deterministic output. Increase temperature (1.0-1.5) for creative writing. Reduce top_k for more focused generation. Enable repeat penalty (1.1-1.3) if output shows excessive repetition.
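To see why lowering the temperature makes output more deterministic, the sketch below computes a softmax over temperature-scaled logits. The helper name `softmax_with_temperature` is hypothetical and the logits in the usage note are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: softmax of (logits / temp).
// Lower temp concentrates probability on the highest logit;
// higher temp flattens the distribution.
std::vector<double> softmax_with_temperature(const std::vector<float> & logits, double temp) {
    assert(!logits.empty() && temp > 0.0);
    // Subtract the maximum logit before exponentiating for numerical stability.
    const double max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<double> probs(logits.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        probs[i] = std::exp((logits[i] - max_logit) / temp);
        sum += probs[i];
    }
    for (double & p : probs) { p /= sum; }
    return probs;
}
```

For example, with logits `{2.0f, 1.0f, 0.0f}`, the probability of the first token is higher at `temp = 0.3` than at the default `0.9`, and higher at `0.9` than at `1.5`.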
The Insight (Rule of Thumb)
- Action: Use temperature=0.9, top_k=40, top_p=0.9 as starting defaults for general-purpose text generation.
- Value: These values produce varied but coherent text for GPT-2 class models.
- Trade-off: Lower temperature increases coherence but reduces creativity; higher top_k increases diversity but may produce off-topic tokens.
- Context size: Default KV cache context is 2048 tokens, which covers most single-turn generation tasks.
- Thread count: Default `n_threads = min(4, hardware_concurrency)` caps CPU usage to avoid starving other processes.
- Batch size: Default prompt processing batch size is 32 tokens, suitable for consumer hardware.
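The thread-count and batch-size defaults above can be illustrated with a short sketch. Both function names are hypothetical. The clamp to at least one thread is an addition of this sketch (the standard allows `std::thread::hardware_concurrency()` to return 0 when the value cannot be determined), not something the quoted default does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <thread>
#include <utility>
#include <vector>

// Mirrors the n_threads default of min(4, hardware_concurrency), plus a guard
// that is an assumption of this sketch: clamp to at least 1 in case
// hardware_concurrency() reports 0.
int32_t default_thread_count() {
    const int32_t hw = (int32_t) std::thread::hardware_concurrency();
    return std::max<int32_t>(1, std::min<int32_t>(4, hw));
}

// Hypothetical helper: splits a prompt of n_tokens into (offset, count)
// chunks of at most n_batch tokens, the way a prompt would be fed to the
// model batch by batch.
std::vector<std::pair<int32_t, int32_t>> split_into_batches(int32_t n_tokens, int32_t n_batch) {
    assert(n_tokens >= 0 && n_batch > 0);
    std::vector<std::pair<int32_t, int32_t>> batches;
    for (int32_t off = 0; off < n_tokens; off += n_batch) {
        batches.emplace_back(off, std::min(n_batch, n_tokens - off));
    }
    return batches;
}
```

With the default `n_batch = 32`, a 70-token prompt is split into three chunks of 32, 32, and 6 tokens.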
Reasoning
The default values align with common practices in the LLM community:
- top_k=40: Originally from the OpenAI GPT-2 release blog post. Keeps enough tokens for diversity without including very-low-probability noise.
- top_p=0.9: Nucleus (top-p) sampling was introduced by Holtzman et al. (2019). It adapts the candidate set size to the shape of the probability distribution, and 0.9 is a widely used threshold.
- temp=0.9: Slightly below 1.0 to sharpen the distribution without making it deterministic. A temperature of 1.0 would use the model's raw probabilities.
- repeat_penalty=1.0: Disabled by default because not all use cases need it, and incorrect values can degrade output quality.
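For reference, one common formulation of a repeat penalty is sketched below: logits of tokens seen in the last `repeat_last_n` positions are divided by the penalty when positive and multiplied by it when negative. Whether GGML's example uses exactly this rule is an assumption here, and `apply_repeat_penalty` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: penalizes, in place, the logits of tokens that appear
// in the last repeat_last_n entries of history. A penalty of 1.0 is a no-op,
// which is why that value means "disabled".
void apply_repeat_penalty(std::vector<float> & logits,
                          const std::vector<int32_t> & history,
                          int32_t repeat_last_n, float repeat_penalty) {
    assert(repeat_last_n >= 0 && repeat_penalty > 0.0f);
    if (repeat_penalty == 1.0f) { return; }

    const std::size_t window = std::min(history.size(), (std::size_t) repeat_last_n);
    std::vector<bool> seen(logits.size(), false); // penalize each token once

    for (std::size_t i = history.size() - window; i < history.size(); ++i) {
        const int32_t id = history[i];
        if (id < 0 || (std::size_t) id >= logits.size() || seen[id]) { continue; }
        seen[id] = true;
        // Positive logits shrink toward zero; negative logits move further down.
        logits[id] = logits[id] > 0.0f ? logits[id] / repeat_penalty
                                       : logits[id] * repeat_penalty;
    }
}
```

In a pipeline like the one described earlier, this step would run on the raw logits before temperature scaling and filtering. Larger penalties suppress repeated tokens more strongly, which is also why overly large values can degrade otherwise valid output.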
Code Evidence
Default parameter definition from `examples/common.h:18-34`:

    struct gpt_params {
        int32_t seed       = -1;   // RNG seed
        int32_t n_threads  = std::min(4, (int32_t) std::thread::hardware_concurrency());
        int32_t n_predict  = 200;  // new tokens to predict
        int32_t n_parallel = 1;    // number of parallel streams
        int32_t n_batch    = 32;   // batch size for prompt processing
        int32_t n_ctx      = 2048; // context size (this is the KV cache max size)

        // sampling parameters
        int32_t top_k          = 40;
        float   top_p          = 0.9f;
        float   temp           = 0.9f;
        int32_t repeat_last_n  = 64;
        float   repeat_penalty = 1.00f;
    };