
Implementation: llama_sampler_sample (ggml-org/llama.cpp)

From Leeroopedia
Knowledge Sources Domains Last Updated
ggml-org/llama.cpp Token Sampling, Sampler Chain, Logit Processing 2026-02-14

Overview

Description

llama_sampler_sample selects and accepts a token from the logits produced by the most recent llama_decode call. It retrieves the logits for the specified output position, constructs a candidate token array, applies all samplers in the chain to filter and transform the distribution, selects the winning token, and accepts it (updating internal sampler state such as repetition penalty history).

This function is a convenience shorthand that combines logit retrieval, sampler application, token selection, and acceptance into a single call. It also supports backend-accelerated sampling where the token may have already been selected by the compute backend during the decode step.

Usage

#include "llama.h"

// After llama_decode(ctx, batch):
llama_token new_token = llama_sampler_sample(smpl, ctx, -1);
// idx = -1 means the last output position (most common for autoregressive generation)

if (llama_vocab_is_eog(vocab, new_token)) {
    // End of generation
}

Code Reference

Source Location

File Line(s) Type
include/llama.h 1458 Declaration
src/llama-sampler.cpp 806-873 Implementation

Signature

LLAMA_API llama_token llama_sampler_sample(
        struct llama_sampler * smpl,
        struct llama_context * ctx,
        int32_t idx);

Import

#include "llama.h"

I/O Contract

Inputs

Parameter Type Description
smpl struct llama_sampler * A sampler or sampler chain. Typically created with llama_sampler_chain_init and populated with one or more samplers via llama_sampler_chain_add.
ctx struct llama_context * Inference context from which to retrieve logits. Must have completed a llama_decode call.
idx int32_t Index of the output position to sample from. Use -1 to select the last output position (standard for autoregressive generation). For batch processing with multiple output positions, use the specific index.

Outputs

Return Type Description
sampled token llama_token The selected token ID. This token has already been accepted by the sampler (i.e., llama_sampler_accept has been called internally), updating any stateful samplers like repetition penalty trackers.

Internal Algorithm

The function performs the following steps (as implemented in src/llama-sampler.cpp:806-873):

  1. Check for backend-sampled token: If a backend sampler has already selected a token during decode, return it immediately without running CPU samplers
  2. Retrieve logits: Get the logit vector for the specified output position from the context
  3. Build candidate array: Create a llama_token_data_array containing each token's ID, logit, and initial probability (0.0)
  4. Apply sampler chain: Call llama_sampler_apply(smpl, &cur_p), which runs each sampler in the chain sequentially
  5. Select token: Read the token at the selected index of the candidate array
  6. Accept token: Call llama_sampler_accept(smpl, token) to update stateful samplers
  7. Return the selected token
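Steps 2 through 5 can be sketched in plain C++. This is a simplified model, not the actual llama.cpp implementation: `llama_token_data`/`llama_token_data_array` are reduced to small structs, and the sampler chain is reduced to a single callback.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Simplified stand-ins for llama.cpp types (illustrative only).
struct TokenData      { int32_t id; float logit; float p; };
struct TokenDataArray { std::vector<TokenData> data; int64_t selected = -1; };

// A "sampler" here is just a function that filters/transforms the
// candidate array and sets `selected`; a real chain runs several in turn.
using Sampler = std::function<void(TokenDataArray &)>;

// Steps 2-5: build candidates from logits, apply the sampler, and read
// the token at the selected index. (Step 6, accept, is omitted here.)
int32_t sample_from_logits(const std::vector<float> & logits, Sampler sampler) {
    TokenDataArray cur_p;
    cur_p.data.reserve(logits.size());
    for (int32_t id = 0; id < (int32_t) logits.size(); ++id) {
        cur_p.data.push_back({id, logits[id], 0.0f});  // initial p = 0.0
    }
    sampler(cur_p);                                    // apply chain
    return cur_p.data[cur_p.selected].id;              // select token
}

// Greedy "sampler": pick the argmax logit, mirroring what
// llama_sampler_init_greedy does conceptually.
inline void greedy(TokenDataArray & cur_p) {
    cur_p.selected = 0;
    for (size_t i = 1; i < cur_p.data.size(); ++i) {
        if (cur_p.data[i].logit > cur_p.data[cur_p.selected].logit) {
            cur_p.selected = (int64_t) i;
        }
    }
}
```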

Sampler Chain Initialization

Creating a Sampler Chain

// Initialize chain with default parameters
struct llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
sparams.no_perf = false;  // enable performance counters

struct llama_sampler * smpl = llama_sampler_chain_init(sparams);

Adding Samplers to the Chain

// The chain takes ownership of added samplers (do not free them individually)
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95, 1));
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(42));  // final selection with seed

Available Samplers

Sampler Constructor Description
Greedy llama_sampler_init_greedy() Select the highest-probability token
Distribution llama_sampler_init_dist(uint32_t seed) Probabilistic selection from the distribution
Top-K llama_sampler_init_top_k(int32_t k) Keep only the k most probable tokens
Top-P (Nucleus) llama_sampler_init_top_p(float p, size_t min_keep) Keep smallest set with cumulative probability >= p
Min-P llama_sampler_init_min_p(float p, size_t min_keep) Keep tokens with probability >= p * max_probability
Typical llama_sampler_init_typical(float p, size_t min_keep) Keep tokens near expected information content
Temperature llama_sampler_init_temp(float t) Scale logits by 1/t before softmax
Dynamic Temp llama_sampler_init_temp_ext(float t, float delta, float exponent) Entropy-based adaptive temperature
XTC llama_sampler_init_xtc(float p, float t, size_t min_keep, uint32_t seed) Exclude Top Choices: probabilistically removes the highest-probability tokens
Top-n-sigma llama_sampler_init_top_n_sigma(float n) Keep tokens within n standard deviations
Mirostat v1 llama_sampler_init_mirostat(int32_t n_vocab, uint32_t seed, float tau, float eta, int32_t m) Adaptive target-surprise sampling v1
Mirostat v2 llama_sampler_init_mirostat_v2(uint32_t seed, float tau, float eta) Adaptive target-surprise sampling v2
Penalties llama_sampler_init_penalties(int32_t last_n, float repeat, float freq, float present) Repeat/frequency/presence penalties
DRY llama_sampler_init_dry(...) Don't Repeat Yourself anti-repetition
Grammar llama_sampler_init_grammar(const llama_vocab * vocab, const char * grammar_str, const char * grammar_root) GBNF grammar-constrained sampling
Logit Bias llama_sampler_init_logit_bias(int32_t n_vocab, int32_t n_logit_bias, const llama_logit_bias * logit_bias) Manual logit adjustments
Infill llama_sampler_init_infill(const llama_vocab * vocab) Fill-in-the-middle optimized sampling
Adaptive-P llama_sampler_init_adaptive_p(float target, float decay, uint32_t seed) Adaptive target probability sampling
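The filtering samplers above can be understood in isolation. Below is a minimal sketch of Top-K followed by Top-P over a softmaxed, descending-sorted distribution; the function names are illustrative and the real llama.cpp versions operate on the candidate array rather than a bare probability vector:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Softmax over logits (stabilized by subtracting the max logit).
std::vector<float> softmax(std::vector<float> logits) {
    float mx = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float & x : logits) { x = std::exp(x - mx); sum += x; }
    for (float & x : logits) { x /= sum; }
    return logits;
}

// Top-K: keep the k highest probabilities (input sorted descending).
void top_k(std::vector<float> & probs, size_t k) {
    if (probs.size() > k) probs.resize(k);
}

// Top-P (nucleus): keep the smallest prefix with cumulative probability
// >= p, but never fewer than min_keep entries (input sorted descending).
void top_p(std::vector<float> & probs, float p, size_t min_keep) {
    float  cum  = 0.0f;
    size_t keep = probs.size();
    for (size_t i = 0; i < probs.size(); ++i) {
        cum += probs[i];
        if (cum >= p && i + 1 >= min_keep) { keep = i + 1; break; }
    }
    probs.resize(keep);
}
```

Chaining them in this order (Top-K first, then Top-P) matches the common filter ordering shown in the examples above.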

Usage Examples

Greedy Sampling (from examples/simple/simple.cpp)

// Create a sampler chain with just greedy selection
auto sparams = llama_sampler_chain_default_params();
sparams.no_perf = false;

llama_sampler * smpl = llama_sampler_chain_init(sparams);
llama_sampler_chain_add(smpl, llama_sampler_init_greedy());

// Generation loop
llama_token new_token_id;
int n_decode = 0;
for (int n_pos = 0; n_pos + batch.n_tokens < n_prompt + n_predict; ) {
    if (llama_decode(ctx, batch)) {
        fprintf(stderr, "failed to decode\n");
        return 1;
    }
    n_pos += batch.n_tokens;

    // Sample the next token (idx = -1 for last output position)
    new_token_id = llama_sampler_sample(smpl, ctx, -1);

    if (llama_vocab_is_eog(vocab, new_token_id)) {
        break;
    }

    // Print the token
    char buf[128];
    int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
    std::string s(buf, n);
    printf("%s", s.c_str());
    fflush(stdout);

    // Prepare next batch
    batch = llama_batch_get_one(&new_token_id, 1);
    n_decode += 1;
}

// Cleanup
llama_sampler_free(smpl);

Creative Sampling with Temperature and Top-p

llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());

// Apply penalties first (need full candidate list)
llama_sampler_chain_add(smpl, llama_sampler_init_penalties(64, 1.1, 0.0, 0.0));

// Filter candidates
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95, 1));

// Adjust distribution shape
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));

// Final probabilistic selection
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

llama_token token = llama_sampler_sample(smpl, ctx, -1);
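The temperature step in the chain above scales logits by 1/t before the final selection. A standalone sketch (illustrative, not the llama.cpp kernel) shows the effect: t < 1 sharpens the distribution toward greedy, t > 1 flattens it toward uniform.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Scale logits by 1/t, then softmax. As t -> 0 the result approaches a
// one-hot argmax; as t grows, it approaches a uniform distribution.
std::vector<float> temp_softmax(std::vector<float> logits, float t) {
    for (float & x : logits) { x /= t; }
    float mx = *std::max_element(logits.begin(), logits.end());
    float sum = 0.0f;
    for (float & x : logits) { x = std::exp(x - mx); sum += x; }
    for (float & x : logits) { x /= sum; }
    return logits;
}
```

This is why the example uses t = 0.8 for "creative" output: it keeps some randomness while still favoring high-logit tokens.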

Mirostat Sampling

llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());

// Mirostat v2 is a selecting sampler (should be the only/last sampler)
llama_sampler_chain_add(smpl, llama_sampler_init_mirostat_v2(
    42,     // seed
    5.0,    // tau (target surprise)
    0.1     // eta (learning rate)
));

llama_token token = llama_sampler_sample(smpl, ctx, -1);
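The adaptive behaviour can be illustrated with a toy version of one Mirostat-v2-style step (a deliberate simplification of the published algorithm, not the llama.cpp implementation; among other things it picks greedily among survivors instead of sampling): discard tokens whose surprise exceeds mu, pick one, then nudge mu so the observed surprise tracks the target tau.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One toy Mirostat-v2-style step over a probability distribution.
// Surprise of a token is -log2(p). Tokens above the threshold mu are
// discarded; after picking, mu moves toward keeping surprise near tau.
size_t mirostat_v2_step(const std::vector<float> & probs,
                        float & mu, float tau, float eta) {
    size_t best  = 0;
    bool   found = false;
    for (size_t i = 0; i < probs.size(); ++i) {
        float surprise = -std::log2(probs[i]);
        if (surprise <= mu && (!found || probs[i] > probs[best])) {
            best  = i;
            found = true;
        }
    }
    // If every token exceeded mu, fall back to the most probable one.
    if (!found) {
        for (size_t i = 1; i < probs.size(); ++i) {
            if (probs[i] > probs[best]) best = i;
        }
    }
    // Update: mu -= eta * (observed_surprise - tau). Observed surprise
    // below tau raises mu, admitting more surprising tokens next step.
    float observed = -std::log2(probs[best]);
    mu -= eta * (observed - tau);
    return best;
}
```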
