| Knowledge Sources | Domains | Last Updated |
| --- | --- | --- |
| ggml-org/llama.cpp | Token Sampling, Sampler Chain, Logit Processing | 2026-02-14 |
Overview
Description
llama_sampler_sample selects and accepts a token from the logits produced by the most recent llama_decode call. It retrieves the logits for the specified output position, constructs a candidate token array, applies all samplers in the chain to filter and transform the distribution, selects the winning token, and accepts it (updating internal sampler state such as repetition penalty history).
This function is a convenience shorthand that combines logit retrieval, sampler application, token selection, and acceptance into a single call. It also supports backend-accelerated sampling where the token may have already been selected by the compute backend during the decode step.
Usage
#include "llama.h"
// After llama_decode(ctx, batch):
llama_token new_token = llama_sampler_sample(smpl, ctx, -1);
// idx = -1 means the last output position (most common for autoregressive generation)
if (llama_vocab_is_eog(vocab, new_token)) {
    // End of generation
}
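Note that sampling at a given idx requires the corresponding token to have requested logits during decode. llama_batch_get_one enables logits for its final token automatically, but a hand-built batch must opt in explicitly. A minimal sketch, assuming batch is a llama_batch you allocated yourself (e.g. via llama_batch_init):
// Mark the last token as an output position before calling llama_decode;
// otherwise there are no logits for llama_sampler_sample to read at idx = -1.
batch.logits[batch.n_tokens - 1] = true;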
Code Reference
Source Location
| File | Line(s) | Type |
| --- | --- | --- |
| include/llama.h | 1458 | Declaration |
| src/llama-sampler.cpp | 806-873 | Implementation |
Signature
LLAMA_API llama_token llama_sampler_sample(
        struct llama_sampler * smpl,
        struct llama_context * ctx,
        int32_t idx);
Import
#include "llama.h"
I/O Contract
Inputs
| Parameter | Type | Description |
| --- | --- | --- |
| smpl | struct llama_sampler * | A sampler or sampler chain. Typically created with llama_sampler_chain_init and populated with one or more samplers via llama_sampler_chain_add. |
| ctx | struct llama_context * | Inference context from which to retrieve logits. Must have completed a llama_decode call. |
| idx | int32_t | Index of the output position to sample from. Use -1 to select the last output position (standard for autoregressive generation). For batch processing with multiple output positions, use the specific index (see the sketch after the Outputs table). |
Outputs
| Return | Type | Description |
| --- | --- | --- |
| sampled token | llama_token | The selected token ID. This token has already been accepted by the sampler (i.e., llama_sampler_accept has been called internally), updating any stateful samplers like repetition penalty trackers. |
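To illustrate the idx parameter, below is a hedged sketch of sampling one token per sequence from a multi-sequence batch. It assumes the last token of each of n_seqs sequences set logits = true, so output indices 0..n_seqs-1 correspond to the sequences in submission order; n_seqs and the result handling are placeholders, not part of the API:
// Hypothetical: one sampled token per sequence after a batched llama_decode.
// Caveat: a single stateful chain (penalties, mirostat, ...) shares its
// state across sequences; per-sequence chains are usually preferable.
for (int32_t i = 0; i < n_seqs; ++i) {
    llama_token tok = llama_sampler_sample(smpl, ctx, i);
    // ... append tok to sequence i's output ...
}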
Internal Algorithm
The function performs the following steps (as implemented in src/llama-sampler.cpp:806-873; a simplified sketch follows the list):
- Check for backend-sampled token: if a backend sampler already selected a token during decode, return it immediately without running the CPU samplers
- Retrieve logits: get the logit vector for the specified output position from the context
- Build candidate array: create a llama_token_data_array containing each token's ID, logit, and initial probability (0.0)
- Apply sampler chain: call llama_sampler_apply(smpl, &cur_p), which runs each sampler in the chain sequentially
- Select token: read the token at the selected index of the candidate array
- Accept token: call llama_sampler_accept(smpl, token) to update stateful samplers
- Return the selected token
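A simplified sketch of steps 2-6, assuming the vocab accessors llama_model_get_vocab / llama_vocab_n_tokens and a std::vector for candidate storage; it paraphrases the implementation and omits the backend-sampled fast path and error handling:
// Retrieve logits for the requested output position
const float * logits = llama_get_logits_ith(ctx, idx);
const llama_vocab * vocab = llama_model_get_vocab(llama_get_model(ctx));
const int n_vocab = llama_vocab_n_tokens(vocab);
// Build the candidate array: one entry per vocab token, p initialized to 0.0
std::vector<llama_token_data> cur;
cur.reserve(n_vocab);
for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
    cur.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
}
llama_token_data_array cur_p = { cur.data(), cur.size(), /*selected*/ -1, /*sorted*/ false };
// Run every sampler in the chain, then read the winning token
llama_sampler_apply(smpl, &cur_p);
llama_token token = cur_p.data[cur_p.selected].id;
// Update stateful samplers (penalties, mirostat, grammar, ...)
llama_sampler_accept(smpl, token);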
Sampler Chain Initialization
Creating a Sampler Chain
// Initialize chain with default parameters
struct llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
sparams.no_perf = false; // enable performance counters
struct llama_sampler * smpl = llama_sampler_chain_init(sparams);
Adding Samplers to the Chain
// The chain takes ownership of added samplers (do not free them individually)
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95, 1));
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(42)); // final selection with seed
Available Samplers
| Sampler | Constructor | Description |
| --- | --- | --- |
| Greedy | llama_sampler_init_greedy() | Select the highest-probability token |
| Distribution | llama_sampler_init_dist(uint32_t seed) | Probabilistic selection from the distribution |
| Top-K | llama_sampler_init_top_k(int32_t k) | Keep only the k most probable tokens |
| Top-P (Nucleus) | llama_sampler_init_top_p(float p, size_t min_keep) | Keep the smallest set with cumulative probability >= p |
| Min-P | llama_sampler_init_min_p(float p, size_t min_keep) | Keep tokens with probability >= p * max_probability |
| Typical | llama_sampler_init_typical(float p, size_t min_keep) | Keep tokens near the expected information content |
| Temperature | llama_sampler_init_temp(float t) | Scale logits by 1/t before softmax |
| Dynamic Temp | llama_sampler_init_temp_ext(float t, float delta, float exponent) | Entropy-based adaptive temperature |
| XTC | llama_sampler_init_xtc(float p, float t, size_t min_keep, uint32_t seed) | eXclude Top Choices sampler (probabilistically removes the most likely tokens) |
| Top-n-sigma | llama_sampler_init_top_n_sigma(float n) | Keep tokens within n standard deviations of the maximum logit |
| Mirostat v1 | llama_sampler_init_mirostat(int32_t n_vocab, uint32_t seed, float tau, float eta, int32_t m) | Adaptive target-surprise sampling v1 |
| Mirostat v2 | llama_sampler_init_mirostat_v2(uint32_t seed, float tau, float eta) | Adaptive target-surprise sampling v2 |
| Penalties | llama_sampler_init_penalties(int32_t last_n, float repeat, float freq, float present) | Repeat/frequency/presence penalties |
| DRY | llama_sampler_init_dry(...) | "Don't Repeat Yourself" anti-repetition sampler |
| Grammar | llama_sampler_init_grammar(const llama_vocab * vocab, const char * grammar_str, const char * grammar_root) | GBNF grammar-constrained sampling |
| Logit Bias | llama_sampler_init_logit_bias(int32_t n_vocab, int32_t n_logit_bias, const llama_logit_bias * logit_bias) | Manual logit adjustments |
| Infill | llama_sampler_init_infill(const llama_vocab * vocab) | Fill-in-the-middle optimized sampling |
| Adaptive-P | llama_sampler_init_adaptive_p(float target, float decay, uint32_t seed) | Adaptive target probability sampling |
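As a concrete use of one entry from the table, here is a hedged sketch of banning a single token with the Logit Bias sampler; token_id_to_ban is a placeholder (obtain real IDs via llama_tokenize), and INFINITY comes from <cmath>:
// Assumption: token_id_to_ban holds a valid token ID for the loaded vocab.
// A bias of -INFINITY removes the token from consideration entirely.
const llama_logit_bias ban[] = { { token_id_to_ban, -INFINITY } };
llama_sampler_chain_add(smpl,
        llama_sampler_init_logit_bias(llama_vocab_n_tokens(vocab), 1, ban));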
Usage Examples
Greedy Sampling (from examples/simple/simple.cpp)
// Create a sampler chain with just greedy selection
auto sparams = llama_sampler_chain_default_params();
sparams.no_perf = false;
llama_sampler * smpl = llama_sampler_chain_init(sparams);
llama_sampler_chain_add(smpl, llama_sampler_init_greedy());
// Generation loop
llama_token new_token_id;
for (int n_pos = 0; n_pos + batch.n_tokens < n_prompt + n_predict; ) {
    if (llama_decode(ctx, batch)) {
        fprintf(stderr, "failed to decode\n");
        return 1;
    }
    n_pos += batch.n_tokens;
    // Sample the next token (idx = -1 for the last output position)
    new_token_id = llama_sampler_sample(smpl, ctx, -1);
    if (llama_vocab_is_eog(vocab, new_token_id)) {
        break;
    }
    // Print the token
    char buf[128];
    int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
    std::string s(buf, n);
    printf("%s", s.c_str());
    fflush(stdout);
    // Prepare the next single-token batch
    batch = llama_batch_get_one(&new_token_id, 1);
    n_decode += 1;
}
// Cleanup
llama_sampler_free(smpl);
Creative Sampling with Temperature and Top-p
llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
// Apply penalties first (need full candidate list)
llama_sampler_chain_add(smpl, llama_sampler_init_penalties(64, 1.1, 0.0, 0.0));
// Filter candidates
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95, 1));
// Adjust distribution shape
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));
// Final probabilistic selection
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
llama_token token = llama_sampler_sample(smpl, ctx, -1);
Mirostat Sampling
llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
// Mirostat v2 is a selecting sampler (should be the only/last sampler)
llama_sampler_chain_add(smpl, llama_sampler_init_mirostat_v2(
    42,   // seed
    5.0f, // tau (target surprise)
    0.1f  // eta (learning rate)
));
llama_token token = llama_sampler_sample(smpl, ctx, -1);