| Knowledge Sources | Domains | Last Updated |
| --- | --- | --- |
| ggml-org/llama.cpp | Token Sampling, Sampler Chain, Logit Processing | 2026-02-14 |
Overview
Description
llama_sampler_sample selects and accepts a token from the logits produced by the most recent llama_decode call. It retrieves the logits for the specified output position, constructs a candidate token array, applies all samplers in the chain to filter and transform the distribution, selects the winning token, and accepts it (updating internal sampler state such as repetition penalty history).
This function is a convenience shorthand that combines logit retrieval, sampler application, token selection, and acceptance into a single call. It also supports backend-accelerated sampling where the token may have already been selected by the compute backend during the decode step.
Usage
#include "llama.h"
// After llama_decode(ctx, batch):
llama_token new_token = llama_sampler_sample(smpl, ctx, -1);
// idx = -1 means the last output position (most common for autoregressive generation)
if (llama_vocab_is_eog(vocab, new_token)) {
    // End of generation
}
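Note that sampling at a given idx requires the corresponding token to have requested logits during decode. llama_batch_get_one enables logits for its final token automatically, but a hand-built batch must opt in explicitly. A minimal sketch, assuming batch is a llama_batch you allocated yourself (e.g. via llama_batch_init):
// Mark the last token as an output position before calling llama_decode;
// otherwise there are no logits for llama_sampler_sample to read at idx = -1.
batch.logits[batch.n_tokens - 1] = true;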
Code Reference
Source Location
| File | Line(s) | Type |
| --- | --- | --- |
| include/llama.h | 1458 | Declaration |
| src/llama-sampler.cpp | 806-873 | Implementation |
Signature
LLAMA_API llama_token llama_sampler_sample(
        struct llama_sampler * smpl,
        struct llama_context * ctx,
        int32_t idx);
Import
#include "llama.h"
I/O Contract
Inputs
| Parameter | Type | Description |
| --- | --- | --- |
| smpl | struct llama_sampler * | A sampler or sampler chain. Typically created with llama_sampler_chain_init and populated with one or more samplers via llama_sampler_chain_add. |
| ctx | struct llama_context * | Inference context from which to retrieve logits. Must have completed a llama_decode call. |
| idx | int32_t | Index of the output position to sample from. Use -1 to select the last output position (standard for autoregressive generation). For batch processing with multiple output positions, use the specific index (see the sketch after the Outputs table). |
Outputs
| Return | Type | Description |
| --- | --- | --- |
| sampled token | llama_token | The selected token ID. This token has already been accepted by the sampler (i.e., llama_sampler_accept has been called internally), updating any stateful samplers like repetition penalty trackers. |
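To illustrate the idx parameter, below is a hedged sketch of sampling one token per sequence from a multi-sequence batch. It assumes the last token of each of n_seqs sequences set logits = true, so output indices 0..n_seqs-1 correspond to the sequences in submission order; n_seqs and the result handling are placeholders, not part of the API:
// Hypothetical: one sampled token per sequence after a batched llama_decode.
// Caveat: a single stateful chain (penalties, mirostat, ...) shares its
// state across sequences; per-sequence chains are usually preferable.
for (int32_t i = 0; i < n_seqs; ++i) {
    llama_token tok = llama_sampler_sample(smpl, ctx, i);
    // ... append tok to sequence i's output ...
}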
Internal Algorithm
The function performs the following steps (as implemented in src/llama-sampler.cpp:806-873; a simplified sketch follows the list):
- Check for backend-sampled token: if a backend sampler already selected a token during decode, return it immediately without running the CPU samplers
- Retrieve logits: get the logit vector for the specified output position from the context
- Build candidate array: create a llama_token_data_array containing each token's ID, logit, and initial probability (0.0)
- Apply sampler chain: call llama_sampler_apply(smpl, &cur_p), which runs each sampler in the chain sequentially
- Select token: read the token at the selected index of the candidate array
- Accept token: call llama_sampler_accept(smpl, token) to update stateful samplers
- Return the selected token
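A simplified sketch of steps 2-6, assuming the vocab accessors llama_model_get_vocab / llama_vocab_n_tokens and a std::vector for candidate storage; it paraphrases the implementation and omits the backend-sampled fast path and error handling:
// Retrieve logits for the requested output position
const float * logits = llama_get_logits_ith(ctx, idx);
const llama_vocab * vocab = llama_model_get_vocab(llama_get_model(ctx));
const int n_vocab = llama_vocab_n_tokens(vocab);
// Build the candidate array: one entry per vocab token, p initialized to 0.0
std::vector<llama_token_data> cur;
cur.reserve(n_vocab);
for (llama_token token_id = 0; token_id < n_vocab; token_id++) {
    cur.emplace_back(llama_token_data{token_id, logits[token_id], 0.0f});
}
llama_token_data_array cur_p = { cur.data(), cur.size(), /*selected*/ -1, /*sorted*/ false };
// Run every sampler in the chain, then read the winning token
llama_sampler_apply(smpl, &cur_p);
llama_token token = cur_p.data[cur_p.selected].id;
// Update stateful samplers (penalties, mirostat, grammar, ...)
llama_sampler_accept(smpl, token);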
Sampler Chain Initialization
Creating a Sampler Chain
// Initialize chain with default parameters
struct llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
sparams.no_perf = false; // enable performance counters
struct llama_sampler * smpl = llama_sampler_chain_init(sparams);
Adding Samplers to the Chain
// The chain takes ownership of added samplers (do not free them individually)
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95, 1));
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(42)); // final selection with seed
Available Samplers
| Sampler | Constructor | Description |
| --- | --- | --- |
| Greedy | llama_sampler_init_greedy() | Select the highest-probability token |
| Distribution | llama_sampler_init_dist(uint32_t seed) | Probabilistic selection from the distribution |
| Top-K | llama_sampler_init_top_k(int32_t k) | Keep only the k most probable tokens |
| Top-P (Nucleus) | llama_sampler_init_top_p(float p, size_t min_keep) | Keep the smallest set with cumulative probability >= p |
| Min-P | llama_sampler_init_min_p(float p, size_t min_keep) | Keep tokens with probability >= p * max_probability |
| Typical | llama_sampler_init_typical(float p, size_t min_keep) | Keep tokens near the expected information content |
| Temperature | llama_sampler_init_temp(float t) | Scale logits by 1/t before softmax |
| Dynamic Temp | llama_sampler_init_temp_ext(float t, float delta, float exponent) | Entropy-based adaptive temperature |
| XTC | llama_sampler_init_xtc(float p, float t, size_t min_keep, uint32_t seed) | eXclude Top Choices sampler (probabilistically removes the most likely tokens) |
| Top-n-sigma | llama_sampler_init_top_n_sigma(float n) | Keep tokens within n standard deviations of the maximum logit |
| Mirostat v1 | llama_sampler_init_mirostat(int32_t n_vocab, uint32_t seed, float tau, float eta, int32_t m) | Adaptive target-surprise sampling v1 |
| Mirostat v2 | llama_sampler_init_mirostat_v2(uint32_t seed, float tau, float eta) | Adaptive target-surprise sampling v2 |
| Penalties | llama_sampler_init_penalties(int32_t last_n, float repeat, float freq, float present) | Repeat/frequency/presence penalties |
| DRY | llama_sampler_init_dry(...) | "Don't Repeat Yourself" anti-repetition sampler |
| Grammar | llama_sampler_init_grammar(const llama_vocab * vocab, const char * grammar_str, const char * grammar_root) | GBNF grammar-constrained sampling |
| Logit Bias | llama_sampler_init_logit_bias(int32_t n_vocab, int32_t n_logit_bias, const llama_logit_bias * logit_bias) | Manual logit adjustments |
| Infill | llama_sampler_init_infill(const llama_vocab * vocab) | Fill-in-the-middle optimized sampling |
| Adaptive-P | llama_sampler_init_adaptive_p(float target, float decay, uint32_t seed) | Adaptive target probability sampling |
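As a concrete use of one entry from the table, here is a hedged sketch of banning a single token with the Logit Bias sampler; token_id_to_ban is a placeholder (obtain real IDs via llama_tokenize), and INFINITY comes from <cmath>:
// Assumption: token_id_to_ban holds a valid token ID for the loaded vocab.
// A bias of -INFINITY removes the token from consideration entirely.
const llama_logit_bias ban[] = { { token_id_to_ban, -INFINITY } };
llama_sampler_chain_add(smpl,
        llama_sampler_init_logit_bias(llama_vocab_n_tokens(vocab), 1, ban));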
Usage Examples
Greedy Sampling (from examples/simple/simple.cpp)
// Create a sampler chain with just greedy selection
auto sparams = llama_sampler_chain_default_params();
sparams.no_perf = false;
llama_sampler * smpl = llama_sampler_chain_init(sparams);
llama_sampler_chain_add(smpl, llama_sampler_init_greedy());
// Generation loop
llama_token new_token_id;
for (int n_pos = 0; n_pos + batch.n_tokens < n_prompt + n_predict; ) {
    if (llama_decode(ctx, batch)) {
        fprintf(stderr, "failed to decode\n");
        return 1;
    }
    n_pos += batch.n_tokens;
    // Sample the next token (idx = -1 for the last output position)
    new_token_id = llama_sampler_sample(smpl, ctx, -1);
    if (llama_vocab_is_eog(vocab, new_token_id)) {
        break;
    }
    // Print the token
    char buf[128];
    int n = llama_token_to_piece(vocab, new_token_id, buf, sizeof(buf), 0, true);
    std::string s(buf, n);
    printf("%s", s.c_str());
    fflush(stdout);
    // Prepare the next single-token batch
    batch = llama_batch_get_one(&new_token_id, 1);
    n_decode += 1;
}
// Cleanup
llama_sampler_free(smpl);
Creative Sampling with Temperature and Top-p
llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
// Apply penalties first (need full candidate list)
llama_sampler_chain_add(smpl, llama_sampler_init_penalties(64, 1.1, 0.0, 0.0));
// Filter candidates
llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_top_p(0.95, 1));
// Adjust distribution shape
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8));
// Final probabilistic selection
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
llama_token token = llama_sampler_sample(smpl, ctx, -1);
Mirostat Sampling
llama_sampler * smpl = llama_sampler_chain_init(llama_sampler_chain_default_params());
// Mirostat v2 is a selecting sampler (should be the only/last sampler)
llama_sampler_chain_add(smpl, llama_sampler_init_mirostat_v2(
    42,   // seed
    5.0f, // tau (target surprise)
    0.1f  // eta (learning rate)
));
llama_token token = llama_sampler_sample(smpl, ctx, -1);