Implementation:Ollama Ollama Llama Sampling
| Knowledge Sources | |
|---|---|
| Domains | Sampling, Inference |
| Last Updated | 2025-02-15 00:00 GMT |
Overview
Implements the high-level sampling pipeline that chains together multiple sampling strategies (top-k, top-p, temperature, penalties, grammar) for token selection during LLM inference.
Description
The common_sampler struct wraps a llama_sampler chain built from the configured parameters. common_sampler_init constructs the chain by adding samplers in order: penalties, top-k, typical-p, top-p, min-p, XTC, temperature, and finally the distribution sampler. A ring_buffer tracks the last N accepted tokens for repeat penalty computation. common_sampler_sample applies the chain to the logits and, if the initially sampled token violates the grammar constraints, re-samples with the grammar applied. common_sampler_sample_and_accept_n supports speculative decoding: it samples at each draft position, compares the result against the corresponding draft token, and stops at the first mismatch.
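To make the idea of a sampler chain concrete, here is a minimal standalone sketch that applies a subset of the stages named above (top-k, top-p, temperature, then a distribution draw) to a plain vector of logits. It does not use the llama_sampler API: the Candidate type and every function name in it are hypothetical, and the real chain has more stages and different internals.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical candidate type for this sketch (not llama_token_data).
struct Candidate {
    int   id;     // token id
    float logit;  // raw score
    float p;      // probability, filled in by softmax()
};

// Sort by logit (descending) and convert logits to probabilities.
void softmax(std::vector<Candidate> & cur) {
    assert(!cur.empty());
    std::sort(cur.begin(), cur.end(),
              [](const Candidate & a, const Candidate & b) { return a.logit > b.logit; });
    const float max_logit = cur.front().logit;
    float sum = 0.0f;
    for (auto & c : cur) { c.p = std::exp(c.logit - max_logit); sum += c.p; }
    for (auto & c : cur) { c.p /= sum; }
}

// Keep only the k highest-scoring candidates (k == 0 disables the filter).
void apply_top_k(std::vector<Candidate> & cur, std::size_t k) {
    softmax(cur);
    if (k > 0 && k < cur.size()) { cur.resize(k); }
}

// Keep the smallest prefix whose cumulative probability reaches top_p.
void apply_top_p(std::vector<Candidate> & cur, float top_p) {
    softmax(cur);
    float cum = 0.0f;
    std::size_t keep = cur.size();
    for (std::size_t i = 0; i < cur.size(); ++i) {
        cum += cur[i].p;
        if (cum >= top_p) { keep = i + 1; break; }
    }
    cur.resize(keep);
}

// Divide logits by the temperature; values below 1 sharpen the distribution.
void apply_temperature(std::vector<Candidate> & cur, float temp) {
    assert(temp > 0.0f);
    for (auto & c : cur) { c.logit /= temp; }
}

// Final stage: pick a token given a uniform random number r in [0, 1).
int sample_dist(std::vector<Candidate> & cur, float r) {
    softmax(cur);
    float cum = 0.0f;
    for (const auto & c : cur) {
        cum += c.p;
        if (r < cum) { return c.id; }
    }
    return cur.back().id;
}

// Run the stages in a fixed order, mirroring the idea of a sampler chain.
int sample_chain(const std::vector<float> & logits, std::size_t top_k,
                 float top_p, float temp, float r) {
    std::vector<Candidate> cur;
    cur.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        cur.push_back({static_cast<int>(i), logits[i], 0.0f});
    }
    apply_top_k(cur, top_k);
    apply_top_p(cur, top_p);
    apply_temperature(cur, temp);
    return sample_dist(cur, r);
}
```

Passing the random number r in explicitly keeps the sketch deterministic. A caller would normally draw it from a seeded generator such as std::mt19937 with std::uniform_real_distribution.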
Usage
Use this for all token sampling during inference. The sampling pipeline controls text generation quality and behavior through temperature, top-p, repetition penalties, grammar constraints, and other strategies.
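As an illustration of how repetition penalties can shape generation, the following sketch adjusts the logits of tokens that appear in a window of recently accepted tokens. The function name and parameter names are hypothetical, and the formula is a common formulation of repeat, frequency, and presence penalties rather than a copy of the library code.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Penalize tokens that occur in last_tokens (hypothetical helper for this sketch).
//   penalty_repeat : > 1 pushes repeated tokens down (1 disables it)
//   penalty_freq   : subtracted once per occurrence
//   penalty_present: subtracted once if the token occurred at all
void apply_penalties(std::vector<float> & logits,
                     const std::vector<int> & last_tokens,
                     float penalty_repeat,
                     float penalty_freq,
                     float penalty_present) {
    // Count how often each token occurs in the recent window.
    std::unordered_map<int, int> counts;
    for (int tok : last_tokens) { ++counts[tok]; }

    for (const auto & kv : counts) {
        const int tok   = kv.first;
        const int count = kv.second;
        assert(tok >= 0 && static_cast<std::size_t>(tok) < logits.size());

        float & logit = logits[tok];
        // Dividing a positive logit and multiplying a non-positive one both
        // move the score toward "less likely".
        if (logit <= 0.0f) {
            logit *= penalty_repeat;
        } else {
            logit /= penalty_repeat;
        }
        logit -= static_cast<float>(count) * penalty_freq + penalty_present;
    }
}
```

With penalty_repeat set to 1 and the other two penalties set to 0, the function leaves the logits unchanged, which is a convenient way to check that the penalties are effectively disabled.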
Code Reference
Source Location
- Repository: Ollama
- File: llama/llama.cpp/common/sampling.cpp
- Lines: 1-654
Signature
template<typename T>
struct ring_buffer {
    ring_buffer(size_t cap);
    T & front();
    T & back();
    void push_back(const T & value);
    T pop_front();
    const T & rat(size_t i) const;
    std::vector<T> to_vector() const;
    void clear();
    bool empty() const;
    size_t size() const;
};
struct common_sampler {
    common_params_sampling params;
    struct llama_sampler * chain;
    bool grammar;
    ring_buffer<llama_token> prev;
    std::vector<llama_token_data> cur;
    llama_token_data_array cur_p;
    void reset();
    void set_logits(struct llama_context * ctx, int idx);
};
struct common_sampler * common_sampler_init(const struct llama_model * model,
                                            const struct common_params_sampling & params);
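The ring_buffer interface above can be illustrated with a reduced, standalone sketch. It assumes that push_back overwrites the oldest element once the buffer is full and that rat(i) is reverse access, with rat(0) returning the most recently pushed element. The type name ring_buffer_sketch is hypothetical and only a subset of the listed members is shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity buffer that keeps only the most recent `capacity` elements.
template <typename T>
struct ring_buffer_sketch {
    explicit ring_buffer_sketch(std::size_t cap) : capacity(cap), data(cap) {
        assert(cap > 0);
    }

    void push_back(const T & value) {
        if (sz == capacity) {
            first = (first + 1) % capacity;  // full: the oldest element is dropped
        } else {
            ++sz;
        }
        data[pos] = value;
        pos = (pos + 1) % capacity;
    }

    // Reverse access: rat(0) is the newest element, rat(size() - 1) the oldest.
    const T & rat(std::size_t i) const {
        assert(i < sz);
        return data[(first + sz - i - 1) % capacity];
    }

    // Copy the contents out in insertion order (oldest first).
    std::vector<T> to_vector() const {
        std::vector<T> out;
        out.reserve(sz);
        for (std::size_t i = 0; i < sz; ++i) {
            out.push_back(data[(first + i) % capacity]);
        }
        return out;
    }

    std::size_t size() const { return sz; }
    bool empty() const { return sz == 0; }

    std::size_t capacity;
    std::size_t sz    = 0;  // number of stored elements
    std::size_t first = 0;  // index of the oldest element
    std::size_t pos   = 0;  // index where the next element is written
    std::vector<T> data;
};
```

Storing the history this way keeps memory bounded at N tokens no matter how long generation runs, which fits the "last N accepted tokens" requirement of the repeat penalty.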
Import
#include "sampling.h"
I/O Contract
Inputs
| Name | Type | Required | Description |
|---|---|---|---|
| model | const llama_model * | Yes | Model passed to common_sampler_init; used to determine vocabulary size |
| params | const common_params_sampling & | Yes | Sampling configuration (temp, top_k, top_p, etc.) passed to common_sampler_init |
| ctx | llama_context * | Yes | Context holding the logits to sample from (common_sampler_sample) |
| idx | int | Yes | Index of the batch position whose logits are sampled (common_sampler_sample) |
Outputs
| Name | Type | Description |
|---|---|---|
| token | llama_token | The sampled token |
| sampler | common_sampler * | Initialized sampler instance |
Usage Examples
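#include "sampling.h"
// Assumes `model` (llama_model *) and `ctx` (llama_context *) are already
// loaded and that a batch has been decoded so logits are available.
// Initialize sampler
common_params_sampling sparams;
sparams.temp = 0.8f;
sparams.top_p = 0.95f;
auto * smpl = common_sampler_init(model, sparams);
// Sample a token from the logits of the last position in the batch (idx = -1)
llama_token token = common_sampler_sample(smpl, ctx, -1);
// Accept the token; `true` also advances the grammar state
common_sampler_accept(smpl, token, true);
// Clean up
common_sampler_free(smpl);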
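For the speculative decoding path mentioned in the Description, the control flow of comparing sampled tokens against draft tokens can be sketched without any llama.cpp dependency. In this sketch, sample_at is a hypothetical callback standing in for "sample and accept a token at batch index idx", and the token alias and function name are likewise assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using token = int;  // stand-in for llama_token in this sketch

// Sample at each draft position and stop at the first disagreement.
// idxs must contain one more entry than draft: if every draft token is
// confirmed, one extra token is sampled at the final index.
std::vector<token> sample_and_accept_n(const std::function<token(int)> & sample_at,
                                       const std::vector<int> & idxs,
                                       const std::vector<token> & draft) {
    assert(idxs.size() == draft.size() + 1);

    std::vector<token> result;
    result.reserve(idxs.size());

    std::size_t i = 0;
    for (; i < draft.size(); ++i) {
        const token id = sample_at(idxs[i]);
        result.push_back(id);
        if (draft[i] != id) {
            break;  // the sampled token replaces the rejected draft token
        }
    }

    if (i == draft.size()) {
        // The whole draft matched: sample one additional token.
        result.push_back(sample_at(idxs[i]));
    }
    return result;
}
```

The returned vector always contains at least one token, so the caller makes progress even when the first draft token is rejected.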