Implementation:Ollama Ollama Llama Sampling

Knowledge Sources	Ollama
Domains	Sampling, Inference
Last Updated	2025-02-15 00:00 GMT

Overview

Implements the high-level sampling pipeline that chains together multiple sampling strategies (top-k, top-p, temperature, penalties, grammar) for token selection during LLM inference.

Description

The common_sampler struct wraps a llama_sampler chain built from the configured parameters. common_sampler_init constructs the chain by adding samplers in order: penalties, top-k, typical-p, top-p, min-p, XTC, temperature, and finally the distribution sampler. The struct uses a ring_buffer to track the last N accepted tokens for repeat-penalty computation. common_sampler_sample applies the chain to the logits and, if the initially sampled token violates the grammar constraints, optionally re-samples with those constraints applied. The file also provides common_sampler_sample_and_accept_n for speculative decoding, which cross-references the sampled tokens against the draft tokens.
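
The chaining idea can be illustrated without llama.cpp. The following is a minimal, self-contained sketch, not the code in sampling.cpp: toy_candidate and the toy_* functions are hypothetical names, only three of the stages (top-k, top-p, temperature) are modelled, and the random draw of the distribution sampler is replaced by an explicit u parameter so that the result is reproducible.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a candidate token: id, raw logit, probability.
struct toy_candidate {
    int   id;
    float logit;
    float p;
};

// Sort by logit (descending) and keep the k highest candidates.
static void toy_top_k(std::vector<toy_candidate> & cur, size_t k) {
    assert(k > 0);
    std::sort(cur.begin(), cur.end(),
              [](const toy_candidate & a, const toy_candidate & b) { return a.logit > b.logit; });
    if (cur.size() > k) cur.resize(k);
}

// Numerically stable softmax over the current logits; fills in p.
static void toy_softmax(std::vector<toy_candidate> & cur) {
    float max_logit = cur.front().logit;
    for (const auto & c : cur) max_logit = std::max(max_logit, c.logit);
    float sum = 0.0f;
    for (auto & c : cur) { c.p = std::exp(c.logit - max_logit); sum += c.p; }
    for (auto & c : cur) c.p /= sum;
}

// Keep the smallest prefix whose cumulative probability reaches top_p.
// Assumes cur is already sorted in descending order (toy_top_k does this).
static void toy_top_p(std::vector<toy_candidate> & cur, float top_p) {
    toy_softmax(cur);
    float  cum  = 0.0f;
    size_t keep = cur.size();
    for (size_t i = 0; i < cur.size(); ++i) {
        cum += cur[i].p;
        if (cum >= top_p) { keep = i + 1; break; }
    }
    cur.resize(keep);
}

// Divide logits by the temperature; values below 1 sharpen the distribution.
static void toy_temperature(std::vector<toy_candidate> & cur, float temp) {
    assert(temp > 0.0f);
    for (auto & c : cur) c.logit /= temp;
}

// Apply the stages in sequence, then pick a token from the final distribution.
// u in [0, 1) stands in for the random draw of a real distribution sampler.
static int toy_sample(std::vector<toy_candidate> cur, size_t top_k, float top_p, float temp, float u) {
    assert(!cur.empty());
    toy_top_k(cur, top_k);
    toy_top_p(cur, top_p);
    toy_temperature(cur, temp);
    toy_softmax(cur);
    float cum = 0.0f;
    for (const auto & c : cur) {
        cum += c.p;
        if (u < cum) return c.id;
    }
    return cur.back().id;
}
```

Each stage only narrows or reshapes the candidate list left by the previous one, which is why the order of the chain matters. In this sketch, with logits 2, 1, 0, -1 and top_k = 3, top_p = 0.9, only the two highest-scoring tokens remain eligible by the time the final draw happens.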

Usage

Use this for all token sampling during inference. The sampling pipeline controls text generation quality and behavior through temperature, top-p, repetition penalties, grammar constraints, and other strategies.

Code Reference

Source Location

Repository: Ollama
File: llama/llama.cpp/common/sampling.cpp
Lines: 1-654

Signature

template<typename T>
struct ring_buffer {
    ring_buffer(size_t cap);
    T & front();
    T & back();
    void push_back(const T & value);
    T pop_front();
    const T & rat(size_t i) const;
    std::vector<T> to_vector() const;
    void clear();
    bool empty() const;
    size_t size() const;
};

struct common_sampler {
    common_params_sampling params;
    struct llama_sampler * chain;
    bool grammar;
    ring_buffer<llama_token> prev;
    std::vector<llama_token_data> cur;
    llama_token_data_array cur_p;
    void reset();
    void set_logits(struct llama_context * ctx, int idx);
};

struct common_sampler * common_sampler_init(const struct llama_model * model,
                                            const struct common_params_sampling & params);
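
The ring_buffer declaration above can be read as a fixed-capacity window over the most recent tokens. The sketch below is an assumption-laden illustration of that behaviour rather than the actual implementation: toy_ring_buffer is a hypothetical name, it covers only a subset of the declared methods, and it assumes that rat(i) counts backwards from the newest element and that push_back overwrites the oldest element once the buffer is full.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy fixed-capacity ring buffer; when full, push_back overwrites the oldest element.
template <typename T>
struct toy_ring_buffer {
    explicit toy_ring_buffer(size_t cap) : capacity(cap), data(cap) { assert(cap > 0); }

    void push_back(const T & value) {
        if (sz == capacity) {
            first = (first + 1) % capacity; // buffer full: drop the oldest element
        } else {
            sz++;
        }
        data[pos] = value;
        pos = (pos + 1) % capacity;
    }

    // Reverse access (assumed semantics): rat(0) is the most recently pushed element.
    const T & rat(size_t i) const {
        assert(i < sz);
        return data[(first + sz - i - 1) % capacity];
    }

    // Copy of the contents, ordered from oldest to newest.
    std::vector<T> to_vector() const {
        std::vector<T> out;
        out.reserve(sz);
        for (size_t i = 0; i < sz; i++) out.push_back(data[(first + i) % capacity]);
        return out;
    }

    size_t size() const { return sz; }
    bool empty() const { return sz == 0; }

    size_t capacity;       // maximum number of stored elements
    size_t sz = 0;         // current number of stored elements
    size_t first = 0;      // index of the oldest element
    size_t pos = 0;        // index where the next element is written
    std::vector<T> data;   // backing storage
};
```

A buffer of capacity N therefore always holds the last N accepted tokens, which is the window a repeat penalty needs, without ever reallocating.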

Import

#include "sampling.h"

I/O Contract

Inputs

Name	Type	Required	Description
model	const llama_model *	Yes	Model used to determine vocabulary size
params	common_params_sampling	Yes	Sampling configuration (temp, top_k, top_p, etc.)
ctx	llama_context *	Yes	Context with logits to sample from
idx	int	Yes	Token index in the batch to sample

Outputs

Name	Type	Description
token	llama_token	The sampled token
sampler	common_sampler *	Initialized sampler instance

Usage Examples

#include "sampling.h"

// Initialize sampler
common_params_sampling sparams;
sparams.temp = 0.8f;
sparams.top_p = 0.95f;
auto * smpl = common_sampler_init(model, sparams);

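// Sample a token (idx -1 selects the logits of the last token in the batch)
llama_token token = common_sampler_sample(smpl, ctx, -1);

// Accept the token so that it is recorded in the sampler's history
// (the final argument also advances the grammar state)
common_sampler_accept(smpl, token, true);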

// Clean up
common_sampler_free(smpl);
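
For speculative decoding, the Description mentions common_sampler_sample_and_accept_n, which checks sampled tokens against draft tokens. The sketch below shows one plausible shape of that comparison and is not the library function: toy_sample_and_accept_n and the sample_at callback are hypothetical, and the behaviour on a full match (sampling one extra token after the draft) is an assumption of this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// sample_at(i) is a hypothetical callback returning the token that the target
// model samples at draft position i (position draft.size() is the one after the draft).
static std::vector<int> toy_sample_and_accept_n(const std::function<int(size_t)> & sample_at,
                                                const std::vector<int> & draft) {
    assert(static_cast<bool>(sample_at)); // the callback must be set

    std::vector<int> result;
    result.reserve(draft.size() + 1);

    size_t i = 0;
    for (; i < draft.size(); i++) {
        const int id = sample_at(i);
        result.push_back(id);
        if (id != draft[i]) {
            // First disagreement: keep the sampled token, discard the rest of the draft.
            return result;
        }
    }

    // Every draft token matched, so one more token is sampled after the draft.
    result.push_back(sample_at(i));
    return result;
}
```

Under these assumptions the result always contains at least one token and at most draft.size() + 1 tokens, so a fully correct draft yields several tokens from a single evaluation of the target model.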

Related Pages

Principle:Ollama_Ollama_Sampling_Pipeline
