Jump to content

Connect SuperML | Leeroopedia MCP: Equip your AI agents with best practices, code verification, and debugging knowledge. Powered by Leeroo — building Organizational Superintelligence. Contact us at founders@leeroo.com.

Principle:Ollama Ollama Llama Cpp Common Library

From Leeroopedia
Revision as of 17:54, 16 February 2026 by Admin (talk | contribs) (Auto-imported from principles/Ollama_Ollama_Llama_Cpp_Common_Library.md)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Knowledge Sources
Domains Utility Library, llama.cpp
Last Updated 2025-02-15 00:00 GMT

Overview

A common utility library for an ML inference engine provides shared functionality that is used across multiple components of the system, including token sampling strategies, chat template application, string utilities, logging infrastructure, and configuration parsing. This shared layer reduces code duplication and ensures consistent behavior across different consumers of the inference engine.

Core Concepts

Token Sampling Infrastructure

Token sampling is the process of selecting the next token from the probability distribution produced by the model's forward pass. A sampling library provides a composable chain of sampling operations that transform raw logits into a token selection. Common sampling operations include temperature scaling (dividing logits by a temperature parameter to control randomness), top-k filtering (keeping only the k most probable tokens), top-p/nucleus filtering (keeping the smallest set of tokens whose cumulative probability exceeds p), min-p filtering (removing tokens with probability below a minimum threshold relative to the top token), repetition penalty (reducing the probability of recently generated tokens), and frequency/presence penalties. The sampling chain is configurable per request, allowing different generation strategies for different use cases.

Sampling Chain Composition

A well-designed sampling infrastructure uses a chain-of-responsibility pattern where individual sampling operations are composed into a pipeline. Each operation in the chain receives the candidate token array, modifies it (filtering, reweighting, or reordering), and passes it to the next operation. The final operation selects a token from the remaining candidates (greedy selection, random sampling, or Mirostat adaptive sampling). This compositional design allows new sampling strategies to be added without modifying existing operations and enables runtime configuration of the sampling pipeline.

Chat Template Utilities

The common library often includes utilities for applying chat templates to conversation messages at the native level. This supplements or parallels the application-level template system, providing a C/C++ implementation that can be called directly from the inference engine. These utilities handle special token insertion, role prefix/suffix formatting, and template variable substitution, ensuring that prompt formatting is consistent whether performed at the application level or the engine level.

Cross-Cutting Utilities

Utility functions that serve multiple components include string manipulation (Unicode handling, token-to-text conversion, text normalization), logging infrastructure (configurable log levels, structured output, performance counters), configuration parsing (reading model metadata, parameter validation, default value resolution), and error handling (error code definitions, descriptive error messages, error propagation). These utilities are compiled into a shared library or static archive that is linked by all components of the inference system.

Grammar-Constrained Generation

Advanced sampling libraries support grammar-constrained generation, where the sampling process is guided by a formal grammar (typically BNF or a regex) that restricts output to syntactically valid strings. This is implemented by maintaining a grammar parser state that tracks the current position in the grammar, computing which tokens are valid continuations at each step, and masking out invalid tokens before sampling. Grammar-constrained generation is essential for structured output formats such as JSON, function call arguments, or code generation.

Implementation Notes

In the Ollama codebase, the common library corresponds to llama.cpp's common directory, which provides shared utilities used by the inference engine and its consumers. The sampling infrastructure implements a configurable chain of sampling operations (temperature, top-k, top-p, min-p, repetition penalty, Mirostat) that are composed based on per-request parameters passed from the Go layer through the CGo bridge. Chat template utilities provide C-level template application using the Jinja2-compatible template engine built into llama.cpp. Additional utilities include logging with configurable verbosity, model metadata parsing, and tokenizer utilities. The common library is compiled as part of the llama.cpp build and linked into the CGo bridge shared object.

Related Pages

Page Connections

Double-click a node to navigate. Hold to expand connections.
Principle
Implementation
Heuristic
Environment