# Implementation: Ollama Llama Batch
| Knowledge Sources | |
|---|---|
| Domains | Inference, Batching |
| Last Updated | 2025-02-15 00:00 GMT |
## Overview
Implements batch allocation, validation, splitting, and management for processing groups of tokens through the model during inference.
## Description
The llama_batch_allocr class validates input batches (token IDs, sequence IDs, positions), auto-generates missing metadata (positions from memory state, default sequence IDs, output flags), and splits large batches into smaller llama_ubatch chunks that fit within compute limits. Three splitting strategies are supported:

- simple: sequential sub-batches
- equal: equal-sized sequence groups
- per-sequence: one sequence per micro-batch

It also tracks per-sequence position sets and sequence-coupling information for correct multi-sequence handling.
## Usage
Used internally by llama_context during encode/decode to manage input batches. Proper splitting ensures batches fit within hardware memory constraints while maintaining correct sequence state.
## Code Reference

### Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-batch.cpp
- Lines: 1-917
### Signature

```cpp
llama_batch_allocr::llama_batch_allocr(uint32_t n_pos_per_embd);

bool llama_batch_allocr::init(
        const llama_batch & batch_inp,
        const llama_vocab & vocab,
        const llama_memory_i * memory,
        uint32_t n_embd,
        uint32_t n_seq_max,
        bool output_all);

const llama_batch & llama_batch_allocr::get_batch() const;
uint32_t llama_batch_allocr::get_n_tokens() const;
uint32_t llama_batch_allocr::get_n_outputs() const;

void llama_batch_allocr::split_reset();
llama_ubatch llama_batch_allocr::split_simple(uint32_t n_ubatch);
llama_ubatch llama_batch_allocr::split_equal(uint32_t n_ubatch, bool sequential);
llama_ubatch llama_batch_allocr::split_seq(uint32_t n_ubatch);
```
### Import

```cpp
#include "llama-batch.h"
```
## I/O Contract

### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| batch_inp | const llama_batch & | Yes | User-provided input batch (init) |
| vocab | const llama_vocab & | Yes | Vocabulary for token ID validation (init) |
| memory | const llama_memory_i * | No | Memory system used to derive missing positions (init) |
| n_embd | uint32_t | Yes | Embedding size, used to validate embedding inputs (init) |
| n_seq_max | uint32_t | Yes | Maximum number of sequences allowed in the batch (init) |
| output_all | bool | Yes | Request output for every token in the batch (init) |
| n_ubatch | uint32_t | Yes | Maximum micro-batch size (split_simple, split_equal, split_seq) |
### Outputs
| Name | Type | Description |
|---|---|---|
| ubatch | llama_ubatch | Split micro-batch ready for compute graph |
| n_tokens | uint32_t | Total number of tokens in the batch |
| n_outputs | uint32_t | Number of output positions in the batch |
## Usage Examples

```cpp
#include "llama-batch.h"

// Create the batch allocator (1 position per embedding)
llama_batch_allocr allocr(1);

// Validate the user batch and auto-fill missing metadata
if (!allocr.init(batch, vocab, memory, n_embd, n_seq_max, false)) {
    // handle invalid batch
}

// Split into micro-batches and process each one
allocr.split_reset();
while (true) {
    llama_ubatch ubatch = allocr.split_simple(n_ubatch);
    if (ubatch.n_tokens == 0) {
        break;
    }
    // process ubatch through the compute graph
}
```