# Implementation: Ollama Llama Batch
| Knowledge Sources | |
|---|---|
| Domains | Inference, Batching |
| Last Updated | 2025-02-15 00:00 GMT |
## Overview
Implements batch allocation, validation, splitting, and management for processing groups of tokens through the model during inference.
## Description
The llama_batch_allocr class validates input batches (token IDs, sequence IDs, positions), auto-generates missing metadata (positions from memory state, default sequence IDs, output flags), and splits large batches into smaller llama_ubatch chunks that fit within compute limits. Three splitting strategies are supported:

- simple: sequential sub-batches
- equal: equal-sized sequence groups
- per-sequence: one sequence per micro-batch

It also tracks per-sequence position sets and sequence-coupling information for correct multi-sequence handling.
## Usage
Used internally by llama_context during encode/decode to manage input batches. Proper splitting ensures batches fit within hardware memory constraints while maintaining correct sequence state.
## Code Reference

### Source Location
- Repository: Ollama
- File: llama/llama.cpp/src/llama-batch.cpp
- Lines: 1-917
### Signature

```cpp
llama_batch_allocr::llama_batch_allocr(uint32_t n_pos_per_embd);

bool llama_batch_allocr::init(
        const llama_batch & batch_inp,
        const llama_vocab & vocab,
        const llama_memory_i * memory,
        uint32_t n_embd,
        uint32_t n_seq_max,
        bool output_all);

const llama_batch & llama_batch_allocr::get_batch() const;
uint32_t llama_batch_allocr::get_n_tokens() const;
uint32_t llama_batch_allocr::get_n_outputs() const;

void llama_batch_allocr::split_reset();
llama_ubatch llama_batch_allocr::split_simple(uint32_t n_ubatch);
llama_ubatch llama_batch_allocr::split_equal(uint32_t n_ubatch, bool sequential);
llama_ubatch llama_batch_allocr::split_seq(uint32_t n_ubatch);
```
### Import

```cpp
#include "llama-batch.h"
```
## I/O Contract

### Inputs

| Name | Type | Required | Description |
|---|---|---|---|
| batch_inp | const llama_batch & | Yes | User-provided input batch (init) |
| vocab | const llama_vocab & | Yes | Vocabulary for token ID validation (init) |
| memory | const llama_memory_i * | No | Memory system used to derive missing positions (init) |
| n_embd | uint32_t | Yes | Embedding size, used to validate embedding inputs (init) |
| n_seq_max | uint32_t | Yes | Maximum number of sequences allowed in the batch (init) |
| output_all | bool | Yes | Request output for every token in the batch (init) |
| n_ubatch | uint32_t | Yes | Maximum micro-batch size (split_simple, split_equal, split_seq) |
### Outputs
| Name | Type | Description |
|---|---|---|
| ubatch | llama_ubatch | Split micro-batch ready for compute graph |
| n_tokens | uint32_t | Total number of tokens in the batch |
| n_outputs | uint32_t | Number of output positions in the batch |
## Usage Examples

```cpp
#include "llama-batch.h"

// Create the batch allocator (1 position per embedding)
llama_batch_allocr allocr(1);

// Validate the user batch and auto-fill missing metadata
if (!allocr.init(batch, vocab, memory, n_embd, n_seq_max, false)) {
    // handle invalid batch
}

// Split into micro-batches and process each one
allocr.split_reset();
while (true) {
    llama_ubatch ubatch = allocr.split_simple(n_ubatch);
    if (ubatch.n_tokens == 0) {
        break;
    }
    // process ubatch through the compute graph
}
```