Implementation:Ggml org Llama cpp Batch

From Leeroopedia
Knowledge Sources
Domains Batch_Processing
Last Updated 2026-02-15 00:00 GMT

Overview

Implements batch allocation, validation, and splitting logic for processing token batches during inference in llama.cpp.

Description

The `llama_batch_allocr` class validates input batches (checking token IDs against the vocabulary size and validating sequence IDs), auto-generates missing fields (positions, sequence IDs, output flags), and tracks the set of positions seen for each sequence. It provides three splitting strategies: `split_simple` (arbitrary token groups), `split_equal` (equal-length sequence sets for efficient batched processing), and `split_seq` (one sequence set per ubatch). It also builds the `llama_ubatch` objects that hold the data pointers consumed by the compute graph.
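
The validation and auto-fill steps described above can be pictured with a minimal, self-contained sketch. It is not the llama.cpp implementation: `ToyBatch`, `toy_validate_tokens`, and `toy_fill_defaults` are hypothetical names, and the defaults chosen here (sequence 0, consecutive positions from a caller-supplied start, output only for the last token unless `output_all` is set) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a user batch; NOT the real llama_batch.
struct ToyBatch {
    std::vector<int32_t> token;   // token IDs
    std::vector<int32_t> pos;     // empty => positions were not provided
    std::vector<int32_t> seq_id;  // empty => sequence IDs were not provided
    std::vector<int8_t>  output;  // empty => output flags were not provided
};

// Returns false if any token ID falls outside [0, n_vocab).
bool toy_validate_tokens(const ToyBatch & b, int32_t n_vocab) {
    for (int32_t t : b.token) {
        if (t < 0 || t >= n_vocab) {
            return false;
        }
    }
    return true;
}

// Fills in missing fields with assumed defaults:
//   - sequence ID 0 for every token
//   - consecutive positions starting at pos_start
//   - output flag set only on the last token, unless output_all is true
void toy_fill_defaults(ToyBatch & b, int32_t pos_start, bool output_all) {
    const size_t n = b.token.size();
    if (b.seq_id.empty()) {
        b.seq_id.assign(n, 0);
    }
    if (b.pos.empty()) {
        b.pos.resize(n);
        for (size_t i = 0; i < n; ++i) {
            b.pos[i] = pos_start + static_cast<int32_t>(i);
        }
    }
    if (output_all) {
        b.output.assign(n, 1);  // overrides any per-token flags
    } else if (b.output.empty()) {
        b.output.assign(n, 0);
        if (n > 0) {
            b.output[n - 1] = 1;
        }
    }
    // Every per-token field must end up with one entry per token.
    assert(b.pos.size() == n && b.seq_id.size() == n && b.output.size() == n);
}
```

Calling `toy_validate_tokens` before `toy_fill_defaults` mirrors the order suggested by the description: reject a malformed batch first, then complete the fields the caller left out.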

Usage

This is a core internal component that mediates between the user-facing `llama_batch` API and the internal `llama_ubatch` format. It is used automatically during `llama_decode()` and `llama_encode()` calls to prepare token data for the compute graph and KV cache.

Code Reference

Source Location

Signature

class llama_batch_allocr {
public:
    llama_batch_allocr(uint32_t n_pos_per_embd);

    // Initialize and validate a batch
    bool init(
        const llama_batch & batch_inp,
        const llama_vocab & vocab,
        const llama_memory_i * memory,
        uint32_t n_embd,
        uint32_t n_seq_max,
        bool output_all);

    // Splitting strategies
    llama_ubatch split_simple(uint32_t n_ubatch);
    llama_ubatch split_equal(uint32_t n_ubatch);
    llama_ubatch split_seq(uint32_t n_ubatch);

    // State queries
    bool get_ubatch(llama_ubatch & ubatch) const;
    int64_t n_tokens() const;
    void clear();

private:
    uint32_t n_pos_per_embd;
    int debug;
    llama_batch batch;
    const llama_vocab * vocab;
    std::vector<std::set<llama_pos>> seq_pos;
    std::vector<std::vector<bool>> seq_cpl;
    std::vector<int32_t> seq_idx;
    // ... additional internal state
};
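
The `seq_pos` member is declared as a vector of position sets, one per sequence. As a rough illustration of how such a structure could be filled and queried, the sketch below uses hypothetical helpers (`toy_build_seq_pos`, `toy_is_contiguous`) and a `toy_pos` alias standing in for `llama_pos`. The gap check is an assumption about what a validator might do with these sets, not a statement of what llama.cpp does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using toy_pos = int32_t;  // stand-in for llama_pos (assumption)

// Collects the set of positions seen for each sequence ID.
// pos[i] and seq[i] describe token i; IDs outside [0, n_seq_max) are skipped
// here, whereas a real validator would more likely reject the batch.
std::vector<std::set<toy_pos>> toy_build_seq_pos(
        const std::vector<toy_pos> & pos,
        const std::vector<int32_t> & seq,
        uint32_t n_seq_max) {
    assert(pos.size() == seq.size());
    std::vector<std::set<toy_pos>> seq_pos(n_seq_max);
    for (size_t i = 0; i < pos.size(); ++i) {
        if (seq[i] < 0 || static_cast<uint32_t>(seq[i]) >= n_seq_max) {
            continue;
        }
        seq_pos[seq[i]].insert(pos[i]);
    }
    return seq_pos;
}

// True if the positions form a gap-free range [min, max].
bool toy_is_contiguous(const std::set<toy_pos> & s) {
    if (s.empty()) {
        return true;
    }
    return static_cast<size_t>(*s.rbegin() - *s.begin() + 1) == s.size();
}
```

Because `std::set` keeps its elements ordered and unique, the smallest and largest positions of a sequence are available through `begin()` and `rbegin()` without a separate scan.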

Import

#include "llama-batch.h"
#include "llama-impl.h"
#include "llama-vocab.h"
#include "llama-memory.h"
#include <cassert>
#include <cstring>
#include <algorithm>
#include <sstream>

I/O Contract

Inputs

| Name | Type | Required | Description |
|------|------|----------|-------------|
| batch_inp | llama_batch | Yes | User-provided batch containing tokens, positions, sequence IDs, and output flags |
| vocab | llama_vocab | Yes | Vocabulary reference for token ID validation |
| memory | llama_memory_i* | No | Memory interface for sequence position tracking |
| n_embd | uint32_t | Yes | Embedding dimension size |
| n_seq_max | uint32_t | Yes | Maximum number of sequences allowed |
| output_all | bool | No | Whether to mark all tokens for output (overrides per-token logits flag) |
| n_ubatch | uint32_t | Yes | Maximum number of tokens per micro-batch for splitting |

Outputs

| Name | Type | Description |
|------|------|-------------|
| ubatch | llama_ubatch | Micro-batch with data pointers ready for the compute graph |
| success | bool | Whether batch initialization and validation succeeded |
| n_tokens | int64_t | Total number of tokens in the validated batch |

Usage Examples

// Internal usage within llama_context::decode()
llama_batch_allocr batch_allocr(n_pos_per_embd);

// Validate the user batch and auto-fill any missing fields
bool ok = batch_allocr.init(batch, vocab, memory, n_embd, n_seq_max, output_all);
if (!ok) {
    return -1; // validation failed
}

// Split into micro-batches and process each one.
// Assumption: an empty ubatch (n_tokens == 0) signals that the batch is exhausted.
while (true) {
    llama_ubatch ubatch = batch_allocr.split_equal(n_ubatch);
    if (ubatch.n_tokens == 0) {
        break;
    }
    // process_ubatch() is a placeholder for running the compute graph
    process_ubatch(ubatch);
}
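
The splitting loop above can also be tried in isolation with a toy model. The following is a sketch under stated assumptions rather than the library's code: `ToyUbatch`, `ToySplitter`, and `toy_chunk_sizes` are hypothetical, micro-batches are represented as plain index lists, and only a contiguous-chunk strategy in the spirit of `split_simple` is modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a micro-batch: indices into the original batch.
struct ToyUbatch {
    std::vector<uint32_t> idx;
};

// Hypothetical splitter that hands out contiguous chunks of token indices.
class ToySplitter {
public:
    explicit ToySplitter(uint32_t n_tokens) : n_tokens_(n_tokens) {}

    // Returns the next chunk of at most n_ubatch indices.
    // The result is empty once every token has been handed out.
    ToyUbatch split_simple(uint32_t n_ubatch) {
        assert(n_ubatch > 0);
        ToyUbatch ub;
        const uint32_t remaining = n_tokens_ - cur_;
        const uint32_t end = cur_ + std::min(remaining, n_ubatch);
        for (; cur_ < end; ++cur_) {
            ub.idx.push_back(cur_);
        }
        return ub;
    }

private:
    uint32_t n_tokens_;
    uint32_t cur_ = 0;  // index of the next token that has not been used yet
};

// Drains a splitter and records the size of each micro-batch it produced.
std::vector<uint32_t> toy_chunk_sizes(uint32_t n_tokens, uint32_t n_ubatch) {
    ToySplitter splitter(n_tokens);
    std::vector<uint32_t> sizes;
    while (true) {
        ToyUbatch ub = splitter.split_simple(n_ubatch);
        if (ub.idx.empty()) {
            break;  // batch exhausted
        }
        sizes.push_back(static_cast<uint32_t>(ub.idx.size()));
    }
    return sizes;
}
```

Terminating on an empty micro-batch keeps the loop independent of how the splitter tracks its progress internally, which is the main design point the sketch is meant to show.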
