Implementation:Ggml org Llama cpp Batch

Knowledge Sources	Ggml_org_Llama_cpp
Domains	Batch_Processing
Last Updated	2026-02-15 00:00 GMT

Overview

Implements batch allocation, validation, and splitting logic for processing token batches during inference in llama.cpp.

Description

The `llama_batch_allocr` class validates input batches (checking token IDs against the vocabulary size and validating sequence IDs), auto-generates missing fields (positions, sequence IDs, output flags), and tracks the set of positions seen for each sequence. It provides three splitting strategies: `split_simple` (arbitrary token groups), `split_equal` (equal-length sequence sets for efficient batched processing), and `split_seq` (one sequence set per ubatch). Each strategy builds `llama_ubatch` objects that hold the data pointers consumed by the compute graph.
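The validation and auto-generation steps described above can be illustrated with a minimal sketch. It is not the llama.cpp implementation: `toy_batch` and `toy_init` are hypothetical names, each token is assumed to belong to exactly one sequence, and generated positions are assumed to start at 0 (the real code may instead continue from positions already held in memory).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a user batch; NOT the real llama_batch layout.
struct toy_batch {
    std::vector<int32_t> token;   // token IDs (required)
    std::vector<int32_t> pos;     // positions; empty means "generate for me"
    std::vector<int32_t> seq_id;  // one sequence ID per token; empty means "all 0"
    std::vector<int8_t>  output;  // output flags; empty means "last token only"
};

// Validates the batch and fills in missing fields.
// Returns false on any validation failure.
bool toy_init(toy_batch & b, int32_t n_vocab, uint32_t n_seq_max, bool output_all) {
    const size_t n = b.token.size();

    // 1. every token ID must lie inside the vocabulary
    for (int32_t t : b.token) {
        if (t < 0 || t >= n_vocab) return false;
    }

    // 2. missing sequence IDs default to sequence 0; all IDs must be in range
    if (b.seq_id.empty()) b.seq_id.assign(n, 0);
    if (b.seq_id.size() != n) return false;
    for (int32_t s : b.seq_id) {
        if (s < 0 || (uint32_t) s >= n_seq_max) return false;
    }

    // 3. missing positions become consecutive per sequence (assumed to start at 0)
    if (b.pos.empty()) {
        std::vector<int32_t> next(n_seq_max, 0);
        b.pos.resize(n);
        for (size_t i = 0; i < n; ++i) b.pos[i] = next[b.seq_id[i]]++;
    }
    if (b.pos.size() != n) return false;

    // 4. output flags: output_all overrides everything; otherwise, if none
    //    were given, only the last token is marked for output
    if (output_all) {
        b.output.assign(n, 1);
    } else if (b.output.empty()) {
        b.output.assign(n, 0);
        if (n > 0) b.output[n - 1] = 1;
    }
    if (b.output.size() != n) return false;

    assert(b.pos.size() == n && b.seq_id.size() == n);
    return true;
}
```

Calling `toy_init` on a batch that only has `token` set leaves it with generated positions, sequence ID 0 for every token, and a single output flag on the final token.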

Usage

This is a core internal component that mediates between the user-facing `llama_batch` API and the internal `llama_ubatch` format. It is used automatically during `llama_decode()` and `llama_encode()` calls to prepare token data for the compute graph and KV cache.
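To make the batch-to-ubatch mediation concrete, here is a small sketch of the simplest splitting idea: handing out consecutive chunks of at most `n_ubatch` tokens until the batch is exhausted. `toy_ubatch`, `toy_splitter`, and `toy_chunk_sizes` are hypothetical names for illustration only, and the real `llama_ubatch` carries data pointers rather than an index range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical micro-batch view: a [begin, end) range of token indices.
struct toy_ubatch {
    uint32_t begin = 0;
    uint32_t end   = 0;
    uint32_t n_tokens() const { return end - begin; }
};

// Hypothetical splitter that hands out consecutive chunks of a batch.
class toy_splitter {
public:
    explicit toy_splitter(uint32_t n_tokens) : n_tokens_(n_tokens) {}

    // Returns the next chunk of at most n_ubatch tokens.
    // An empty chunk signals that every token has been handed out.
    toy_ubatch split_simple(uint32_t n_ubatch) {
        assert(n_ubatch > 0);
        toy_ubatch ub;
        ub.begin = n_used_;
        ub.end   = std::min(n_tokens_, n_used_ + n_ubatch);
        n_used_  = ub.end;
        return ub;
    }

private:
    uint32_t n_tokens_;    // total tokens in the batch
    uint32_t n_used_ = 0;  // tokens already handed out
};

// Drives the splitter to completion and records each chunk size.
std::vector<uint32_t> toy_chunk_sizes(uint32_t n_tokens, uint32_t n_ubatch) {
    toy_splitter sp(n_tokens);
    std::vector<uint32_t> sizes;
    while (true) {
        const toy_ubatch ub = sp.split_simple(n_ubatch);
        if (ub.n_tokens() == 0) break;  // batch exhausted
        sizes.push_back(ub.n_tokens());
    }
    return sizes;
}
```

With 10 tokens and `n_ubatch = 4`, this yields chunks of 4, 4, and 2 tokens. The splitter keeps its own cursor, which is why the driving loop stops on an empty chunk rather than on a token count.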

Code Reference

Source Location

Repository: Ggml_org_Llama_cpp
File: src/llama-batch.cpp
Lines: 1-917

Signature

class llama_batch_allocr {
public:
    llama_batch_allocr(uint32_t n_pos_per_embd);

    // Initialize and validate a batch
    bool init(
        const llama_batch & batch_inp,
        const llama_vocab & vocab,
        const llama_memory_i * memory,
        uint32_t n_embd,
        uint32_t n_seq_max,
        bool output_all);

    // Splitting strategies
    llama_ubatch split_simple(uint32_t n_ubatch);
    llama_ubatch split_equal(uint32_t n_ubatch);
    llama_ubatch split_seq(uint32_t n_ubatch);

    // State queries
    bool get_ubatch(llama_ubatch & ubatch) const;
    int64_t n_tokens() const;
    void clear();

private:
    uint32_t n_pos_per_embd;
    int debug;
    llama_batch batch;
    const llama_vocab * vocab;
    std::vector<std::set<llama_pos>> seq_pos;
    std::vector<std::vector<bool>> seq_cpl;
    std::vector<int32_t> seq_idx;
    // ... additional internal state
};
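The `seq_pos` member in the listing above stores one ordered set of positions per sequence. As a rough illustration of what such a structure makes cheap, the sketch below reads the minimum and maximum position from the ends of the set and checks whether the positions form a gap-free range. `toy_seq_pos` and its methods are hypothetical, and returning -1 for an empty sequence is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

using toy_pos = int32_t;  // hypothetical alias standing in for llama_pos

// Hypothetical tracker: one ordered set of positions per sequence ID.
struct toy_seq_pos {
    std::vector<std::set<toy_pos>> seq_pos;

    explicit toy_seq_pos(uint32_t n_seq_max) : seq_pos(n_seq_max) {}

    // Records that sequence seq_id contains a token at position p.
    void add(uint32_t seq_id, toy_pos p) {
        assert(seq_id < seq_pos.size());
        seq_pos[seq_id].insert(p);
    }

    // Smallest position seen for the sequence, or -1 if it has no tokens.
    toy_pos pos_min(uint32_t seq_id) const {
        return seq_pos[seq_id].empty() ? -1 : *seq_pos[seq_id].begin();
    }

    // Largest position seen for the sequence, or -1 if it has no tokens.
    toy_pos pos_max(uint32_t seq_id) const {
        return seq_pos[seq_id].empty() ? -1 : *seq_pos[seq_id].rbegin();
    }

    // True when the positions form a gap-free range [min, max].
    // Because a set holds unique values, this reduces to a size comparison.
    bool is_contiguous(uint32_t seq_id) const {
        if (seq_pos[seq_id].empty()) return true;
        const size_t span = (size_t) (pos_max(seq_id) - pos_min(seq_id) + 1);
        return span == seq_pos[seq_id].size();
    }
};
```

An ordered set keeps both extremes available without scanning, and duplicates collapse automatically, which is what makes the size-versus-span comparison valid.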

Import

#include "llama-batch.h"
#include "llama-impl.h"
#include "llama-vocab.h"
#include "llama-memory.h"
#include <cassert>
#include <cstring>
#include <algorithm>
#include <sstream>

I/O Contract

Inputs

Name	Type	Required	Description
batch_inp	llama_batch	Yes	User-provided batch containing tokens, positions, sequence IDs, and output flags
vocab	llama_vocab	Yes	Vocabulary reference for token ID validation
memory	llama_memory_i*	No	Memory interface for sequence position tracking; may be a null pointer
n_embd	uint32_t	Yes	Embedding dimension size
n_seq_max	uint32_t	Yes	Maximum number of sequences allowed
output_all	bool	Yes	Whether to mark all tokens for output (overrides per-token logits flag)
n_ubatch	uint32_t	Yes	Maximum number of tokens per micro-batch; passed to the split methods, not to init()

Outputs

Name	Type	Description
ubatch	llama_ubatch	Micro-batch with data pointers ready for the compute graph
success	bool	Return value of init(): whether batch initialization and validation succeeded
n_tokens	int64_t	Total number of tokens in the validated batch

Usage Examples

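// Internal usage within llama_context::decode() (simplified sketch)
llama_batch_allocr batch_allocr(n_pos_per_embd);

// Validate the user batch and auto-generate any missing fields
bool ok = batch_allocr.init(batch, vocab, memory, n_embd, n_seq_max, output_all);
if (!ok) {
    return -1; // validation failed
}

// Split into micro-batches and process each one.
// n_tokens() reports the total for the whole batch and does not shrink as
// ubatches are produced, so the loop ends on an empty ubatch instead.
while (true) {
    llama_ubatch ubatch = batch_allocr.split_equal(n_ubatch);
    if (ubatch.n_tokens == 0) {
        break; // batch exhausted
    }
    // Process ubatch through compute graph (process_ubatch is a placeholder)
    process_ubatch(ubatch);
}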
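The usage example calls `split_equal`, which the description characterizes as producing equal-length sequence sets. The sketch below shows one way that idea can work, under simplifying assumptions that are not taken from the source: every token belongs to exactly one sequence, each ubatch takes the same number of tokens from every sequence that still has tokens left, and that number is capped so the ubatch never exceeds `n_ubatch` tokens. All names (`toy_equal_ubatch`, `toy_equal_splitter`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical result: n_seq_tokens tokens from each listed sequence.
struct toy_equal_ubatch {
    uint32_t              n_seq_tokens = 0;  // tokens taken per sequence
    std::vector<int32_t>  seq_ids;           // sequences in this ubatch
    std::vector<uint32_t> token_idx;         // batch indices, grouped by sequence
};

class toy_equal_splitter {
public:
    // seq_id[i] is the sequence that token i belongs to.
    explicit toy_equal_splitter(const std::vector<int32_t> & seq_id) {
        for (uint32_t i = 0; i < seq_id.size(); ++i) {
            by_seq_[seq_id[i]].push_back(i);
        }
    }

    // Returns the next ubatch; an empty one means all tokens were consumed.
    toy_equal_ubatch split_equal(uint32_t n_ubatch) {
        toy_equal_ubatch ub;
        uint32_t n_min = UINT32_MAX;

        // pick sequences that still have tokens (at most n_ubatch of them)
        for (const auto & kv : by_seq_) {
            if (ub.seq_ids.size() == n_ubatch) break;
            const uint32_t rem = remaining(kv.first);
            if (rem == 0) continue;
            ub.seq_ids.push_back(kv.first);
            n_min = std::min(n_min, rem);
        }
        if (ub.seq_ids.empty()) return ub;

        // equal length: limited by the shortest sequence and by the budget
        const uint32_t n_seqs = (uint32_t) ub.seq_ids.size();
        ub.n_seq_tokens = std::min(n_min, n_ubatch / n_seqs);
        assert(ub.n_seq_tokens > 0);

        for (int32_t sid : ub.seq_ids) {
            const std::vector<uint32_t> & idx = by_seq_.at(sid);
            uint32_t & used = used_[sid];
            for (uint32_t k = 0; k < ub.n_seq_tokens; ++k) {
                ub.token_idx.push_back(idx[used++]);
            }
        }
        return ub;
    }

private:
    uint32_t remaining(int32_t sid) const {
        const auto it = used_.find(sid);
        const uint32_t used = it == used_.end() ? 0 : it->second;
        return (uint32_t) by_seq_.at(sid).size() - used;
    }

    std::map<int32_t, std::vector<uint32_t>> by_seq_;  // token indices per sequence
    std::map<int32_t, uint32_t>              used_;    // tokens consumed per sequence
};
```

Equal-length groups let every sequence in a ubatch be treated as one row of a rectangular block, at the cost of needing more ubatches when sequence lengths differ widely.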

Related Pages

Principle:Ggml_org_Llama_cpp_Batch_Processing
